ISSN: 1304-7191 | E-ISSN: 1304-7205
Parabolic Filter Mel Frequency Cepstral Coefficient and Fusion of Features for Speaker Age Classification
Kocaeli University, Department of Electronics and Communication Engineering, KOCAELI
Sigma J Eng Nat Sci 2020; 38(4): 2177-2191

Abstract

Speech is an acoustic signal initiated at the inner end of the human vocal tract and radiated as an audio wave at the outer end. The structure and length of the vocal tract cause features extracted from utterances with similar content to differ between speakers. As a person grows, the vocal tract changes in length, which in turn gradually modifies speech characteristics. The mel frequency cepstral coefficient (MFCC), which uses triangular band-pass filter banks, is widely regarded as the most popular feature in speech processing applications. To improve the accuracy of speaker age classification, a new spectral feature set named parabolic filter mel frequency cepstral coefficient (PFMFCC) is proposed in this study. PFMFCC uses parabolic band-pass filter banks instead of triangular ones. The feature extraction technique uses 30 parabolic band-pass filters to extract 42 features from each 20 ms speech frame. These features are applied to three classical classifiers: the Gaussian mixture model (GMM), cosine score, and probabilistic linear discriminant analysis (PLDA). The aGender database, which consists of 47 hours of German speech uttered by a total of 852 speakers, is used in this study. On the female dataset, the new PFMFCC feature achieved accuracies of 51.01%, 56.01% and 58.14% with the cosine score, GMM and PLDA classifiers, respectively. On the male dataset, it achieved accuracies of 50.44%, 52.74% and 57.23% with the same three classifiers, respectively. Using feature fusion of seven feature sets, overall accuracies of 60.18%, 52.17% and 56.35% are obtained with the cosine score, GMM and PLDA classifiers, respectively, over all seven speaker age classes. With the cosine score classifier, feature fusion improved the overall accuracy by 2.55% compared to a previous speaker age classification study carried out on the same database.
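The difference between triangular and parabolic band-pass filters can be illustrated with a minimal sketch. The abstract does not give the parabolic filter equation, so the shape used here (a downward parabola equal to 1 at the band centre and 0 at the two neighbouring band edges) is an assumption, as are the 8 kHz sampling rate, the 256-point FFT, the Hamming window, and the 13 cepstral coefficients. Only the 30 filters and the 20 ms frame length come from the text, and the sketch does not attempt to reproduce the 42-dimensional feature vector. The function names `mel_filter_bank` and `cepstra` are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_filters=30, n_fft=256, sample_rate=8000, shape="parabolic"):
    """Mel-spaced filter bank; sample_rate and n_fft are assumed values."""
    # n_filters + 2 band edges, equally spaced on the mel scale.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, centre, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        # Normalised distance from the centre: 0 at the centre, 1 at a band edge.
        d = np.where(freqs <= centre,
                     (centre - freqs) / (centre - lo),
                     (freqs - centre) / (hi - centre))
        inside = (freqs >= lo) & (freqs <= hi)
        # Assumed parabolic shape: 1 - d^2; the triangular shape is 1 - d.
        weight = 1.0 - d ** 2 if shape == "parabolic" else 1.0 - d
        bank[i] = np.where(inside, weight, 0.0)
    return bank

def cepstra(frame, bank, n_ceps=13):
    """Log filter-bank energies followed by an unnormalised DCT-II."""
    n_fft = 2 * (bank.shape[1] - 1)
    power = np.abs(np.fft.rfft(frame * np.hamming(frame.size), n=n_fft)) ** 2
    log_energy = np.log(bank @ power + 1e-10)  # small floor avoids log(0)
    n = log_energy.size
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ log_energy

# A 20 ms frame at the assumed 8 kHz rate is 160 samples.
frame = np.random.default_rng(0).standard_normal(160)
features = cepstra(frame, mel_filter_bank(shape="parabolic"))
```

Under this assumed shape, each parabolic filter weights frequencies between the band centre and the band edges more heavily than the corresponding triangular filter does, since 1 - d^2 >= 1 - d for d between 0 and 1.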