TY - JOUR
T1 - Gammatone cepstral coefficients
T2 - Biologically inspired features for non-speech audio classification
AU - Valero, Xavier
AU - Alias, Francesc
PY - 2012
Y1 - 2012
N2 - In the context of non-speech audio recognition and classification for multimedia applications, it becomes essential to have a set of features able to accurately represent and discriminate among audio signals. Mel frequency cepstral coefficients (MFCC) have become a de facto standard for audio parameterization. Taking as a basis the MFCC computation scheme, the Gammatone cepstral coefficients (GTCCs) are a biologically inspired modification employing Gammatone filters with equivalent rectangular bandwidth bands. In this letter, the GTCCs, which have been previously employed in the field of speech research, are adapted for non-speech audio classification purposes. Their performance is evaluated on two audio corpora of 4 h each (general sounds and audio scenes), following two cross-validation schemes and four machine learning methods. According to the results, classification accuracies are significantly higher when employing GTCC rather than other state-of-the-art audio features. As a detailed analysis shows, with a similar computational cost, the GTCC are more effective than MFCC in representing the spectral characteristics of non-speech audio signals, especially at low frequencies.
KW - Audio classification
KW - Gammatone cepstral coefficients
KW - audio scene recognition
KW - environmental sound
KW - feature extraction
UR - http://www.scopus.com/inward/record.url?scp=84870567358&partnerID=8YFLogxK
U2 - 10.1109/TMM.2012.2199972
DO - 10.1109/TMM.2012.2199972
M3 - Article
AN - SCOPUS:84870567358
SN - 1520-9210
VL - 14
SP - 1684
EP - 1689
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 6
ER -