TY - GEN
T1 - Adding singing capabilities to unit selection TTS through HNM-based conversion
AU - Freixes, Marc
AU - Socoró, Joan Claudi
AU - Alías, Francesc
N1 - Funding Information:
Marc Freixes acknowledges the support of the European Social Fund (ESF) and the Catalan Government (SUR/DEC) through the pre-doctoral FI grant No. 2016FI_B2 00094. This work has been partially funded by SUR/DEC (grant ref. 2014-SGR-0590). We also thank the participants in the perceptual test and Raúl Montaño for his help with the statistics.
Publisher Copyright:
© Springer International Publishing AG 2016.
PY - 2016
Y1 - 2016
AB - Adding singing capabilities to a corpus-based concatenative text-to-speech (TTS) system can be addressed by explicitly collecting singing samples from the previously recorded speaker. However, this approach is only feasible if that speaker is also a singing talent. As an alternative, we consider appending a Harmonic plus Noise Model (HNM) speech-to-singing conversion module to a Unit Selection TTS (US-TTS) system. Two text-to-speech-to-singing synthesis approaches are studied: applying the speech-to-singing conversion to the US-TTS synthetic output, or implementing a hybrid US+HNM synthesis framework. The perceptual tests show that the speech-to-singing conversion yields a singing resemblance similar to that of the natural version, but with lower naturalness. Moreover, no statistically significant differences are found between the two strategies in terms of either naturalness or singing resemblance. Finally, the hybrid approach reduces the overall computational cost by a factor of more than two.
KW - Harmonic plus noise model
KW - Prosody modification
KW - Speech-to-singing
KW - Text-to-singing
KW - Unit-selection TTS
UR - http://www.scopus.com/inward/record.url?scp=84997269665&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-49169-1_4
DO - 10.1007/978-3-319-49169-1_4
M3 - Conference contribution
AN - SCOPUS:84997269665
SN - 9783319491684
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 33
EP - 43
BT - Advances in Speech and Language Technologies for Iberian Languages - 3rd International Conference, IberSPEECH 2016, Proceedings
A2 - Garcia Mateo, Carmen
A2 - Ortega, Alfonso
A2 - Abad, Alberto
A2 - Mamede, Nuno
A2 - Martínez Hinarejos, Carlos D.
A2 - Teixeira, Antonio
A2 - Batista, Fernando
A2 - Perdigão, Fernando
PB - Springer Verlag
T2 - 3rd International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2016
Y2 - 23 November 2016 through 25 November 2016
ER -