TY - JOUR
T1 - Towards high-quality next-generation text-to-speech synthesis
T2 - A multidomain approach by automatic domain classification
AU - Alías, Francesc
AU - Sevillano, Xavier
AU - Socoró, Joan Claudi
AU - Gonzalvo, Xavier
N1 - Funding Information:
Manuscript received June 29, 2007; revised November 13, 2007. Published August 13, 2008 (projected). This work was supported in part by the IntegraTV-4all project under Grant FIT-350301-2004-2 of the Spanish Science and Technology Council. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Steve Renals.
PY - 2008/9
Y1 - 2008/9
N2 - This paper is a contribution to the recent advancements in the development of high-quality next generation text-to-speech (TTS) synthesis systems. Two of the hottest research topics in this area are oriented towards the improvement of speech expressiveness and flexibility of synthesis. In this context, this paper presents a new TTS strategy called multidomain TTS (MD-TTS) for synthesizing among different domains. Although the multidomain philosophy has been widely applied in spoken language systems, few research efforts have been conducted to extend it to the TTS field. To do so, several proposals are described in this paper. First, a text classifier (TC) is included in the classic TTS architecture in order to automatically conduct the selection of the most appropriate domain for synthesizing the input text. In contrast to classic topic text classification tasks, the MD-TTS TC should not only consider the contents of text but also its structure. To this end, this paper introduces a new text modeling scheme based on an associative relational network, which represents texts as a directional weighted word-based graph. The conducted experiments validate the proposal in terms of both objective (TC efficiency) and subjective (perceived synthetic speech quality) evaluation criteria.
AB - This paper is a contribution to the recent advancements in the development of high-quality next generation text-to-speech (TTS) synthesis systems. Two of the hottest research topics in this area are oriented towards the improvement of speech expressiveness and flexibility of synthesis. In this context, this paper presents a new TTS strategy called multidomain TTS (MD-TTS) for synthesizing among different domains. Although the multidomain philosophy has been widely applied in spoken language systems, few research efforts have been conducted to extend it to the TTS field. To do so, several proposals are described in this paper. First, a text classifier (TC) is included in the classic TTS architecture in order to automatically conduct the selection of the most appropriate domain for synthesizing the input text. In contrast to classic topic text classification tasks, the MD-TTS TC should not only consider the contents of text but also its structure. To this end, this paper introduces a new text modeling scheme based on an associative relational network, which represents texts as a directional weighted word-based graph. The conducted experiments validate the proposal in terms of both objective (TC efficiency) and subjective (perceived synthetic speech quality) evaluation criteria.
KW - Speech synthesis
KW - Text processing
UR - http://www.scopus.com/inward/record.url?scp=70350568926&partnerID=8YFLogxK
U2 - 10.1109/TASL.2008.925145
DO - 10.1109/TASL.2008.925145
M3 - Article
AN - SCOPUS:70350568926
SN - 1558-7916
VL - 16
SP - 1340
EP - 1354
JO - IEEE Transactions on Audio, Speech and Language Processing
JF - IEEE Transactions on Audio, Speech and Language Processing
IS - 7
ER -