TY - JOUR
T1 - MRI-Based Vocal Tract Representations for the Three-Dimensional Finite Element Synthesis of Diphthongs
AU - Arnela, Marc
AU - Dabbaghchian, Saeed
AU - Guasch, Oriol
AU - Engwall, Olov
N1 - Funding Information:
Manuscript received March 21, 2019; revised July 15, 2019 and September 10, 2019; accepted September 14, 2019. Date of publication September 19, 2019; date of current version October 3, 2019. This work was supported by the Agencia Estatal de Investigación (AEI) and FEDER, EU, through project GENIOVOX TEC2016-81107-P. The work of O. Guasch was supported in part by l’Obra Social de la Caixa and the Universitat Ramon Llull under Grant 2018-URL-IR2nQ-031. The Associate Editor coordinating the review of this manuscript and approving it for publication was Dr. Heiga Zen. (Corresponding author: Marc Arnela.) M. Arnela and O. Guasch are with the GTM - Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, 08022 Barcelona, Spain (e-mail: marc.arnela@salle.url.edu; oriol.guasch@salle.url.edu).
Publisher Copyright:
© 2014 IEEE.
PY - 2019/12
Y1 - 2019/12
N2 - The synthesis of diphthongs in three dimensions (3D) involves the simulation of acoustic waves propagating through a complex 3D vocal tract geometry that deforms over time. Accurate 3D vocal tract geometries can be extracted from Magnetic Resonance Imaging (MRI), but due to long acquisition times, only static sounds can currently be studied with an adequate spatial resolution. In this work, 3D dynamic vocal tract representations are built to generate diphthongs, based on a set of cross-sections extracted from MRI-based vocal tract geometries of static vowel sounds. A diphthong can then be easily generated by interpolating the location, orientation and shape of these cross-sections, thus avoiding the interpolation of full 3D geometries. Two options are explored to extract the cross-sections. The first one is based on an adaptive grid (AG), which extracts the cross-sections perpendicular to the vocal tract midline, whereas the second one resorts to a semi-polar grid (SPG) strategy, which fixes the cross-section orientations. The finite element method (FEM) has been used to solve the mixed wave equation and synthesize diphthongs [ɑi] and [ɑu] in the dynamic 3D vocal tracts. The outputs from a 1D acoustic model based on the Transfer Matrix Method have also been included for comparison. The results show that the SPG and AG provide very close solutions in 3D, whereas significant differences are observed when using them in 1D. The SPG dynamic vocal tract representation is recommended for 3D simulations because it helps to prevent the collision of adjacent cross-sections.
KW - Finite Element Method
KW - Vocal tract acoustics
KW - adaptive grid
KW - diphthongs
KW - semi-polar grid
KW - speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=85073632242&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2019.2942439
DO - 10.1109/TASLP.2019.2942439
M3 - Article
AN - SCOPUS:85073632242
SN - 2329-9290
VL - 27
SP - 2173
EP - 2182
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
IS - 12
M1 - 8844836
ER -