Research Article|Articles in Press

GBNF-VAE: A Pathological Voice Enhancement Model Based on Gold Section for Bottleneck Feature With Variational Autoencoder



      Speech enhancement is a promising technique for improving the quality of degraded speech signals. Current work focuses mainly on separating normal speech from noise and neglects the low quality of impaired speech caused by anomalous glottal airflow. To enhance pathological speech effectively, a separation mechanism is needed that extracts high-dimensional timbre features and speech features separately while suppressing low-dimensional noise.


      In this paper, we propose an enhancement model, GBNF-VAE, that extracts timbre efficiently by reducing interference from anomalous airflow noise and combines semantic features with timbre features to synthesize the enhanced speech. Specifically, a bottleneck feature characterizes the timbre, with the number of bottleneck nodes set by the Golden Section method, which improves computational efficiency. A variational autoencoder is adopted to extract the semantic features.
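The abstract does not say how the Golden Section method is applied to the bottleneck layer. One plausible reading is a golden-section search over candidate node counts, which the sketch below illustrates. It assumes the validation cost is unimodal in the node count; the function name `golden_section_nodes`, the search bounds, and the toy cost function are all hypothetical and are not taken from the paper.

```python
import math

# Reciprocal of the golden ratio, about 0.618.
INV_PHI = (math.sqrt(5) - 1) / 2


def golden_section_nodes(cost, low, high):
    """Return an integer in [low, high] that minimizes a unimodal cost.

    `cost` is a hypothetical callable mapping a bottleneck node count to a
    validation cost (for example, reconstruction error after training).
    """
    cache = {}

    def f(n):
        # Each evaluation may mean training a network, so cache results.
        if n not in cache:
            cache[n] = cost(n)
        return cache[n]

    a, b = low, high
    while b - a > 2:
        # Two interior probe points placed by the golden ratio.
        c = round(b - INV_PHI * (b - a))
        d = round(a + INV_PHI * (b - a))
        if c == d:
            # Rounding can merge the probes on short intervals.
            d = c + 1
        if f(c) <= f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    # At most three candidates remain; pick the best directly.
    return min(range(a, b + 1), key=f)


if __name__ == "__main__":
    # Toy stand-in for a real validation cost, with its minimum at 37 nodes.
    def toy_cost(n):
        return (n - 37) ** 2

    print(golden_section_nodes(toy_cost, 8, 256))
```

Each iteration shrinks the interval by a factor of roughly 0.618, so a range of a few hundred candidate sizes needs only about a dozen cost evaluations instead of an exhaustive sweep. That is one way such a method could improve computational efficiency.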


      Finally, spectrum observation, objective indicators, and subjective evaluation all show that GBNF-VAE performs strongly in enhancing pathological speech quality.

      Journal of Voice

