Research Article|Articles in Press

Enhancing the Performance of Pathological Voice Quality Assessment System Through the Attention-Mechanism Based Neural Network

Published: January 31, 2023



      Doctors currently rely mainly on auditory-perceptual evaluation, such as the grade, roughness, breathiness, asthenia, and strain (GRBAS) scale, to assess voice quality and decide on treatment. However, ratings from individual physicians often differ because of subjective perception and the time interval between diagnoses, particularly when a patient's symptoms are hard to judge. An accurate computerized pathological voice quality assessment system would therefore improve the quality of assessment.


      This study proposes a self-attention-based deep learning system, named self-attention-based bidirectional long short-term memory (SA BiLSTM). Different pitches (low, normal, high) and vowels (/a/, /i/, /u/) were supplied to the proposed model so that it could learn, in a high-dimensional view, how professional doctors rate the grade, roughness, breathiness, asthenia, and strain scale.
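      The abstract does not say how pitch and vowel information enters the model or how the attention layer is configured. The following is a minimal sketch under two stated assumptions: the pitch and vowel labels are appended to each frame as one-hot codes, and attention is plain scaled dot-product self-attention with identity projections. The BiLSTM layers and all learned weights are omitted, and every function name and feature value here is hypothetical.

```python
import math

PITCHES = ["low", "normal", "high"]
VOWELS = ["a", "i", "u"]


def one_hot(value, vocabulary):
    """Return a one-hot list marking the position of value in vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def add_condition(frames, pitch, vowel):
    """Append pitch and vowel one-hot codes to every frame (assumed scheme)."""
    tag = one_hot(pitch, PITCHES) + one_hot(vowel, VOWELS)
    return [frame + tag for frame in frames]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with identity projections (Q = K = V).

    A trained model would use learned projection matrices and would attend
    over BiLSTM hidden states rather than raw frames.
    """
    dim = len(sequence[0])
    scale = math.sqrt(dim)
    output = []
    for query in sequence:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in sequence]
        weights = softmax(scores)
        # Each output frame is a weighted average of all input frames.
        context = [sum(w * vec[d] for w, vec in zip(weights, sequence))
                   for d in range(dim)]
        output.append(context)
    return output


# Made-up acoustic features: three frames with two values each.
frames = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
conditioned = add_condition(frames, pitch="high", vowel="i")
attended = self_attention(conditioned)
```

      Each frame grows from 2 to 8 values after conditioning, and the attention output keeps the same shape as its input, so it could feed a per-dimension classifier for grade, roughness, or breathiness.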


      The experimental results showed that the proposed system outperformed the baseline systems. Specifically, the macro-average F1 score, expressed as a decimal, was used to compare classification accuracy. The grade (G), roughness (R), and breathiness (B) scores of the proposed system were 0.768±0.011, 0.820±0.009, and 0.815±0.009, respectively, which are higher than those of the baseline systems: deep neural network (0.395±0.010, 0.312±0.019, and 0.321±0.014) and convolutional neural network (0.421±0.052, 0.306±0.043, and 0.325±0.032).
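      For readers unfamiliar with the metric, the macro-average F1 score is the unweighted mean of the per-class F1 scores, so rare rating levels count as much as common ones. The sketch below computes it in plain Python. The labels are made-up illustrative ratings, not data from the study, and the function name is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative ratings on a three-level scale (not data from the study).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)  # per-class F1: 0.5, 0.8, 2/3
```

      With these labels the per-class F1 scores are 0.5, 0.8, and 2/3, giving a macro average of about 0.656. The same quantity is available in scikit-learn as `f1_score(y_true, y_pred, average="macro")`.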


      The proposed system, which combines SA BiLSTM with pitch and vowel information, provides a more accurate way to evaluate the voice. It should be helpful for clinical voice evaluation and should increase the benefit patients gain from voice therapy.



