Research Article | Articles in Press

Detecting Vocal Fatigue with Neural Embeddings

Published: February 09, 2023


      Vocal fatigue refers to the feeling of tiredness and weakness of the voice that results from prolonged voice use. This paper investigates the effectiveness of neural embeddings for detecting vocal fatigue. We compare x-vector, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that neural embeddings capture changes in a speaker's vocal characteristics during prolonged voice usage. We show that, after 40 minutes of continuous speaking, vocal fatigue can be reliably predicted with all three types of neural embeddings when temporal smoothing and normalization are applied to the extracted embeddings. Using support vector machines for classification, we achieve accuracy scores of 81% with x-vectors, 85% with ECAPA-TDNN embeddings, and 82% with wav2vec 2.0 embeddings as input features. When the trained system is applied to a different speaker and recording environment without any adaptation, we obtain an accuracy score of 76%.
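The pipeline described above — extract per-segment embeddings, apply temporal smoothing and normalization, then classify with a support vector machine — can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the moving-average window, the RBF kernel, and the randomly generated stand-in embeddings are all assumptions made for the sake of a runnable example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def smooth_embeddings(embeddings, window=5):
    """Moving-average smoothing over consecutive segment embeddings.

    embeddings: array of shape (n_segments, embedding_dim), ordered in time.
    The window size is illustrative, not taken from the paper.
    """
    kernel = np.ones(window) / window
    # Convolve each embedding dimension independently along the time axis.
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, embeddings
    )


# Toy data standing in for extracted embeddings (e.g. 192-dimensional
# ECAPA-TDNN vectors): segments recorded early in a session vs. segments
# recorded after prolonged speaking, with an artificial mean shift.
rng = np.random.default_rng(0)
X_rested = rng.normal(0.0, 1.0, size=(100, 192))
X_fatigued = rng.normal(0.4, 1.0, size=(100, 192))

X = np.vstack([smooth_embeddings(X_rested), smooth_embeddings(X_fatigued)])
y = np.array([0] * 100 + [1] * 100)  # 0 = not fatigued, 1 = fatigued

# Normalization followed by an SVM classifier, mirroring the paper's
# high-level recipe (smoothing + normalization + SVM).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))
```

On real data the embeddings would come from pretrained x-vector, ECAPA-TDNN, or wav2vec 2.0 extractors rather than a random generator, and evaluation would of course use held-out speakers rather than training accuracy.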


      Journal of Voice

