Detection of Vocal Fold Image Obstructions in High-Speed Videoendoscopy During Connected Speech in Adductor Spasmodic Dysphonia: A Convolutional Neural Networks Approach



      Adductor spasmodic dysphonia (AdSD) is a neurogenic voice disorder that affects intrinsic laryngeal muscle control. AdSD causes involuntary laryngeal spasms and manifests only during connected speech. Laryngeal high-speed videoendoscopy (HSV) coupled with a flexible fiberoptic endoscope provides a unique opportunity to study voice production and visualize vocal fold vibrations in AdSD during speech. The goal of this study is to automatically detect instances in which the image of the vocal folds is optically obstructed in HSV recordings obtained during connected speech.


      HSV data were recorded from vocally normal adults and patients with AdSD during reading of the “Rainbow Passage” and six CAPE-V sentences, and during production of the vowel /i/. A convolutional neural network was developed and trained as a classifier to label the vocal folds in each HSV frame as obstructed or unobstructed. Manually labelled data were used to train, validate, and test the network. In addition, a comprehensive robustness evaluation was conducted to compare the performance of the developed classifier with visual analysis of the HSV data.
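As a rough illustration of frame-level classification, the sketch below runs the forward pass of a toy convolutional classifier on a single grayscale frame: convolution, ReLU, global average pooling, and a sigmoid output. It is not the network developed in this study, whose architecture is not given in the abstract. The function names, the tiny 4x6 frames, the kernels, and the weights are all hypothetical and hand-picked for the example, not trained.

```python
import math

def conv2d_valid(image, kernel):
    """Valid (no padding) 2D cross-correlation of a grayscale image with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def global_avg_pool(feature_map):
    """Collapse a feature map to its mean activation."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

def predict_obstructed(frame, kernels, weights, bias, threshold=0.5):
    """Return (probability, label); label True means 'obstructed'."""
    features = [global_avg_pool(relu(conv2d_valid(frame, k))) for k in kernels]
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, prob >= threshold

# Hypothetical, hand-picked parameters (assumption: NOT trained values).
# Two 1x2 kernels respond to bright-to-dark and dark-to-bright edges,
# such as the borders of a dark glottal gap.
kernels = [[[1.0, -1.0]], [[-1.0, 1.0]]]
weights = [-20.0, -20.0]  # strong edges lower the "obstructed" score
bias = 2.0

# Toy frames: a dark vertical gap in bright tissue vs. a uniform blocked view.
clear_frame = [[0.8, 0.8, 0.1, 0.1, 0.8, 0.8] for _ in range(4)]
blocked_frame = [[0.5] * 6 for _ in range(4)]

clear_prob, clear_label = predict_obstructed(clear_frame, kernels, weights, bias)
blocked_prob, blocked_label = predict_obstructed(blocked_frame, kernels, weights, bias)
print(clear_label, blocked_label)
```

In practice the kernels and weights would be learned from the manually labelled frames, and a deep learning framework would replace these hand-written loops.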


      The developed convolutional neural network automatically detected vocal fold obstructions in HSV data from both vocally normal participants and patients with AdSD. The trained network achieved an overall classification accuracy of 94.18% on the testing dataset. In the robustness evaluation, it achieved an average overall accuracy of 94.81% on a large number of HSV frames, indicating that the technique is robust while maintaining high accuracy.
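Overall classification accuracy is the fraction of frames for which the classifier's label matches the manual label. The sketch below shows that calculation together with the underlying confusion counts. The label lists are made-up illustrative data, not data from this study, and the function names are hypothetical.

```python
def overall_accuracy(manual_labels, predicted_labels):
    """Fraction of frames where the prediction matches the manual label."""
    if len(manual_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not manual_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(m == p for m, p in zip(manual_labels, predicted_labels))
    return correct / len(manual_labels)

def confusion_counts(manual_labels, predicted_labels):
    """Count outcomes, treating 1 ('obstructed') as the positive class."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for m, p in zip(manual_labels, predicted_labels):
        if m == 1 and p == 1:
            counts["tp"] += 1
        elif m == 0 and p == 0:
            counts["tn"] += 1
        elif m == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts

# Made-up labels for ten frames: 1 = obstructed, 0 = unobstructed.
manual = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]

accuracy = overall_accuracy(manual, predicted)
counts = confusion_counts(manual, predicted)
print(f"accuracy = {accuracy:.2%}", counts)
```

Reporting the confusion counts alongside accuracy helps when obstructed and unobstructed frames are not equally frequent, since accuracy alone can hide errors in the rarer class.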


      The proposed approach can be used for efficient analysis of HSV data to study laryngeal maneuvers in patients with AdSD during connected speech. Additionally, this method will facilitate the development of vocal fold vibratory measures for HSV frames with an unobstructed view of the vocal folds. Identifying the parts of connected speech that provide an unobstructed view of the vocal folds can inform the design of optimal passages for precise HSV examination during connected speech and of subject-specific clinical voice assessment protocols.
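One way to turn per-frame classifier output into usable parts of a recording is to group consecutive unobstructed frames into segments and keep only those long enough for vibratory analysis. The sketch below is a hypothetical post-processing step, not a procedure described in this abstract. The function name, the `min_length` parameter, and the example flags are assumptions.

```python
def unobstructed_segments(obstructed_flags, min_length=1):
    """Return (first_frame, last_frame) index pairs for runs of unobstructed frames.

    obstructed_flags: per-frame labels, truthy = obstructed.
    min_length: shortest run, in frames, that is kept (hypothetical parameter).
    """
    segments = []
    start = None  # index where the current unobstructed run began
    for idx, flag in enumerate(obstructed_flags):
        if not flag and start is None:
            start = idx
        elif flag and start is not None:
            if idx - start >= min_length:
                segments.append((start, idx - 1))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(obstructed_flags) - start >= min_length:
        segments.append((start, len(obstructed_flags) - 1))
    return segments

# Made-up classifier output for 13 frames: 1 = obstructed, 0 = unobstructed.
flags = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0]

all_runs = unobstructed_segments(flags)
long_runs = unobstructed_segments(flags, min_length=3)
print(all_runs, long_runs)
```

Frame indices can be converted to time by dividing by the recording frame rate, which would then allow the segments to be aligned with the words of the passage being read.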


