Comparison of Acoustic Voice Features Derived From Mobile Devices and Studio Microphone Recordings

Open Access | Published: November 12, 2022 | DOI: https://doi.org/10.1016/j.jvoice.2022.10.006

      Summary

      Objectives/Hypothesis

      Improvements in mobile device technology offer new opportunities for remote monitoring of voice for home and clinical assessment. However, there is a need to establish equivalence between features derived from signals recorded with mobile devices and with gold standard microphone-preamplifier systems. In this study, acoustic voice features from Android smartphone, tablet, and microphone-preamplifier recordings were compared.

      Methods

      Data were recorded from 37 volunteers (20 female) with no history of speech disorder and six volunteers with Huntington's disease (HD) during sustained vowel (SV) phonation, reading passage (RP), and five syllable repetition (SR) tasks. The following features were estimated: fundamental frequency median and standard deviation (F0 and SD F0), harmonics-to-noise ratio (HNR), local jitter, relative average perturbation of jitter (RAP), five-point period perturbation quotient (PPQ5), difference of differences of amplitude and periods (DDA and DDP), shimmer, and amplitude perturbation quotients (APQ3, APQ5, and APQ11).

      Results

      Bland-Altman analysis revealed good agreement between the microphone and mobile devices for fundamental frequency, jitter, RAP, PPQ5, and DDP during all tasks, and a bias for HNR, shimmer, and its variants (APQ3, APQ5, APQ11, and DDA). Significant differences were observed between devices for HNR, shimmer, and its variants for all tasks. High correlation was observed between devices for all features, except SD F0 for RP. Similar results were observed in the HD group for the SV and SR tasks. Biological sex had a significant effect on F0 and HNR during all tasks, and on jitter, RAP, PPQ5, DDP, and shimmer for RP and SR. No significant effect of age was observed.

      Conclusions

      Mobile devices provided good agreement with state-of-the-art, high-quality microphones during structured speech tasks for features derived from the frequency components of the audio recordings. Caution should be taken when estimating HNR, shimmer, and its variants from recordings made with mobile devices.

      INTRODUCTION

      Mobile devices are widely available, with the majority of the global population having at least one device, such as a smartphone or tablet. Portable, easy-to-use technology has been making its way into health research in recent years, through the use of notification reminders [Hou MY, Hurwitz S. Using daily text-message reminders to improve adherence with.], passive health monitoring [Reyes BA, Reljin N, Kong Y, et al. Tidal volume and instantaneous respiration rate estimation using a volumetric surrogate signal acquired via a smartphone camera.; Nam Y, Kong Y, Reyes B, et al. Monitoring of heart and breathing rates using dual cameras on a smartphone.; Doheny EP, Lowery MM, Russell A, et al. Estimation of respiration rate and sleeping position using a wearable accelerometer.], or active engagement with participants [Ginis P, Nieuwboer A, Dorfman M, et al. Feasibility and effects of home-based smartphone-delivered automated feedback training for gait in people with Parkinson's disease: a pilot randomized controlled trial.; Larson EC, Goel M, Boriello G, et al. SpiroSmart: using a microphone to measure lung function on a mobile phone.; Vatanparvar K, Nathan V, Nemati E, et al. SpeechSpiro: lung function assessment from speech pattern as an alternative to spirometry for mobile health tracking.].
      The recent COVID-19 pandemic has highlighted the need to further harness the potential of mobile devices, enabling research and healthcare monitoring to be conducted from a distance when necessary. Although there is still some resistance to the real-world feasibility of mobile device applications [Horin AP, McNeely ME, Harrison EC, et al. Usability of a daily mHealth application designed to address mobility, speech and dexterity in Parkinson's disease.], the evidence indicates that healthcare workers [Wilson R, Small J. Care staff perspectives on using mobile technology to support communication in long-term care: mixed methods study.] and patients [Hussein WF, et al. The mobile health readiness of people receiving in-center hemodialysis and home dialysis.] are ready to take advantage of this opportunity to improve care and communication in hospitals, clinics, and homes.
      The quality of microphones embedded in mobile devices has improved with advances in technology, offering a low-cost and accessible alternative to the studio microphones traditionally used for speech analysis. Changes in acoustic voice features can be used to quantify speech impairment in neurodegenerative disorders including Parkinson's disease (PD) and Huntington's disease (HD) [Volkmann J, Hefter H, Lange HW, et al. Impairment of temporal organization of speech in basal ganglia diseases.; Scott Kelso JA, Tuller B, Harris KS. A 'dynamic pattern' perspective on the control and coordination of movement.; Smith A, Mcfarland DH, Weber CM. Interactions between speech and finger movements.], and offer the potential for quantitative clinical evaluation of symptoms and disease progression. However, there is a need to ensure the equivalence of acoustic voice features derived from recordings made with microphones in mobile devices before they can be deployed in healthcare applications.
      The use of mobile devices to record structured and free speech has been previously investigated, mainly in healthy individuals [Oliveira G, Fava G, Baglione M, et al. Mobile digital recording: adequacy of the iRig and IOS device for acoustic and perceptual analysis of normal voice.; Vogel AP, Rosen KM, Morgan AT, et al. Comparability of modern recording devices for speech analysis: smartphone, landline, laptop, and hard disc recorder.; Jannetts S, Schaeffler F, Beck J, et al. Assessing voice health using smartphones: bias and random error of acoustic voice parameters captured by different smartphone types.; Zhang C, Jepson K, Lohfink G, et al. Comparing acoustic analyses of speech data collected remotely.; Schaeffler F, Jannetts S, Beck J. Reliability of Clinical Voice Parameters Captured With Smartphones – Measurements of Added Noise and Spectral Tilt.] and with synthetic voice samples [Manfredi C, Lebacq J, Cantarella G, et al. Smartphones offer new opportunities in clinical voice research.; Lebacq J, Schoentgen J, Cantarella G, et al. Maximal ambient noise levels and type of voice material required for valid use of smartphones in clinical voice research.], but also in populations with speech disorders, such as glottic cancer, vocal fold paralysis, and dysphonia [Kim G-H, Kang D-H, Lee Y-Y, et al. Recording quality of smartphone for acoustic analysis.; Maryn Y, Ysenbaert F, Zarowski A, et al. Mobile communication devices, ambient noise, and acoustic voice measures.; van der Woerd B, Wu M, Parsa V, et al. Evaluation of acoustic analyses of voice in nonoptimized conditions.; Uloza V, Ulozaitė-Stanienė N, Petrauskas T, et al. Accuracy of acoustic voice quality index captured with a smartphone – measurements with added ambient noise.].
      No significant differences were reported between recordings of the open vowel [a:] from an iOS smartphone allied with an iRig and the gold standard computer/preamplifier configuration in healthy individuals [Oliveira G, Fava G, Baglione M, et al. Mobile digital recording: adequacy of the iRig and IOS device for acoustic and perceptual analysis of normal voice.].
      A subsequent study in an English-speaking cohort compared acoustic features from audio recorded with both a studio microphone and an iPhone during sustained vowel (SV) and reading passage (RP) tasks [Jannetts S, Schaeffler F, Beck J, et al. Assessing voice health using smartphones: bias and random error of acoustic voice parameters captured by different smartphone types.].
      Acceptable agreement in the mean fundamental frequency across devices was observed; however, the random error for jitter and shimmer was deemed too high for practical applications. Another study in healthy participants, which analyzed SV and RP, showed correlation between features derived from mobile devices, a landline phone, and a head-mounted microphone, but only the fundamental frequency and cepstral peak prominence had an error below 10% when compared with the microphone recordings [Vogel AP, Rosen KM, Morgan AT, et al. Comparability of modern recording devices for speech analysis: smartphone, landline, laptop, and hard disc recorder.].
      Significant bias has also been reported for all acoustic voice features analyzed during SV and RP for a range of Android and iOS devices simultaneously recording voice from 22 speakers [Schaeffler F, Jannetts S, Beck J. Reliability of Clinical Voice Parameters Captured With Smartphones – Measurements of Added Noise and Spectral Tilt.].
      Shimmer and HNR, but not fundamental frequency, were reported to differ across devices in a Flemish Dutch-speaking cohort of both healthy individuals and patients with a variety of speech disorders, when assessing SV and RP [Maryn Y, Ysenbaert F, Zarowski A, et al. Mobile communication devices, ambient noise, and acoustic voice measures.].
      Considering the available evidence, while various approaches have been used to compare the performance of mobile devices and studio microphones, their findings are not always in agreement. This may be due to differences in the language spoken, the devices, and the methods of speech collection or analysis. Additionally, the majority of validation studies presented to date have focused on analyzing acoustic voice features in speech tests comprised of SV and RP tasks; few have included syllable repetition (SR) in the analysis [Bocklet T, Steidl S, Nöth E, et al. Automatic evaluation of Parkinson's speech - acoustic, prosodic and voice related cues.; Portnoy RA, Aronson AE. Diadochokinetic syllable rate and regularity in normal and in spastic and ataxic dysarthric subjects.]. Consequently, it remains unclear whether acoustic voice features recorded using mobile devices provide sufficient agreement with gold standard microphones to be used in remote monitoring and clinical assessment.
      Several studies have analyzed speech in PD using mobile technology [Bocklet T, Steidl S, Nöth E, et al. Automatic evaluation of Parkinson's speech - acoustic, prosodic and voice related cues.; Orozco-Arroyave JR, Vásquez-Correa JC, Klumpp P, et al. Apkinson: the smartphone application for telemonitoring Parkinson's patients through speech, gait and hands movement.; Tsanas A, Little MA, McSharry PE, et al. Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests.; Rusz J, Hlavnička J, Tykalova T, et al. Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson's disease.], including a mobile device validation study [Jeancolas L, Mangone G, Corvol JC, et al. Comparison of telephone recordings and professional microphone recordings for early detection of Parkinson's disease, using mel-frequency cepstral coefficients with Gaussian mixture models.]. However, though quantitative analysis of speech in HD is receiving increasing attention, the reliability of mobile devices for monitoring speech in HD has not been examined to date. Characterizing changes and patterns of speech in those with HD can add to our understanding of this disorder. Such differences, measured reliably with mobile technology, may be used in clinical evaluations of disease progression or person-centered outcomes in future clinical trials of interventions for those with HD. Lessons learned from studying those with HD may have implications for those with other, less well defined dementias, now and into the future.
      The objective of this study was to compare acoustic features extracted from voice recordings using mobile devices (smartphone and tablet) and a high-quality studio microphone allied with an audio interface/preamplifier in a group of control participants and participants with HD. Three speech tests were examined (SV, RP, and SR), and acoustic voice features were extracted, focusing primarily on features related to voice quality, including jitter and shimmer. The results provide guidance on the use of mobile devices for recording speech for the analysis of acoustic voice features during vocal tasks typically used in motor speech assessment.

      MATERIAL AND METHODS

      Data collection

      Thirty-seven healthy adults (aged 34.16 ± 12.75 years; 20 female) and six participants with genetically confirmed HD (aged 53.5 ± 16.78 years; three female and three male, all in the manifest stage of HD) gave written consent to participate in the study, which was approved by the University College Dublin Human Research Ethics Committee, in collaboration with Bloomfield Health Services, Dublin. Voice was recorded simultaneously with a smartphone (Google Pixel 4), a tablet (Samsung Tab S6 Lite), and a headset omnidirectional microphone (6066, DPA, Denmark) with a sound interface (MixPre-3 II, Sound Devices, USA) connected to a laptop. The recordings took place in the university research laboratory for the healthy participants (control group) and in a clinical setting for the HD group, with environmental noise, measured with a sound level meter, always below 45 dBA. Participants were seated wearing the headset microphone, facing the tablet and smartphone, which were placed on tripods at the height of the participant's mouth, approximately 5 to 10 cm away (Figure 1). The equipment and protocol for data collection were selected according to the recommendations for instrumental assessment of voice from the American Speech-Language-Hearing Association [Patel RH, Rita R, Awan SN, et al. Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association Expert Panel to Develop a Protocol for Instrumental Assessment of Vocal Function.].
      Acoustic data were recorded during SV phonation ([a:]) [Fairbanks G. The rainbow passage.; Rusz J, Tykalova T, Ramig LO, et al. Guidelines for speech recording and acoustic analyses in dysarthrias of movement disorders.], reading of the Rainbow Passage [Fairbanks G. The rainbow passage.; Rusz J, Tykalova T, Ramig LO, et al. Guidelines for speech recording and acoustic analyses in dysarthrias of movement disorders.], and SR ([pa], [ta], [ka], [pataka], and [pati]) tasks, commonly known as sequential and alternating motion rate tasks [Rusz J, Tykalova T, Ramig LO, et al. Guidelines for speech recording and acoustic analyses in dysarthrias of movement disorders.]. Participants were asked to sustain the vowel "ah" for as long as they could in one breath. They were instructed to read the Rainbow Passage at their own pace, as they would naturally read aloud. Finally, participants were asked to repeat each syllable as quickly and as clearly as they could for 5 seconds. Participants repeated the tasks two or three times. Audio data were sampled at 24-bit resolution at 44.1 kHz or 48 kHz and saved in uncompressed .wav format.
      FIGURE 1. Schematic diagram of the experimental set-up for recording speech with three devices simultaneously.

      Signal analysis

      Preprocessing

      Speech signals were filtered with a fourth order Butterworth band-pass filter between 10 Hz and 5 kHz and downsampled to 44.1 kHz for consistency across devices. The mean was removed from each signal, 0.5 seconds of data at the start and end of each signal were discarded to avoid edge errors, and the signal was zero-padded for 2 seconds at the start and end. Finally, each signal was normalized with respect to its maximum amplitude.
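The preprocessing chain above can be sketched in a few lines of Python with NumPy and SciPy; the function name, argument order, and resampling details are illustrative assumptions, not the authors' published code.

```python
import numpy as np
from math import gcd
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(x, fs, fs_target=44100):
    """Band-pass, downsample, trim edges, zero-pad, and normalize a speech
    signal, following the steps described in the text (sketch only)."""
    # Fourth order Butterworth band-pass between 10 Hz and 5 kHz
    sos = butter(4, [10, 5000], btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, x)
    # Downsample to 44.1 kHz for consistency across devices
    if fs != fs_target:
        g = gcd(int(fs), fs_target)
        x = resample_poly(x, fs_target // g, int(fs) // g)
        fs = fs_target
    x = x - np.mean(x)                  # remove DC offset
    edge = int(0.5 * fs)                # discard 0.5 s at each end
    x = x[edge:-edge]
    pad = np.zeros(2 * fs)              # 2 s of zero padding at each end
    x = np.concatenate([pad, x, pad])
    return x / np.max(np.abs(x))        # normalize to peak amplitude
```

`sosfiltfilt` applies the filter forward and backward for zero phase distortion; whether the study used zero-phase filtering is not stated, so this is one reasonable choice.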

      Voiced/unvoiced detector

      A voiced/unvoiced detector was developed to ensure that the estimated features corresponded to voiced frames. The Teager-Kaiser energy operator (TKO) [Kaiser JF. On a simple algorithm to calculate the 'energy' of a signal.] was applied to all recordings, followed by a three-sample maximum rolling window and a five-sample mean rolling window. Finally, the recordings were low-pass filtered at 10 Hz with a second order Butterworth filter to extract the signal envelope. The voiced/unvoiced detection methods for each speech task (SV, RP, SR) differed after this step.
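The envelope pipeline described above can be sketched as follows; the handling of the few samples lost to the rolling windows is a simplifying assumption.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def tk_envelope(x, fs):
    """Amplitude envelope via the Teager-Kaiser energy operator (TKO),
    a 3-sample rolling maximum, a 5-sample rolling mean, and a second
    order 10 Hz low-pass Butterworth filter (sketch of the text's steps)."""
    # TKO: psi[n] = x[n]^2 - x[n-1] * x[n+1]
    psi = x[1:-1] ** 2 - x[:-2] * x[2:]
    # Three-sample rolling maximum
    env = np.lib.stride_tricks.sliding_window_view(psi, 3).max(axis=1)
    # Five-sample rolling mean
    env = np.convolve(env, np.ones(5) / 5, mode="valid")
    # Second order low-pass Butterworth at 10 Hz to smooth the envelope
    sos = butter(2, 10, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, env)
```

For a pure tone of amplitude A and normalized frequency Ω, the TKO output is approximately A²sin²(Ω), so the smoothed envelope tracks signal energy and falls to near zero during silence.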
      For the SV task, a threshold was implemented to detect candidates for the onset and offset times (Figure 2). The threshold for the control group was set at 10% of the mean of the Teager-Kaiser envelope, and for the HD group, at 20% of the mean of the envelope. The onset/offset times were defined for each phonation. If the signal contained more than one voice break during phonation, then the interval between each pair of detected voice onset/offset times was calculated and the pair with the maximum interval was chosen.
      FIGURE 2. Sample audio data recorded using the microphone from a representative control subject. The onset and offset times determined using the voiced/unvoiced detector for each speech task are indicated.
      The onset/offset detector was also used to detect the voiced signal during RP, with a threshold of 70% of the mean of the Teager-Kaiser envelope for the control group and 80% for the HD group. All voiced parts of the passage were detected (Figure 2); the unvoiced parts were not included in the analysis.
      For the SR task, the onset/offset detector was deployed with the detection threshold set at the mean of the Teager-Kaiser envelope for both the control and HD groups. Each pair of onset and offset times corresponded to one syllable (Figure 2). A minimum syllable phonation duration of 0.05 seconds was applied to avoid detecting short, high-amplitude noise. Each syllable repetition was analyzed individually without windowing. The first and last syllables were removed to avoid detecting noise at the beginning and end of the recording, unless five or fewer syllable repetitions were detected in the trial.
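The threshold-and-minimum-duration logic for the SR task might look like the sketch below; `detect_syllables` and its defaults are illustrative, and the same function would serve the SV and RP tasks with the task-specific thresholds given above.

```python
import numpy as np

def detect_syllables(env, fs, min_dur=0.05):
    """Return (onset, offset) sample-index pairs where the envelope
    exceeds its mean, discarding intervals shorter than min_dur seconds
    (sketch of the detector described in the text)."""
    above = env > np.mean(env)
    # Rising and falling edges of the thresholded envelope
    edges = np.diff(above.astype(int))
    onsets = np.where(edges == 1)[0] + 1
    offsets = np.where(edges == -1)[0] + 1
    if above[0]:
        onsets = np.insert(onsets, 0, 0)
    if above[-1]:
        offsets = np.append(offsets, len(env))
    # Keep only intervals at least as long as the minimum phonation duration
    return [(on, off) for on, off in zip(onsets, offsets)
            if (off - on) / fs >= min_dur]
```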

      Feature estimation

      For SV, a 0.5-second moving window with 50% overlap was applied to the detected voiced data for feature estimation. Acoustic voice features were estimated for each 0.5-second epoch and averaged across all epochs to obtain the final estimate of each feature. To avoid edge errors, the first and last windows of each signal were discarded if the vowel was sustained for more than 5 seconds in total; otherwise, the entire signal was analyzed.
      For RP, each voiced interval was used as one epoch for feature calculation. Finally, for the SR task, acoustic voice features were estimated for each syllable and the median value across one repetition of the task was taken as the value of that feature.
      The features examined are commonly used in voice assessment in neurodegenerative diseases, such as PD [Tsanas A, Little MA, McSharry PE, et al. Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity.] and HD [Rusz J, Saft C, Schlegel U, et al. Phonatory dysfunction as a preclinical symptom of Huntington disease.], and were compared across devices. The following frequency-based features were estimated for each recording: median and standard deviation of the fundamental frequency (F0 and SD F0), harmonics-to-noise ratio (HNR), local jitter in percentage (jitter), and the jitter variants: relative average perturbation of jitter in percentage (RAP), five-point period perturbation quotient in percentage (PPQ5), and difference of differences of periods in percentage (DDP). In addition, the following amplitude-based features were estimated: shimmer in decibels (shimmer), and the shimmer variants: three-point amplitude perturbation quotient in percentage (APQ3), five-point amplitude perturbation quotient in percentage (APQ5), 11-point amplitude perturbation quotient in percentage (APQ11), and difference of differences of amplitude in percentage (DDA). The features were estimated using custom Python scripts as described below. Following publication, the control data and Python code for estimating acoustic voice features will be made publicly available.
      A number of different methodologies and algorithms have been used to estimate the fundamental frequency of audio signals [Boersma P. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound.; de Cheveigné A, Kawahara H. YIN, a fundamental frequency estimator for speech and music.; Kasi K, Zahorian SA. Yet another algorithm for pitch tracking.; de Cheveigné A. Speech f0 extraction based on Licklider's pitch perception model.]. Here, a 70:30 weighted linear combination was implemented to calculate the fundamental frequency [Tsanas A, Little MA, McSharry PE, et al. Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity.], with 70% comprised of a modified Praat F0 calculation [Boersma P. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound.] and the remaining 30% of a Librosa [McFee B, Raffel C, Liang D, et al. librosa: audio and music signal analysis in python.] implementation of the YIN algorithm [de Cheveigné A, Kawahara H. YIN, a fundamental frequency estimator for speech and music.]. The 70:30 weighting was selected to provide accurate estimates of F0 across epochs of varying length, including shorter epochs for which the YIN algorithm is susceptible to higher variability. The modified version of the Praat algorithm for estimating the fundamental frequency was implemented in Python without the path finder and with a preference for higher frequencies. For estimation of the fundamental frequency, each voiced epoch was divided into 0.04-second windows with 75% overlap. Within each window, candidates for F0 that fell within the frequency range of interest (75 Hz to 600 Hz) were identified, and the candidate representing the lowest lag was selected as F0. The median value of F0 across all windows within each epoch was then taken.
      HNR was also estimated based on the Praat algorithm [Boersma P. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound.], which calculates the heights of the peaks of the autocorrelation function. Each epoch was divided into 0.08-second windows with 75% overlap. The highest-amplitude peaks within the autocorrelation function were identified, the two with the greatest amplitude were chosen, and their difference in amplitude was calculated. If the absolute difference was less than 10%, the algorithm defaulted to the candidate at the higher frequency; otherwise, it chose the candidate with the highest amplitude in the autocorrelation.
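Boersma's autocorrelation definition of HNR can be sketched as follows: if r is the normalized autocorrelation at the pitch period, HNR = 10·log10(r / (1 − r)). The per-window scheme (0.08-second windows with 75% overlap) and the two-candidate selection rule described above are omitted for brevity, so this is a whole-epoch simplification.

```python
import numpy as np

def hnr_autocorr(x, fs, fmin=75.0, fmax=600.0):
    """Harmonics-to-noise ratio (dB) from the normalized autocorrelation
    peak in the lag range corresponding to [fmin, fmax] (simplified)."""
    x = x - np.mean(x)
    n = len(x)
    # Autocorrelation via zero-padded FFT (fast for long epochs)
    spec = np.fft.rfft(x, 2 * n)
    r = np.fft.irfft(spec * np.conj(spec))[:n]
    r = r / r[0]                      # normalize so r[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)
    r_peak = np.max(r[lo:hi + 1])     # fraction of energy that is periodic
    return 10.0 * np.log10(r_peak / (1.0 - r_peak))
```

The finite analysis window biases r_peak slightly below 1 even for a perfectly periodic signal, which is one reason windowed implementations such as Praat's apply lag-domain corrections.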
      Jitter and its variants were estimated from a period sequence formed by taking the inverse of the fundamental frequency for each window within an epoch, as described above. Shimmer and its variants were calculated from the amplitude sequence formed by the normalized amplitude of the signal at each instance of glottal opening and closing, identified for each epoch using the MATLAB toolbox Voicebox [Brookes M. Voicebox: speech processing toolbox for matlab.]. In addition to the effect of device, acoustic voice features estimated using the entire phonation of the vowel were compared with the values obtained by averaging across overlapping 0.5-second epochs.
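Given a period sequence and an amplitude sequence, the perturbation measures reduce to short formulas. The sketch below uses the standard Praat-style definitions of local jitter, shimmer in dB, and PPQ5; it is an illustration of the measures, not the study's Voicebox-based pipeline.

```python
import numpy as np

def jitter_local(periods):
    """Local jitter (%): mean absolute difference between consecutive
    periods, relative to the mean period."""
    p = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def shimmer_db(amplitudes):
    """Shimmer (dB): mean absolute base-10 log ratio of consecutive
    peak amplitudes, scaled by 20."""
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(20.0 * np.abs(np.log10(a[1:] / a[:-1])))

def ppq5(periods):
    """Five-point period perturbation quotient (%): mean deviation of
    each period from its five-point neighborhood average, relative to
    the mean period."""
    p = np.asarray(periods, dtype=float)
    neigh = np.lib.stride_tricks.sliding_window_view(p, 5).mean(axis=1)
    return 100.0 * np.mean(np.abs(p[2:-2] - neigh)) / np.mean(p)
```

RAP follows the same pattern with a three-point neighborhood, and DDP/DDA are proportional to RAP/APQ3 computed on second differences.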

      Statistical analysis

      Bland-Altman analysis [Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement.] was used to examine the agreement between features estimated from acoustic data recorded using the different devices. Linear mixed models (LMM) [Winter B. Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499, 2013.] were implemented in R [Bates D, Mächler M, Bolker B, et al. Fitting linear mixed-effects models using lme4. arXiv:1406.5823, 2014.] for each speech task and cohort group separately, to investigate differences in the acoustic features across devices for each cohort. For all features, device, age, and biological sex were included as fixed effects, and the intercept for participants was included as a random effect. For the SR task, all five syllables were included in the same model, with syllable added as another random effect. P values were obtained by comparing each model with a null model without the effect of device using an ANOVA [Winter B. Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499, 2013.]. The effect of device was investigated, along with the effect of age and sex as a secondary objective. If a significant difference was observed, the False Discovery Rate was used as a post hoc test within the device groups. Bonferroni correction was applied to account for multiple comparisons within each family of speech tasks (SV, RP, and SR), with the threshold for significance set at P < 0.004. Correlation analysis between the microphone and mobile devices was performed using the Spearman rank-order correlation coefficient. Correlations were examined for the control group only, due to the small sample size of the HD group. Threshold levels of significance for correlation coefficients were also adjusted for multiple comparisons using Bonferroni correction.
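The Bland-Altman bias and 95% limits of agreement used throughout the Results reduce to a few lines; `bland_altman` below is an illustrative helper, not the analysis code used in the study.

```python
import numpy as np

def bland_altman(a, b):
    """Bland-Altman agreement between two measurement methods: bias is
    the mean paired difference and the 95% limits of agreement are
    bias +/- 1.96 sample standard deviations of the differences."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = np.mean(diff)
    sd = np.std(diff, ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A nonzero bias with narrow limits (as reported here for HNR and shimmer) indicates a systematic offset between devices, which can in principle be corrected, whereas wide limits indicate random disagreement.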

      RESULTS

      The results for each of the vocal tasks are presented separately in four sections: 3.1 Sustained vowel phonation, 3.2 Reading passage, 3.3 Syllable repetition, and 3.4 Effect of biological sex, age, and epoch size. Within the first three sections, the Bland-Altman analysis is presented first, followed by the results of the linear mixed models and the correlation analysis between devices. Section 3.4 describes the effect of biological sex and age on acoustic voice features across devices, and also reports the effect of the length of the time epoch chosen for analysis in the SV test. Exact P values for all statistical tests are presented in the Appendix.

      Sustained vowel phonation

      Bland-Altman analysis revealed a significant bias in HNR and in shimmer and its variants for both mobile devices when compared with the microphone recording during SV phonation in the control and HD groups, Figure 3.
      FIGURE 3. Bland-Altman plots comparing the gold standard microphone with the mobile devices during SV for the control group: (a) fundamental frequency, (b) standard deviation of the fundamental frequency, (c) harmonics-to-noise ratio, (d) jitter, (e) five-point period perturbation quotient, and (f) shimmer; and for the HD group: (g) fundamental frequency, (h) standard deviation of the fundamental frequency, (i) harmonics-to-noise ratio, (j) jitter, (k) five-point period perturbation quotient, and (l) shimmer. Data points and confidence intervals for the comparison of microphone and smartphone are indicated in blue; data points and confidence intervals for the comparison of microphone and tablet are in green (for interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article).
      Linear mixed model analysis revealed significant differences in HNR and in shimmer and its variants between the microphone and mobile devices for both cohorts, with lower HNR and higher shimmer values observed with the mobile devices, Figure 4. Values for the mean and standard deviation of the estimated acoustic voice features, along with the P values for the linear mixed models, are presented in the Appendix in Tables A.1 and A.2 for the control and HD groups, respectively. Post hoc analysis confirmed significant differences in these features between microphone and smartphone, and between microphone and tablet, but not between smartphone and tablet, Table A.2. A strong correlation was observed between both mobile devices and the studio microphone recordings for all features examined in the control group, Table 1.
      FIGURE 4. Violin plots illustrating the distribution of the estimated features during SV for the control group: (a) fundamental frequency, (b) standard deviation of the fundamental frequency, (c) harmonics-to-noise ratio, (d) jitter, (e) five-point period perturbation quotient, and (f) shimmer; and for the HD group: (g) fundamental frequency, (h) standard deviation of the fundamental frequency, (i) harmonics-to-noise ratio, (j) jitter, (k) five-point period perturbation quotient, and (l) shimmer.
TABLE 1. Correlation Coefficients Rho for All the Features Estimated From Each Task Recorded From the Control Group

| Feature | SV: Mic vs Smartphone | SV: Mic vs Tablet | RP: Mic vs Smartphone | RP: Mic vs Tablet | SR: Mic vs Smartphone | SR: Mic vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 1.00** | 1.00** | 1.00** | 1.00** | 0.98** | 0.96** |
| SD F0 | 0.97** | 0.82** | 0.73** | 0.55* | 0.74** | 0.72** |
| HNR | 0.97** | 0.94** | 0.99** | 0.98** | 0.97** | 0.97** |
| Jitter | 0.93** | 0.82** | 0.83** | 0.72** | 0.89** | 0.89** |
| RAP | 0.91** | 0.83** | 0.94** | 0.84** | 0.90** | 0.89** |
| PPQ5 | 0.90** | 0.81** | 0.89** | 0.80** | 0.88** | 0.84** |
| DDP | 0.91** | 0.83** | 0.93** | 0.84** | 0.90** | 0.89** |
| Shimmer | 0.88** | 0.80** | 0.95** | 0.90** | 0.84** | 0.78** |
| APQ3 | 0.83** | 0.85** | 0.93** | 0.88** | 0.83** | 0.74** |
| APQ5 | 0.88** | 0.83** | 0.90** | 0.86** | 0.91** | 0.85** |
| APQ11 | 0.88** | 0.77** | 0.94** | 0.88** | 0.95** | 0.95** |
| DDA | 0.83** | 0.85** | 0.92** | 0.86** | 0.81** | 0.76** |

* P < 0.004. ** P < 0.001.
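The rho values reported in Table 1 are rank correlations between paired per-recording feature estimates from two devices. As a rough illustration of how such a per-feature comparison can be computed (assuming the Spearman convention implied by "rho"; all data below are synthetic, not the study's measurements):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-participant feature values (e.g., jitter %) estimated
# from microphone and smartphone recordings of the same task.
rng = np.random.default_rng(0)
mic = rng.normal(0.2, 0.06, size=37)            # 37 control participants
phone = mic + rng.normal(0.01, 0.02, size=37)   # correlated device measurement

# Spearman's rho is rank-based, so it tolerates a constant device bias:
# a shifted but monotonically related measurement still correlates highly.
rho, p = spearmanr(mic, phone)
print(f"rho = {rho:.2f}, P = {p:.3g}")
```

This rank-based behavior is why a feature can show a significant Bland-Altman bias yet still correlate strongly across devices.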

      Reading passage

During passage reading, Bland-Altman analysis revealed a significant bias in HNR, jitter, RAP, DDP, and shimmer and its variants between microphone and mobile devices for the control group. Additionally, a significant bias was observed for SD F0 and PPQ5 between microphone and tablet, Figure 5. For the HD group, significant bias was observed for F0, HNR, and shimmer and its variants when comparing microphone and mobile device recordings, with additional bias observed for RAP, PPQ5, and DDP when comparing microphone with smartphone, Figure 5.
FIGURE 5. Bland-Altman plots comparing the gold standard microphone with mobile devices during RP for the control group: (a) fundamental frequency, (b) standard deviation of the fundamental frequency, (c) harmonics-to-noise ratio, (d) jitter, (e) five-point period perturbation quotient, and (f) shimmer; and for the HD group: (g) fundamental frequency, (h) standard deviation of the fundamental frequency, (i) harmonics-to-noise ratio, (j) jitter, (k) five-point period perturbation quotient, and (l) shimmer. Blue markers and confidence intervals refer to the comparison of microphone and smartphone, and green markers and confidence intervals refer to the comparison of microphone and tablet (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.).
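The quantities behind a Bland-Altman comparison (the bias, i.e. the mean of the paired differences, and the 95% limits of agreement) can be computed directly; a minimal sketch with hypothetical paired F0 values (not the study's data):

```python
import numpy as np

def bland_altman(a, b):
    """Return bias and 95% limits of agreement between paired measurements."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diff = a - b
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)  # 95% limits assume ~normal diffs
    return bias, bias - half_width, bias + half_width

# Hypothetical paired F0 estimates (Hz): microphone vs smartphone.
mic = np.array([167.2, 152.9, 180.4, 171.6, 158.3])
phone = np.array([167.9, 153.1, 180.9, 172.3, 158.8])
bias, lower, upper = bland_altman(mic, phone)
print(f"bias = {bias:.2f} Hz, LoA = [{lower:.2f}, {upper:.2f}]")
```

A full Bland-Altman plot additionally plots each pair's difference against the pair's mean, which is what Figures 5 and 7 show for each feature.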
Linear mixed models confirmed an effect of device on all features except F0 in the control group, Figure 6 and Table A.3. Further post hoc analysis revealed differences between microphone and smartphone for HNR and for shimmer and its variants; differences between microphone and tablet were observed for all features that showed an effect of device. In the HD group, significant differences were observed in HNR, PPQ5, shimmer, APQ3, APQ5, and DDA, with post hoc analysis revealing differences between microphone and smartphone, and between microphone and tablet, for HNR, shimmer, APQ3, APQ5, and DDA, Table A.4. For the RP task, correlation analysis revealed that all the features were highly correlated across devices, except SD F0, for the control group, Table 1.
FIGURE 6. Violin plots illustrating the distribution of the estimated features during RP for the control group: (a) fundamental frequency, (b) standard deviation of the fundamental frequency, (c) harmonics-to-noise ratio, (d) jitter, (e) five-point period perturbation quotient, and (f) shimmer; and for the HD group: (g) fundamental frequency, (h) standard deviation of the fundamental frequency, (i) harmonics-to-noise ratio, (j) jitter, (k) five-point period perturbation quotient, and (l) shimmer.

      Syllable repetition

Similar to the sustained phonation task, a significant bias was observed for HNR and for shimmer and its variants when comparing the microphone with mobile devices for the control group, Figure 7. A significant bias was also observed in the control group when comparing jitter, RAP, and DDP from the tablet and microphone recordings. For the HD group, a significant bias with the mobile devices was observed for HNR, PPQ5, and shimmer and its variants.
FIGURE 7. Bland-Altman plots comparing the gold standard microphone with mobile devices during the SR for the control group: (a) fundamental frequency, (b) standard deviation of the fundamental frequency, (c) harmonics-to-noise ratio, (d) jitter, (e) five-point period perturbation quotient, and (f) shimmer; and for the HD group: (g) fundamental frequency, (h) standard deviation of the fundamental frequency, (i) harmonics-to-noise ratio, (j) jitter, (k) five-point period perturbation quotient, and (l) shimmer. Blue markers and confidence intervals refer to the comparison of microphone and smartphone, and green markers and confidence intervals refer to the comparison of microphone and tablet (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.).
A significant effect of device was observed for HNR and for shimmer and its variants in both groups, Figure 8, Tables A.5 and A.6. An effect of device was also observed for RAP in the control group. Post hoc analysis revealed significant differences between all three devices for HNR, shimmer, APQ3, APQ5, and DDA in the control group, and between microphone and tablet for RAP and APQ11. For the HD group, an effect of device was observed between microphone and smartphone, and between microphone and tablet, but not between smartphone and tablet. Correlation analysis confirmed high correlation coefficients between microphone and smartphone, and between microphone and tablet, for features estimated from the control group recordings (P < 0.001), Table 1.
FIGURE 8. Violin plots illustrating the distribution of the estimated features during the SR. Data presented here represent the control group (N = 37) and participants with HD (N = 6), repeating five different syllables. Control group: (a) fundamental frequency, (b) standard deviation of the fundamental frequency, (c) harmonics-to-noise ratio, (d) jitter, (e) five-point period perturbation quotient, and (f) shimmer; and for the HD group: (g) fundamental frequency, (h) standard deviation of the fundamental frequency, (i) harmonics-to-noise ratio, (j) jitter, (k) five-point period perturbation quotient, and (l) shimmer.

      Effect of biological sex, age, and epoch size

An effect of biological sex on the fundamental frequency and on the harmonics-to-noise ratio was observed for the control group (P < 0.004) during the SV task. No significant effect of biological sex was observed for any feature in the HD group during the SV task, Table A.7. An effect of biological sex on F0, HNR, and shimmer and its variants was also observed during the RP task for the control group, with an effect observed only for F0 in the HD group. For the SR task, an effect of biological sex was observed for all features except APQ5, APQ11, and DDA in the control group, Table A.7. No significant effect of biological sex was observed in the HD group.
Age had no significant effect on any of the features estimated from the three speech tasks, SV, RP, and SR, Table A.8. Finally, the effect of epoch length was investigated for the SV task. When the entire signal was analyzed, the results were consistent with those obtained using a moving epoch, with an effect of device observed for HNR, shimmer, and its variants for both groups, Table A.9.
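The moving-epoch analysis above amounts to splitting each recording into overlapping windows and estimating the features per window. A minimal sketch of that segmentation step, with hypothetical epoch and hop lengths (the study's actual values are given in its Methods):

```python
import numpy as np

def sliding_epochs(x, fs, epoch_s=0.5, hop_s=0.25):
    """Split a signal into overlapping epochs of epoch_s seconds,
    advancing by hop_s seconds (illustrative lengths, not the study's)."""
    n, hop = int(epoch_s * fs), int(hop_s * fs)
    return [x[i:i + n] for i in range(0, len(x) - n + 1, hop)]

fs = 8000
x = np.sin(2 * np.pi * 165 * np.arange(fs) / fs)  # 1 s synthetic tone
epochs = sliding_epochs(x, fs)
print(len(epochs), len(epochs[0]))  # number of epochs, samples per epoch
```

Features estimated per epoch can then be summarized (e.g., median F0), whereas the whole-signal analysis computes each feature once over the full recording.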

      DISCUSSION

      In this study, quantitative acoustic voice features extracted from speech recordings made using Android mobile devices were compared with those derived from recordings made using gold standard technology, an omnidirectional microphone allied with a preamplifier. Acoustic voice features were estimated from audio recordings during three speech tasks, sustained phonation of the vowel [a:], reading of the Rainbow Passage, and repetition of syllables in healthy control participants and in a small cohort of participants with HD.
      The results indicate that fundamental frequency could be estimated with similar accuracy across all devices and speech tasks examined for both the control group in a laboratory setting and the HD group in a clinical setting, consistent with a number of previous studies.
(Oliveira G, Fava G, Baglione M, et al. Mobile digital recording: adequacy of the iRig and IOS device for acoustic and perceptual analysis of normal voice; Jannetts S, Schaeffler F, Beck J, et al. Assessing voice health using smartphones: bias and random error of acoustic voice parameters captured by different smartphone types; Rusz J, Hlavnička J, Tykalova T, et al. Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson's disease; Uloza V, Padervinskis E, Vegiene A, et al. Exploring the feasibility of smart phone microphone for measurement of acoustic voice parameters and voice pathology screening.)
No difference in F0 was observed across devices and tasks, with a significant bias observed only for the RP task in the HD group, Figure 5. The observed bias, however, was small: 0.66 Hz between microphone and smartphone, and −0.53 Hz between microphone and tablet, against mean values of 167.24 Hz for the microphone, 167.90 Hz for the smartphone, and 167.77 Hz for the tablet, Table A.4. As the first harmonic in the frequency domain, the fundamental frequency can be clearly detected even in the presence of noise or interference. In contrast, HNR differed consistently across devices for all tasks in the control and HD groups, Figures 3 to 8 and Tables A.1 to A.6. A bias in the HNR recorded using the microphone and smartphone was observed for all tasks in both groups; the bias in HNR was relatively large, at approximately −2 dB in all comparisons, Figures 3, 5, and 7.
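The robustness of F0 to recording conditions can be illustrated with a minimal autocorrelation-based pitch estimator; this is only a sketch (the study's actual pitch extraction is described in its Methods, and the function below is illustrative):

```python
import numpy as np

def estimate_f0(x, fs, fmin=75.0, fmax=500.0):
    """Rough F0 estimate from the peak of the autocorrelation function.
    Production pitch trackers add voicing decisions, windowing, and
    octave-error correction; this only finds the dominant period."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # search plausible lag range
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

# Synthetic 165 Hz vowel-like tone with additive noise: despite the noise,
# the autocorrelation peak (and hence F0) is recovered accurately.
fs = 44100
t = np.arange(int(0.1 * fs)) / fs
x = np.sin(2 * np.pi * 165 * t) + 0.1 * np.random.default_rng(1).normal(size=t.size)
print(round(estimate_f0(x, fs), 1))
```

HNR and shimmer, by contrast, depend on the noise floor and the signal amplitude, which is why they are more sensitive to device and placement.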
TABLE A.1. Mean ± Standard Deviation of Acoustic Voice Features Estimated From the Sustained Vowel Phonation Task, and Their Respective P Values From Linear Mixed Models (LMM) Comparing Microphone, Smartphone, and Tablet Recordings From the Control Group (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | Microphone | Smartphone | Tablet | Mic vs Smartphone | Mic vs Tablet | Smartphone vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 166.38 ± 49.53 | 166.36 ± 49.54 | 167.10 ± 50.03 | 0.995 | 0.995 | 0.995 |
| SD F0 | 1.11 ± 0.45 | 1.19 ± 0.63 | 1.96 ± 2.76 | 0.836 | 0.072 | 0.072 |
| HNR | 20.24 ± 3.25 | 17.44 ± 3.46 | 16.96 ± 3.32 | <0.001* | <0.001* | 0.028 |
| Jitter | 0.19 ± 0.06 | 0.20 ± 0.08 | 0.33 ± 0.48 | 0.930 | 0.053 | 0.053 |
| RAP | 0.08 ± 0.02 | 0.09 ± 0.04 | 0.17 ± 0.28 | 0.905 | 0.035 | 0.035 |
| PPQ5 | 0.12 ± 0.03 | 0.13 ± 0.06 | 0.26 ± 0.47 | 0.915 | 0.056 | 0.056 |
| DDP | 0.24 ± 0.07 | 0.26 ± 0.12 | 0.52 ± 0.84 | 0.905 | 0.035 | 0.035 |
| Shimmer | 0.41 ± 0.25 | 0.54 ± 0.28 | 0.59 ± 0.29 | <0.001* | <0.001* | 0.108 |
| APQ3 | 2.46 ± 1.54 | 3.21 ± 1.74 | 3.40 ± 1.83 | <0.001* | <0.001* | 0.286 |
| APQ5 | 2.87 ± 1.75 | 3.77 ± 1.93 | 3.95 ± 1.84 | <0.001* | <0.001* | 0.312 |
| APQ11 | 3.63 ± 2.03 | 4.78 ± 2.07 | 5.26 ± 2.07 | <0.001* | <0.001* | 0.017 |
| DDA | 7.37 ± 4.62 | 9.63 ± 5.23 | 10.21 ± 5.48 | <0.001* | <0.001* | 0.286 |
TABLE A.2. Mean ± Standard Deviation of Acoustic Voice Features Estimated From the Sustained Vowel Phonation Task, and Their Respective P Values From Linear Mixed Models (LMM) Comparing Microphone, Smartphone, and Tablet Recordings From Participants With HD (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | Microphone | Smartphone | Tablet | Mic vs Smartphone | Mic vs Tablet | Smartphone vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 164.76 ± 35.05 | 164.20 ± 33.97 | 164.37 ± 34.51 | 0.772 | 0.772 | 0.772 |
| SD F0 | 12.27 ± 9.16 | 10.27 ± 5.41 | 11.89 ± 7.82 | 0.718 | 0.718 | 0.718 |
| HNR | 9.25 ± 2.56 | 10.29 ± 2.31 | 9.79 ± 2.38 | 0.002* | 0.001* | 0.574 |
| Jitter | 3.34 ± 2.89 | 2.50 ± 1.62 | 3.07 ± 2.34 | 0.596 | 0.596 | 0.596 |
| RAP | 1.46 ± 1.73 | 0.96 ± 0.97 | 1.26 ± 1.36 | 0.537 | 0.498 | 0.537 |
| PPQ5 | 1.62 ± 2.48 | 1.06 ± 1.21 | 1.43 ± 1.79 | 0.565 | 0.540 | 0.565 |
| DDP | 4.27 ± 5.22 | 2.79 ± 2.90 | 3.73 ± 4.07 | 0.537 | 0.498 | 0.537 |
| Shimmer | 1.07 ± 0.25 | 0.94 ± 0.20 | 1.01 ± 0.24 | 0.026* | 0.003* | 0.272 |
| APQ3 | 4.73 ± 1.36 | 3.89 ± 1.11 | 4.25 ± 1.28 | 0.001* | <0.001* | 0.466 |
| APQ5 | 5.79 ± 1.26 | 4.86 ± 1.05 | 5.23 ± 1.22 | <0.001* | <0.001* | 0.377 |
| APQ11 | 5.20 ± 2.84 | 4.45 ± 2.33 | 4.63 ± 2.62 | <0.001* | <0.001* | 0.343 |
| DDA | 13.60 ± 3.86 | 11.20 ± 3.15 | 12.20 ± 3.75 | 0.001* | <0.001* | 0.466 |
TABLE A.3. Mean ± Standard Deviation of Acoustic Voice Features Estimated From the Reading Passage Task, and Their Respective P Values From Linear Mixed Models (LMM) Comparing Microphone, Smartphone, and Tablet Recordings From the Control Group (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | Microphone | Smartphone | Tablet | Mic vs Smartphone | Mic vs Tablet | Smartphone vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 156.28 ± 42.62 | 156.62 ± 42.17 | 155.96 ± 42.48 | 0.801 | 0.801 | 0.801 |
| SD F0 | 9.06 ± 7.09 | 10.97 ± 8.49 | 11.78 ± 8.36 | 0.078 | 0.001* | 0.099 |
| HNR | 11.05 ± 3.09 | 9.18 ± 2.78 | 8.45 ± 2.67 | <0.001* | <0.001* | <0.001* |
| Jitter | 2.11 ± 1.65 | 2.87 ± 2.33 | 3.57 ± 3.31 | 0.018 | <0.001* | 0.007 |
| RAP | 0.73 ± 1.05 | 1.12 ± 1.30 | 1.57 ± 1.97 | 0.035 | <0.001* | 0.003* |
| PPQ5 | 0.84 ± 1.20 | 1.19 ± 1.60 | 1.61 ± 2.57 | 0.171 | 0.003* | 0.067 |
| DDP | 2.10 ± 3.09 | 3.20 ± 3.87 | 4.41 ± 5.78 | 0.044 | <0.001* | 0.009 |
| Shimmer | 0.90 ± 0.29 | 1.09 ± 0.30 | 1.22 ± 0.33 | <0.001* | <0.001* | <0.001* |
| APQ3 | 3.49 ± 1.32 | 4.61 ± 1.36 | 5.40 ± 1.60 | <0.001* | <0.001* | <0.001* |
| APQ5 | 4.49 ± 1.50 | 5.83 ± 1.44 | 6.61 ± 1.56 | <0.001* | <0.001* | <0.001* |
| APQ11 | 3.96 ± 2.51 | 4.91 ± 3.20 | 5.76 ± 3.47 | 0.001* | <0.001* | 0.002* |
| DDA | 9.92 ± 3.65 | 13.26 ± 3.86 | 15.55 ± 4.52 | <0.001* | <0.001* | <0.001* |
TABLE A.4. Mean ± Standard Deviation of Acoustic Voice Features Estimated From the Reading Passage Task, and Their Respective P Values From Linear Mixed Models (LMM) Comparing Microphone, Smartphone, and Tablet Recordings From Participants With HD (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | Microphone | Smartphone | Tablet | Mic vs Smartphone | Mic vs Tablet | Smartphone vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 167.24 ± 40.83 | 167.90 ± 40.77 | 167.77 ± 40.79 | 0.915 | 0.915 | 0.915 |
| SD F0 | 4.40 ± 1.90 | 4.39 ± 1.19 | 4.13 ± 1.36 | 0.976 | 0.677 | 0.677 |
| HNR | 15.49 ± 4.21 | 11.54 ± 3.67 | 11.25 ± 3.93 | <0.001* | <0.001* | 0.506 |
| Jitter | 0.77 ± 0.22 | 0.95 ± 0.29 | 0.90 ± 0.31 | 0.058 | 0.131 | 0.485 |
| RAP | 0.16 ± 0.05 | 0.27 ± 0.11 | 0.27 ± 0.15 | 0.011 | 0.011 | 0.979 |
| PPQ5 | 0.28 ± 0.10 | 0.39 ± 0.15 | 0.40 ± 0.18 | 0.009 | 0.006 | 0.656 |
| DDP | 0.46 ± 0.14 | 0.79 ± 0.33 | 0.80 ± 0.45 | 0.008 | 0.008 | 0.925 |
| Shimmer | 0.65 ± 0.21 | 0.93 ± 0.19 | 1.02 ± 0.25 | <0.001* | <0.001* | 0.044 |
| APQ3 | 2.87 ± 1.13 | 4.31 ± 1.27 | 4.91 ± 1.72 | <0.001* | <0.001* | 0.036 |
| APQ5 | 3.51 ± 1.44 | 5.45 ± 1.66 | 6.38 ± 2.17 | <0.001* | <0.001* | 0.035 |
| APQ11 | 4.42 ± 1.34 | 6.28 ± 1.99 | 7.12 ± 2.88 | 0.066 | 0.015 | 0.349 |
| DDA | 8.33 ± 3.27 | 12.55 ± 3.72 | 14.38 ± 5.35 | <0.001* | <0.001* | 0.056 |
TABLE A.5. Mean ± Standard Deviation of Acoustic Voice Features Estimated From the Syllable Repetition Task, and Their Respective P Values From Linear Mixed Models (LMM) Comparing Microphone, Smartphone, and Tablet Recordings From the Control Group (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | Microphone | Smartphone | Tablet | Mic vs Smartphone | Mic vs Tablet | Smartphone vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 167.75 ± 46.02 | 169.50 ± 46.82 | 170.64 ± 46.75 | 0.198 | 0.022 | 0.232 |
| SD F0 | 13.86 ± 18.11 | 15.99 ± 21.15 | 15.97 ± 21.57 | 0.199 | 0.199 | 0.969 |
| HNR | 10.08 ± 3.96 | 9.29 ± 3.56 | 8.64 ± 3.64 | <0.001* | <0.001* | <0.001* |
| Jitter | 4.71 ± 6.55 | 5.53 ± 6.96 | 6.09 ± 8.14 | 0.081 | 0.003* | 0.175 |
| RAP | 2.13 ± 3.77 | 2.56 ± 3.78 | 3.00 ± 4.65 | 0.085 | 0.002* | 0.085 |
| PPQ5 | 2.17 ± 3.74 | 2.43 ± 4.05 | 2.54 ± 4.41 | 0.673 | 0.673 | 0.683 |
| DDP | 6.11 ± 10.85 | 7.23 ± 10.85 | 8.20 ± 13.28 | 0.179 | 0.014 | 0.179 |
| Shimmer | 0.87 ± 0.55 | 1.03 ± 0.56 | 1.20 ± 0.78 | <0.001* | <0.001* | <0.001* |
| APQ3 | 3.16 ± 1.93 | 4.35 ± 2.23 | 5.03 ± 2.44 | <0.001* | <0.001* | <0.001* |
| APQ5 | 2.89 ± 2.40 | 4.03 ± 2.92 | 4.65 ± 3.18 | <0.001* | <0.001* | <0.001* |
| APQ11 | 1.60 ± 3.39 | 2.14 ± 3.79 | 2.52 ± 4.61 | 0.047 | 0.002* | 0.188 |
| DDA | 8.46 ± 5.22 | 11.83 ± 6.12 | 13.86 ± 6.93 | <0.001* | <0.001* | <0.001* |
TABLE A.6. Mean ± Standard Deviation of Acoustic Voice Features Estimated From the Syllable Repetition Task, and Their Respective P Values From Linear Mixed Models (LMM) Comparing Microphone, Smartphone, and Tablet Recordings From Participants With HD (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | Microphone | Smartphone | Tablet | Mic vs Smartphone | Mic vs Tablet | Smartphone vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 200.88 ± 81.26 | 207.59 ± 81.86 | 208.24 ± 82.95 | 0.872 | 0.872 | 0.939 |
| SD F0 | 6.93 ± 4.79 | 7.52 ± 5.41 | 6.74 ± 4.80 | 0.747 | 0.747 | 0.747 |
| HNR | 15.20 ± 5.10 | 11.66 ± 4.70 | 11.10 ± 4.89 | <0.001* | <0.001* | 0.062 |
| Jitter | 1.14 ± 1.06 | 1.28 ± 0.88 | 1.30 ± 1.04 | 0.936 | 0.936 | 0.936 |
| RAP | 0.29 ± 0.49 | 0.43 ± 0.40 | 0.54 ± 0.63 | 0.385 | 0.245 | 0.385 |
| PPQ5 | 0.41 ± 0.35 | 0.62 ± 0.51 | 0.61 ± 0.55 | 0.151 | 0.151 | 0.929 |
| DDP | 0.88 ± 1.47 | 1.30 ± 1.21 | 1.58 ± 1.90 | 0.450 | 0.295 | 0.450 |
| Shimmer | 0.65 ± 0.26 | 1.05 ± 0.41 | 1.18 ± 0.67 | <0.001* | <0.001* | 0.129 |
| APQ3 | 2.77 ± 1.40 | 4.89 ± 2.59 | 6.79 ± 2.67 | 0.001* | <0.001* | 0.138 |
| APQ5 | 3.84 ± 1.86 | 6.40 ± 2.77 | 6.79 ± 2.67 | <0.001* | <0.001* | 0.268 |
| APQ11 | 5.20 ± 3.18 | 8.10 ± 3.90 | 8.35 ± 3.86 | <0.001* | <0.001* | 0.714 |
| DDA | 8.28 ± 4.23 | 14.56 ± 7.82 | 17.04 ± 12.23 | 0.002* | <0.001* | 0.131 |
Good agreement across devices was observed for the other frequency-derived features, SD F0 and jitter and its variants (RAP, PPQ5, and DDP), Figures 3, 5, and 7, in partial agreement with previous studies (Oliveira G, Fava G, Baglione M, et al. Mobile digital recording: adequacy of the iRig and IOS device for acoustic and perceptual analysis of normal voice; Jannetts S, Schaeffler F, Beck J, et al. Assessing voice health using smartphones: bias and random error of acoustic voice parameters captured by different smartphone types; Rusz J, Hlavnička J, Tykalova T, et al. Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson's disease).
The exception to this was during SR for the control group, in which significant differences were observed between microphone and tablet for the jitter variant RAP. Though significant bias was present for some features during SR, Figure 7, Bland-Altman analysis, unlike linear mixed models, does not account for repeated measures within subjects or for multiple comparisons; the mean of the repeated measures must therefore be used rather than the individual data points (Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement).
Furthermore, the bias introduced by the tablet was relatively small (1.29%, 0.83%, and 1.96% for jitter, RAP, and DDP, respectively).
      The amplitude-based features, shimmer and its variants, APQ3, APQ5, APQ11, and DDA, were found to be significantly different between microphone and smartphone, and between microphone and tablet recordings for all tasks across both groups, Figures 4, 6, and 8 and Tables A.1 to A.6. Significant bias was observed across all tasks and groups. This result is not unexpected since shimmer provides a measure of the variability in the amplitude of the signal.
(Teixeira JP, Oliveira C, Lopes C. Vocal acoustic analysis – jitter, shimmer and HNR parameters.)
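The amplitude sensitivity of shimmer follows directly from its definition. A minimal sketch of the Praat-style local jitter and shimmer formulas (the mean absolute cycle-to-cycle difference of periods or peak amplitudes, relative to the mean), using hypothetical cycle data rather than the study's measurements:

```python
import numpy as np

def local_jitter(periods):
    """Local jitter (%): mean absolute difference of consecutive pitch
    periods, relative to the mean period."""
    p = np.asarray(periods, float)
    return 100 * np.abs(np.diff(p)).mean() / p.mean()

def local_shimmer(amplitudes):
    """Local shimmer (%): same computation applied to cycle peak
    amplitudes, which makes it sensitive to recording level and to the
    distance between mouth and microphone."""
    a = np.asarray(amplitudes, float)
    return 100 * np.abs(np.diff(a)).mean() / a.mean()

# Hypothetical cycle-to-cycle periods (s) and peak amplitudes.
periods = [0.00602, 0.00605, 0.00601, 0.00604, 0.00603]
amps = [0.81, 0.84, 0.80, 0.83, 0.82]
print(local_jitter(periods), local_shimmer(amps))
```

Because jitter normalizes period differences by the mean period, a quieter or noisier channel leaves it largely intact, whereas shimmer inherits any amplitude fluctuation the channel adds.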
The experimental setup placed the smartphone and tablet at the same height as the participant's mouth; the microphone, however, was head mounted and therefore closer to the source, yielding a higher-amplitude voice signal. Relatively higher levels of environmental noise were thus likely present in the mobile device recordings, even though ambient noise was kept below 45 dBA. Despite the differences observed between devices for the amplitude-based features, correlation analysis revealed a high correlation (rho > 0.7, P < 0.001) for all features examined, apart from SD F0 for the RP task, Table 1. These differences in correlation between RP and the other speech tasks may be due to the more dynamic nature of the task (Robin J, Harrison JE, Kaufman LD, et al. Evaluation of speech-based digital biomarkers: review and recommendations), which would lead to greater variability in the acoustic features extracted from it, especially SD F0, which quantifies the variability of the fundamental frequency.
Previous comparison studies between mobile devices and microphones for acoustic voice analysis of healthy and pathological voices have reported conflicting results. Our results agree with previous findings of high correlation between F0, SD F0, and jitter from microphone and Android smartphone devices (Guidi A, Salvi S, Ottaviano M, et al. Smartphone application for the analysis of prosodic features in running speech with a focus on bipolar disorders: system performance evaluation and case study). However, the analysis in that study was conducted on running speech using a custom-made Android application. Another study, which analyzed SV and RP in a Dutch-speaking population, found that F0 and jitter were consistent across devices while HNR and shimmer differed significantly when comparing a head-mounted microphone with a smartphone and tablet (Maryn Y, Ysenbaert F, Zarowski A, et al. Mobile communication devices, ambient noise, and acoustic voice measures), in agreement with our findings.
Our results are also in partial agreement with previous studies which reported that F0 and jitter are reliable measures across devices during sustained phonation of the vowel [a:] but which, unlike our study, reported that shimmer could also be reliably estimated (Jannetts S, Schaeffler F, Beck J, et al. Assessing voice health using smartphones: bias and random error of acoustic voice parameters captured by different smartphone types; Rusz J, Hlavnička J, Tykalova T, et al. Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson's disease).
Our results indicate that F0 and jitter can be reliably estimated across devices, but caution should be taken when estimating shimmer. This is in partial agreement with earlier results comparing signals recorded with an AKG microphone and a Samsung smartphone during SV [a:], which observed that F0 and shimmer were consistent across devices in a control group, though jitter differed, and that all three features were consistent across devices in a cohort with pathological voices (Uloza V, Padervinskis E, Vegiene A, et al. Exploring the feasibility of smart phone microphone for measurement of acoustic voice parameters and voice pathology screening).
Also in partial agreement with our findings, a previous study found no differences between any of the acoustic voice features estimated from SV [a:] with an iOS iPad and iPhone allied with an iRig when compared with a studio microphone allied with a preamplifier (Oliveira G, Fava G, Baglione M, et al. Mobile digital recording: adequacy of the iRig and IOS device for acoustic and perceptual analysis of normal voice).
This difference in results could be due to the higher performance and signal quality of the iRig, an audio interface for musical instruments. In contrast to our findings, a previous study reported that HNR was consistent when measured with a head-mounted microphone and a smartphone held by the participant (Rusz J, Hlavnička J, Tykalova T, et al. Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson's disease). This difference could be due to several factors, including differences in native language (Czech) or variability in the cohorts examined, which comprised control, PD, and sleep disorder participants.
An effect of biological sex on the fundamental frequency is expected, with males typically having a lower F0 than females (Nittrouer S, McGowan RS, Milenkovic PH, et al. Acoustic measurements of men's and women's voices). This was confirmed by the differences between sexes observed for fundamental frequency during the SV and SR tasks for the control group, Table A.7. These differences were not, however, observed in the SV and SR tasks performed by the participants with HD, Table A.7. The effect of sex may have been more pronounced in RP because it involves the phonation of different words, leading to greater sensitivity to biological sex even with a small sample size; the HD sample may not have been large enough to distinguish between sexes in the highly structured tasks, SV and SR. In addition, while the effect of disease was not investigated in this study, it may have influenced the estimated acoustic features. An effect of biological sex was also present in HNR in the SV task, and in SD F0, HNR, jitter, RAP, PPQ5, DDP, and shimmer in the SR task. No effect of age was observed in any of the features across all tests and cohorts, Table A.8.
TABLE A.7. P Values for the Effect of Biological Sex on the Estimated Features (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | SV: Control | SV: HD | RP: Control | RP: HD | SR: Control | SR: HD |
|---|---|---|---|---|---|---|
| F0 | 0.000* | 0.067 | <0.001* | <0.001* | <0.001* | 0.089 |
| SD F0 | 0.286 | 0.234 | 0.551 | 0.221 | 0.003* | 0.761 |
| HNR | 0.000* | 0.618 | <0.001* | 0.061 | <0.001* | 0.063 |
| Jitter | 0.074 | 0.262 | 0.011 | 0.311 | <0.001* | 0.577 |
| RAP | 0.217 | 0.206 | 0.016 | 0.194 | <0.001* | 0.810 |
| PPQ5 | 0.156 | 0.254 | 0.025 | 0.183 | <0.001* | 0.687 |
| DDP | 0.217 | 0.206 | 0.016 | 0.173 | <0.001* | 0.834 |
| Shimmer | 0.018 | 0.855 | <0.001* | 0.054 | <0.001* | 0.123 |
| APQ3 | 0.043 | 0.631 | <0.001* | 0.018 | 0.001* | 0.082 |
| APQ5 | 0.020 | 0.680 | <0.001* | 0.076 | 0.783 | 0.090 |
| APQ11 | 0.004 | 0.593 | 0.003* | 0.742 | 0.098 | 0.366 |
| DDA | 0.043 | 0.631 | <0.001* | 0.029 | 0.009 | 0.080 |
TABLE A.8. P Values for the Effect of Age on the Estimated Features (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | SV: Control | SV: HD | RP: Control | RP: HD | SR: Control | SR: HD |
|---|---|---|---|---|---|---|
| F0 | 0.067 | 0.143 | 0.131 | 0.035 | 0.128 | 0.374 |
| SD F0 | 0.050 | 0.288 | 0.120 | 0.934 | 0.130 | 0.740 |
| HNR | 0.372 | 0.699 | 0.385 | 0.919 | 0.111 | 0.557 |
| Jitter | 0.066 | 0.237 | 0.055 | 0.822 | 0.252 | 0.679 |
| RAP | 0.182 | 0.167 | 0.046 | 0.689 | 0.223 | 0.803 |
| PPQ5 | 0.056 | 0.171 | 0.035 | 0.863 | 0.099 | 0.818 |
| DDP | 0.182 | 0.167 | 0.044 | 0.664 | 0.185 | 0.824 |
| Shimmer | 0.439 | 0.780 | 0.697 | 0.632 | 0.744 | 0.342 |
| APQ3 | 0.467 | 0.551 | 0.794 | 0.250 | 0.212 | 0.199 |
| APQ5 | 0.398 | 0.664 | 0.475 | 0.131 | 0.039 | 0.212 |
| APQ11 | 0.383 | 0.790 | 0.102 | 0.070 | 0.769 | 0.149 |
| DDA | 0.467 | 0.551 | 0.642 | 0.190 | 0.066 | 0.188 |
TABLE A.9. P Values for the Features Estimated From the Entire Signal for the Sustained Vowel for Both Groups (* Indicates Significant Difference Following Bonferroni Correction)

| Feature | Control: Mic vs Smartphone | Control: Mic vs Tablet | Control: Smartphone vs Tablet | HD: Mic vs Smartphone | HD: Mic vs Tablet | HD: Smartphone vs Tablet |
|---|---|---|---|---|---|---|
| F0 | 0.998 | 0.998 | 0.998 | 0.873 | 0.873 | 0.873 |
| SD F0 | 0.448 | 0.050 | 0.147 | 0.424 | 0.424 | 0.766 |
| HNR | 0.000* | 0.000* | 0.041 | 0.002* | 0.001* | 0.575 |
| Jitter | 0.397 | 0.037 | 0.141 | 0.514 | 0.514 | 0.734 |
| RAP | 0.363 | 0.030 | 0.137 | 0.434 | 0.376 | 0.615 |
| PPQ5 | 0.461 | 0.019 | 0.067 | 0.398 | 0.320 | 0.595 |
| DDP | 0.363 | 0.030 | 0.137 | 0.434 | 0.376 | 0.615 |
| Shimmer | 0.000* | 0.000* | 0.124 | 0.004* | <0.001* | 0.315 |
| APQ3 | 0.001* | 0.000* | 0.235 | <0.001* | <0.001* | 0.617 |
| APQ5 | 0.000* | 0.000* | 0.317 | <0.001* | <0.001* | 0.558 |
| APQ11 | 0.000* | 0.000* | 0.047 | <0.001* | <0.001* | 0.226 |
| DDA | 0.001* | 0.000* | 0.235 | <0.001* | <0.001* | 0.617 |
The linear mixed model (B Winter, "Linear models and linear mixed effects models in R with linguistic applications," arXiv:1308.5499, 2013), together with Bland-Altman and correlation analysis, provides a more comprehensive evaluation of the effect of device on acoustic voice features than any single metric alone. The results of the linear mixed models were almost always in agreement with the Bland-Altman plots, confirming significant differences where a significant bias was observed. However, though linear mixed models can control for multiple factors and avoid pseudoreplication, they essentially compare means. The correlation analysis showed that all the features were highly correlated across devices even where significant bias was observed. This is in line with previous findings of high correlation between acoustic voice features from microphones and smartphones examined for control participants and synthesized voice data.
      • Manfredi C
      • Lebacq J
      • Cantarella G
      • et al.
      Smartphones offer new opportunities in clinical voice research.
      ,
      • Lin E
      • Hornibrook J
      • Ormond T
      Evaluating iPhone recordings for acoustic voice assessment.
      Collectively the results suggest that frequency-based features can be accurately derived from recordings with mobile devices, but that amplitude-based features and HNR, though highly correlated across devices, may be sensitive to device type and location.
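      The Bland-Altman statistics referred to above can be sketched in a few lines. This is a minimal illustration in Python, not the study's analysis code, and the paired HNR values are hypothetical:

```python
import numpy as np

def bland_altman(reference, device):
    """Bland-Altman agreement between paired feature estimates.

    reference, device : same feature estimated from the studio
    microphone and the mobile device for the same utterances.
    Returns the mean bias (device minus reference) and the
    95% limits of agreement (bias +/- 1.96 SD of the differences).
    """
    diff = np.asarray(device, dtype=float) - np.asarray(reference, dtype=float)
    bias = diff.mean()                 # systematic offset between devices
    half_width = 1.96 * diff.std(ddof=1)
    return bias, bias - half_width, bias + half_width

# Hypothetical HNR values (dB) for five recordings on each device:
mic_hnr = [21.3, 19.8, 22.1, 20.5, 23.0]
phone_hnr = [19.9, 18.6, 20.8, 19.1, 21.5]
bias, lower, upper = bland_altman(mic_hnr, phone_hnr)
```

      A consistent negative bias with narrow limits of agreement, as in this sketch, is the pattern the paper reports for HNR and shimmer: the devices disagree systematically, yet remain highly correlated.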

      CONCLUSIONS

      Acoustic voice features can be reliably estimated using mobile devices for sustained phonation of the vowel [a:], reading of the Rainbow Passage, and repetition of the syllables [pa], [ta], [ka], [pataka], and [pati], provided consistency in device and device placement is ensured. Caution should be taken with HNR and amplitude-based features such as shimmer, APQ3, APQ5, APQ11, and DDA, as these features are sensitive to environmental noise and signal amplitude, which depend on the position of the device and on microphone quality.
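      To illustrate why amplitude-based features behave this way, local jitter and shimmer can be written as relative perturbation measures. This is a simplified sketch, not the exact formulas used by analysis software such as Praat, which apply additional voicing and windowing criteria:

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive glottal
    periods, relative to the mean period (dimensionless)."""
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / p.mean()

def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive cycle peak
    amplitudes, relative to the mean amplitude (dimensionless)."""
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / a.mean()

# A constant recording gain cancels out of the ratio, so shimmer is
# unchanged by scaling; additive noise or clipping does not cancel,
# which is one reason shimmer varies with device placement.
amps = np.array([0.50, 0.52, 0.49, 0.51])
assert np.isclose(local_shimmer(amps), local_shimmer(2.0 * amps))
```

      Period estimates, by contrast, depend mainly on zero-crossing timing rather than absolute level, consistent with the frequency-based features transferring well across devices.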

      Acknowledgments

      The authors would like to acknowledge the DOMINO-HD Consortium, Bloomfield Health Services, and Neuromuscular Systems Group, in particular Muthukumaran Thangaramanujam, Caitlin McDonald, and Jeremy Liegey for assisting with data collection.

      Appendix A

      REFERENCES

        • Hou MY
        • Hurwitz S
        Using daily text-message reminders to improve adherence with.
        Obstet Gynecol. 2010; 116: 633-640
        • Reyes BA
        • Reljin N
        • Kong Y
        • et al.
        Tidal volume and instantaneous respiration rate estimation using a volumetric surrogate signal acquired via a smartphone camera.
        IEEE J Biomed Health Inform. 2017; 21: 764-777https://doi.org/10.1109/JBHI.2016.2532876
        • Nam Y
        • Kong Y
        • Reyes B
        • et al.
        Monitoring of heart and breathing rates using dual cameras on a smartphone.
        PLoS One. 2016; 11: 1-15https://doi.org/10.1371/journal.pone.0151013
        • Doheny EP
        • Lowery MM
        • Russell A
        • et al.
        Estimation of respiration rate and sleeping position using a wearable accelerometer.
        Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS. 2020; 2020: 4668-4671https://doi.org/10.1109/EMBC44109.2020.9176573
        • Ginis P
        • Nieuwboer A
        • Dorfman M
        • et al.
        Feasibility and effects of home-based smartphone-delivered automated feedback training for gait in people with Parkinson’s disease: a pilot randomized controlled trial.
        Parkinsonism & related disorders. 2016; 22: 28-34https://doi.org/10.1016/j.parkreldis.2015.11.004
        • Larson EC
        • Goel M
        • Boriello G
        • et al.
        SpiroSmart: using a microphone to measure lung function on a mobile phone.
        in: Proceedings of the 2012 ACM Conference on Ubiquitous Computing. 2012: 280-289
        • Vatanparvar K
        • Nathan V
        • Nemati E
        • et al.
        SpeechSpiro: lung function assessment from speech pattern as an alternative to spirometry for mobile health tracking.
        in: Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS. IEEE, 2021: 7237-7243https://doi.org/10.1109/EMBC46164.2021.9630077
        • Horin AP
        • McNeely ME
        • Harrison EC
        • et al.
        Usability of a daily mHealth application designed to address mobility, speech and dexterity in Parkinson’s disease.
        Neurodegener Dis Manag. 2019; 9: 97-105https://doi.org/10.2217/nmt-2018-0036
        • Wilson R
        • Small J
        Care staff perspectives on using mobile technology to support communication in long-term care: mixed methods study.
        JMIR Nurs. 2020; 3: e21881https://doi.org/10.2196/21881
        • Hussein WF
        • et al.
        The mobile health readiness of people receiving in-center hemodialysis and home dialysis.
        Clin J Am Soc Nephrol. 2021; 16: 98-106https://doi.org/10.2215/CJN.11690720
        • Volkmann J
        • Hefter H
        • Lange HW
        • et al.
        Impairment of temporal organization of speech in basal ganglia diseases.
        Brain Lang. 1992; 43: 386-399https://doi.org/10.1016/0093-934X(92)90108-Q
        • Scott Kelso JA
        • Tuller B
        • Harris KS
        A ‘dynamic pattern’ perspective on the control and coordination of movement.
        The Production of Speech. 1983; : 137-173https://doi.org/10.1007/978-1-4613-8202-7_7
        • Smith A
        • Mcfarland DH
        • Weber CM
        Interactions between speech and finger movements.
        J Speech Lang Hear Res. 1986; 29: 471-480https://doi.org/10.1044/jshr.2904.471
        • Oliveira G
        • Fava G
        • Baglione M
        • et al.
        Mobile digital recording: adequacy of the iRig and IOS device for acoustic and perceptual analysis of normal voice.
        J. Voice. 2017; 31: 236-242https://doi.org/10.1016/j.jvoice.2016.05.023
        • Vogel AP
        • Rosen KM
        • Morgan AT
        • et al.
        Comparability of modern recording devices for speech analysis: smartphone, landline, laptop, and hard disc recorder.
        Folia Phoniatr Logop. 2014; 66: 244-250https://doi.org/10.1159/000368227
        • Jannetts S
        • Schaeffler F
        • Beck J
        • et al.
        Assessing voice health using smartphones: bias and random error of acoustic voice parameters captured by different smartphone types.
        Int J Lang Commun Disord. 2019; 54: 292-305https://doi.org/10.1111/1460-6984.12457
        • Zhang C
        • Jepson K
        • Lohfink G
        • et al.
        Comparing acoustic analyses of speech data collected remotely.
        J Acoust Soc Am. 2021; 149: 3910-3916https://doi.org/10.1121/10.0005132
        • Schaeffler F
        • Jannetts S
        • Beck J
        Reliability of Clinical Voice Parameters Captured With Smartphones – Measurements of Added Noise and Spectral Tilt.
        Proc. 20th Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH. ISCA, Graz, Austria2019: 2523-2527
        • Manfredi C
        • Lebacq J
        • Cantarella G
        • et al.
        Smartphones offer new opportunities in clinical voice research.
        J Voice. 2017; 31: 111.e1-111.e7https://doi.org/10.1016/j.jvoice.2015.12.020
        • Lebacq J
        • Schoentgen J
        • Cantarella G
        • et al.
        Maximal ambient noise levels and type of voice material required for valid use of smartphones in clinical voice research.
        J Voice. 2017; 31: 550-556https://doi.org/10.1016/j.jvoice.2017.02.017
        • Kim G-H
        • Kang D-H
        • Lee Y-Y
        • et al.
        Recording quality of smartphone for acoustic analysis.
        J Clin Otolaryngol Head Neck Surg. 2016; 27: 286-294https://doi.org/10.35420/jcohns.2016.27.2.286
        • Maryn Y
        • Ysenbaert F
        • Zarowski A
        • et al.
        Mobile communication devices, ambient noise, and acoustic voice measures.
        J Voice. 2017; 31: 248.e11-248.e23https://doi.org/10.1016/j.jvoice.2016.07.023
        • van der Woerd B
        • Wu M
        • Parsa V
        • et al.
        Evaluation of acoustic analyses of voice in nonoptimized conditions.
        J Speech Lang Hear Res. 2020; 63: 3991-3999https://doi.org/10.1044/2020_JSLHR-20-00212
        • Uloza V
        • Ulozaitė-Stanienė N
        • Petrauskas T
        • et al.
        Accuracy of acoustic voice quality index captured with a smartphone – measurements with added ambient noise.
        J Voice. 2021; (In press)https://doi.org/10.1016/j.jvoice.2021.01.025
        • Bocklet T
        • Steidl S
        • Nöth E
        • et al.
        Automatic evaluation of Parkinson's speech - acoustic, prosodic and voice related cues.
        in: Proc Annu Conf Int Speech Commun Assoc INTERSPEECH. 2013: 1149-1153https://doi.org/10.21437/interspeech.2013-313
        • Portnoy RA
        • Aronson AE
        Diadochokinetic syllable rate and regularity in normal and in spastic and ataxic dysarthric subjects.
        J Speech Hear Disord. 1982; 47: 324-328
        • Orozco-Arroyave JR
        • Vásquez-Correa JC
        • Klumpp P
        • et al.
        Apkinson: the smartphone application for telemonitoring Parkinson’s patients through speech, gait and hands movement.
        Neurodegener Dis Manag. 2020; 10: 137-157https://doi.org/10.2217/nmt-2019-0037
        • Tsanas A
        • Little MA
        • McSharry PE
        • et al.
        Accurate telemonitoring of Parkinsons disease progression by noninvasive speech tests.
        IEEE Trans Biomed Eng. 2010; 57: 884-893https://doi.org/10.1109/TBME.2009.2036000
        • Rusz J
        • Hlavnička J
        • Tykalova T
        • et al.
        Smartphone allows capture of speech abnormalities associated with high risk of developing Parkinson’s disease.
        IEEE Trans Neural Syst Rehabil Eng. 2018; 26: 1495-1507https://doi.org/10.1109/TNSRE.2018.2851787
        • Jeancolas L
        • Mangone G
        • Corvol JC
        • et al.
        Comparison of telephone recordings and professional microphone recordings for early detection of Parkinson’s disease, using mel-frequency cepstral coefficients with Gaussian mixture models.
        in: INTERSPEECH 2019: 20th Annual Conference of the International Speech Communication Association. International Speech Communication Association (ISCA), 2019: 3033-3037
        • Patel RR
        • Awan SN
        • et al.
        Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association Expert Panel to Develop a Protocol for Instrumental Assessment of Vocal Function.
        Am J Speech Lang Pathol. 2018; 27: 887-905https://doi.org/10.1044/2018_AJSLP-17-0009
        • Fairbanks G
        The rainbow passage.
        Voice Articul Drillb. 1960; 2: 127
        • Rusz J
        • Tykalova T
        • Ramig LO
        • et al.
        Guidelines for speech recording and acoustic analyses in dysarthrias of movement disorders.
        Mov Disord. 2021; 36: 803-814https://doi.org/10.1002/mds.28465
        • Kaiser JF
        On a simple algorithm to calculate the ‘energy’ of a signal.
        ICASSP IEEE Int Conf Acoust Speech Signal Process - Proc. 1990; 1: 381-384https://doi.org/10.1109/icassp.1990.115702
        • Tsanas A
        • Little MA
        • McSharry PE
        • et al.
        Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity.
        J R Soc Interface. 2011; 8: 842-855https://doi.org/10.1098/rsif.2010.0456
        • Rusz J
        • Saft C
        • Schlegel U
        • et al.
        Phonatory dysfunction as a preclinical symptom of Huntington disease.
        PLoS One. 2014; 9: 1-7https://doi.org/10.1371/journal.pone.0113412
        • Boersma P
        Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound.
        IFA Proc. 1993; 17: 97-110
        • de Cheveigné A
        • Kawahara H
        YIN, a fundamental frequency estimator for speech and music.
        J Acoust Soc Am. 2002; 111: 1917-1930https://doi.org/10.1121/1.1458024
        • Kasi K
        • Zahorian SA
        Yet another algorithm for pitch tracking.
        ICASSP IEEE Int Conf Acoust Speech Signal Process - Proc. 2002; 1: 361-364https://doi.org/10.1109/icassp.2002.5743729
        • de Cheveigné A
        Speech f0 extraction based on Licklider's pitch perception model.
        in: Proc. ICPhS. 1991: 3-6
        • McFee B
        • Raffel C
        • Liang D
        • et al.
        librosa: audio and music signal analysis in python.
        in: Proc 14th Python Sci Conf. 2015: 18-24https://doi.org/10.25080/majora-7b98e3ed-003
        • Brookes M
        Voicebox: speech processing toolbox for matlab.
        Software. 1997; 47: 45
        • Bland JM
        • Altman DG
        Statistical methods for assessing agreement between two methods of clinical measurement.
        Int J Nurs Stud. 2010; 47: 931-936https://doi.org/10.1016/j.ijnurstu.2009.10.001
        • Winter B
        Linear models and linear mixed effects models in R with linguistic applications.
        arXiv preprint. 2013; arXiv:1308.5499
        • Bates D
        • Mächler M
        • Bolker B
        • et al.
        Fitting linear mixed-effects models using lme4.
        arXiv preprint. 2014; arXiv:1406.5823

        • Uloza V
        • Padervinskis E
        • Vegiene A
        • et al.
        Exploring the feasibility of smart phone microphone for measurement of acoustic voice parameters and voice pathology screening.
        Eur Arch Oto-Rhino-Laryngol. 2015; 272: 3391-3399https://doi.org/10.1007/s00405-015-3708-4
        • Teixeira JP
        • Oliveira C
        • Lopes C
        Vocal acoustic analysis – jitter, shimmer and HNR parameters.
        Proc Technol. 2013; 9: 1112-1122https://doi.org/10.1016/j.protcy.2013.12.124
        • Robin J
        • Harrison JE
        • Kaufman LD
        • et al.
        Evaluation of speech-based digital biomarkers: review and recommendations.
        Digit Biomark. 2020; 4: 99-108https://doi.org/10.1159/000510820
        • Guidi A
        • Salvi S
        • Ottaviano M
        • et al.
        Smartphone application for the analysis of prosodic features in running speech with a focus on bipolar disorders: system performance evaluation and case study.
        Sensors (Switzerland). 2015; 15: 28070-28087https://doi.org/10.3390/s151128070
        • Nittrouer S
        • McGowan RS
        • Milenkovic PH
        • et al.
        Acoustic measurements of men's and women's voices.
        J Speech Lang Hear Res. 1990; 33: 761-775https://doi.org/10.1044/jshr.3304.761
        • Lin E
        • Hornibrook J
        • Ormond T
        Evaluating iPhone recordings for acoustic voice assessment.
        Folia Phoniatr Logop. 2012; 64: 122-130https://doi.org/10.1159/000335874