The Effect of Noise on Computer-Aided Measures of Voice: A Comparison of CSpeechSP and the Multi-Dimensional Voice Program Software Using the CSL 4300B Module and Multi-Speech for Windows☆
Accepted 13 May 2002.
Abstract  Summary: The effect of noise on computer-derived samples of voice was compared across three different hardware/software configurations. The hardware/software systems included a stand-alone A/D converter (CSL Module 4300B) coupled to a custom Pentium PC used in conjunction with the Multi-Dimensional Voice Program (MDVP) software, and a Creative Labs A/D converter coupled to the same custom PC under software control of MDVP/Multi-Speech and CSpeechSP. Voice samples were taken from 10 female subjects, then mixed with computer fan noise creating three different signal-to-noise (S/N) levels. Mixed signals were analyzed on the three hardware/software systems. Results revealed that fundamental frequency was most resistant to the degradation effect of noise across systems; jitter and shimmer values, however, were more variable across all configurations. Jitter and shimmer values were significantly higher under certain S/N levels for the MDVP 4300B based system as compared to MDVP for Multi-Speech and CSpeechSP. The findings punctuate the need for sensitivity to recording environments, careful selection of hardware/software equipment arrays, and the establishment of minimal recording conditions (>25 dBA S/N) for voice sampling and analysis using computer-assisted methods.
INTRODUCTION  Perturbation measures such as jitter and shimmer as well as fundamental frequency (F0) have been used to estimate a variety of vocal fold conditions (normal and pathological) for the last three decades. In the past, obtaining these estimates required expensive hardware and software. Thus, instrumental analysis of voice was most often conducted at large clinical institutions or research institutions.1 As technology developed, computer-assisted programs to obtain estimates of voice became less expensive, and more accessible to the practicing clinician. With the increasing popularity of computer analysis systems came a growing concern about the reliability and validity of estimates derived from these systems. Perturbation values may be altered through hardware or software configurations, and this in turn may influence the accuracy of results (for example, see Titze and Liang2). Findings from a number of studies have shown that there are many factors that can affect computer-derived perturbation measurements. Factors include: (a) the type of recording system used;3–6 (b) microphone type, placement, and angle;7 (c) analysis window length;6,8 (d) F0 extraction methods and associated algorithms for determining voice statistics;2,9–11 (e) environmental noise;12–15 and (f) the effects of internal computer noise on signal conversions.16 In this study we were most concerned with the effects of environmental noise on perturbation estimates from three signal-processing computer configurations. An optimal recording environment consists of a sound-treated booth with shields against ambient noise.13,14 It is likely, however, that as computer analysis systems have become more widely used in nonresearch clinical settings, voice samples may not be collected in the most desirable environments.
The effect of capturing samples in less than ideal settings is that noise-induced error can occur, threatening the validity of computer-generated measures of voice. In general, this type of error is additive in nature.13,15 It has been suggested that environments with signal-to-noise (S/N) ratios of greater than 15 dBA may not be appropriate for voice recordings.13 That suggestion, however, was based on results from an experiment that examined voice estimates derived from one computer software program, Kay Elemetrics Multi-Dimensional Voice Program (MDVP) 4305,16 in the presence of varying S/N conditions. In that study, data were acquired and processed through Computerized Speech Lab (CSL) using an external module. Findings were specific to the CSL 4300B. It is important to determine whether different software and hardware configurations might be more or less vulnerable to the effects of external (environmental or processing/component related) and internal (computer-generated) noise. The purpose of this investigation was to determine if computer-derived estimates of voice statistics were different across three hardware/software configurations. Stimuli consisted of voice samples mixed with external environmental noise. The system arrays compared included two configurations manufactured by Kay Elemetrics, namely, the Multi-Dimensional Voice Program (MDVP) option for CSL 4300B,16 and the MDVP option for Multi-Speech, which runs via Microsoft Windows.17 CSpeechSP,18 a Windows-based software program developed by Paul Milenkovic, was also used. The Kay Elemetrics products were selected because of their frequency of use in research and clinical environments. Further, Kay Elemetrics markets CSL as its “best” speech analysis system. CSpeechSP was used because it allows the importation of Kay Elemetrics file formats so that all signals could be captured through a defined hardware configuration, and analyzed using a least mean squares algorithm.11
METHODS  Stimuli The stimuli used in this study consisted of computer-generated fan noise mixed with voice samples. Fan noise was collected from a clinical laboratory computer (custom made 486 DX266 system) and recorded on a Tascam cassette recorder (112 MK II (TEAC Corporation, Montebello, CA)) using a Shure microphone (SM48 (Shure Brothers Incorporated, Evanston, IL)) that was positioned approximately 15 centimeters from the fan. The center frequency of the noise was approximately 235 Hz. Subjects Voice samples from 10 female subjects, ages 19 to 25 (M = 22) years old were used in this experiment. The average fundamental frequency of women's voices approached the center frequency of the noise. All subjects met the following criteria: (1) a history of normal speech and hearing (verified on the day of the recording), (2) normal health history as well as good health at the time of the experiment, (3) no history of smoking, (4) no professional voice training, (5) ability to match the fundamental frequency (F0) of a generated 235 Hz triangular tone (Hewlett Packard 33120A function/arbitrary waveform) within ±5 Hz, and (6) ability to maintain an acoustic intensity at 85 dBA ±2 dB (Bruel and Kjaer 2230 Sound Level Meter). Subjects passed a hearing screening before the voice recording took place (20 dB HL for the frequencies of 500, 1000, 2000, and 4000 Hz). Procedure Voice acoustic signals were recorded in a sound-treated booth (Industrial Acoustics Corporation, Santa Cruz, CA). The noise stimulus was captured in a 20′×12′×9′ laboratory room. Ambient noise level in the booth was 35 dBA, and 58 dBA in the laboratory room (Bruel and Kjaer 2230 sound level meter). The triangular wave (Hewlett Packard 33120A) used for pitch matching was amplified (Yamaha P2075) and delivered monaurally into the right earphone of a Telephonics TDH-50 headphone (after Gelfand19). 
Instrument calibration checks were performed daily (Quest Electronics 08-45 Octave Band Filter with a 9A coupler and a FC-3 frequency counter). For the voice recording, a microphone (Shure SM 48) was positioned 10 cm from the mouth of each subject. The microphone of the sound level meter was positioned directly above the Shure microphone, and at the same distance to the mouth of the subjects. Each subject was instructed to sustain voiced /a/ for 10 seconds while matching the pitch, within ±5 Hz, of the generated triangular tone at a loudness level of 85 dBA. Each subject was also given the opportunity to monitor pitch by viewing the digital display of the frequency counter (Quest Electronics FC-3 frequency counter). Ten productions per subject were recorded to a digital audiotape (DAT) recorder (Panasonic SV-255; 16-bit resolution at a sample rate of 48 kHz) located outside the booth. Once recorded, each of the subject's 10 productions was analyzed directly from the DAT recorder through MDVP for Multi-Speech to verify that the subject met the F0 requirements, and that their relative average perturbation (RAP) and amplitude perturbation quotient (APQ) values were within “normal range” (RAP<0.68 and APQ<3.07, Kay Elemetrics, Lincoln Park, NJ).16 Peak-amplitude variation (vAm) was also considered in choosing samples. A single production from each subject was then selected that varied less than 2 Hz from an average of 235 Hz while maintaining the lowest RAP, APQ, and vAm values. To establish signal-to-noise (S/N) ratio conditions, each selected voice sample and recorded noise sample were amplified (Yamaha P2075) and played through JBL speakers (Model Control 5). A microphone (Electro-Voice Model RE20) was placed 10 cm from the speaker source.
The Electro-Voice microphone was used in place of the original recording microphone (Shure SM 48) to optimize impedance match between the microphone and the Creative Labs AWE64 sound card used to digitize signals from MDVP for Multi-Speech and CSpeechSP. The sound level meter was directed toward the speakers and positioned at approximately the same level as the microphone. Three different S/N levels were established (25, 20, and 15 dB S/N) and digitized directly into three systems. The S/N levels used were selected on the basis of clinical and laboratory experience with noise floors. The sound level of voice stimuli was verified (Bruel and Kjaer 2230 sound level meter) and maintained at 85 dB SPL, while noise levels were selected to create the S/N levels. Each voice stimulus was captured initially without the presence of the experimental noise. The nominal S/N level for this capture was 50 dB, that is, the level of the voice signal (85 dB SPL) minus the noise floor in the booth (35 dB SPL). Mixed signals were then captured directly and sequentially by the following three computer arrangements: (1) MDVP for CSL 4300B (48 kHz sample rate), (2) MDVP for Multi-Speech (48 kHz sample rate), and (3) CSpeechSP (44.1 kHz sample rate). All samples had 16-bit resolution. In the CSpeechSP configuration, it was necessary to amplify the microphone signals using a Mic-Line Driver (FP-MLD). All three systems operated on the same computer (i.e., same motherboard, same power supply, same monitor and monitor distance to sound boards). MDVP for Multi-Speech and CSpeechSP shared a Creative Labs Sound Blaster AWE64 audio card, whereas MDVP for 4300B used the external Kay Elemetrics module for signal acquisition. The mixed signals were reduced to a 3-second mid-portion for subsequent processing. Analysis Measures of F0, jitter, and shimmer were used in this study.
Although CSpeechSP and the Kay Elemetrics software programs use different methods for calculating these three measures, the acoustic signals were analyzed only using CSpeechSP. Thus, all measures were derived from the least mean squares formula.11 Signals captured into the Kay Elemetrics software and hardware arrangements (MDVP for CSL 4300B and for Windows) were imported to CSpeechSP. A comparison between the waveforms of the original and imported versions showed no file corruption. The CSpeechSP program expresses jitter in milliseconds and shimmer in percent. Repeated Measures MANOVAs were used to determine significant within subject differences on each variable (F0, jitter, and shimmer) by software programs/hardware arrangements (device) and by S/N levels (condition) (SPSS 9.0 for Windows). Within subjects post hoc comparisons with associated Bonferroni20 adjustments were used when significant main effects were found. Dependent t-tests with associated Bonferroni adjustments were used when significant interaction effects were noted. An alpha level of <0.05 was established.
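The level arithmetic behind the S/N conditions described in the Procedure can be sketched as follows. This is an illustrative sketch only, assuming that S/N in dB is simply the voice level minus the noise level; the function names are hypothetical, and the noise levels it prints are derived from that arithmetic rather than reported in the study.

```python
# Illustrative sketch: S/N level arithmetic, assuming S/N (dB) = signal level - noise level.
# Function and constant names are hypothetical; they do not come from the study.

def snr_db(signal_level_db: float, noise_level_db: float) -> float:
    """Return the S/N level in dB for a given signal and noise level."""
    return signal_level_db - noise_level_db


def noise_level_for_target(signal_level_db: float, target_snr_db: float) -> float:
    """Return the noise level (dB) needed to reach a target S/N at a fixed signal level."""
    return signal_level_db - target_snr_db


VOICE_LEVEL_DB = 85.0         # voice stimuli were maintained at 85 dB
BOOTH_NOISE_FLOOR_DB = 35.0   # ambient noise floor measured in the booth
TARGET_SNR_DB = (25.0, 20.0, 15.0)

# Nominal S/N for the capture made without experimental noise.
nominal_snr = snr_db(VOICE_LEVEL_DB, BOOTH_NOISE_FLOOR_DB)

# Noise playback levels implied by each target S/N (derived, not reported).
noise_levels = {t: noise_level_for_target(VOICE_LEVEL_DB, t) for t in TARGET_SNR_DB}

if __name__ == "__main__":
    print(f"nominal S/N: {nominal_snr:.0f} dB")
    for target, level in noise_levels.items():
        print(f"target S/N {target:.0f} dB -> noise level {level:.0f} dB")
```

Under these assumptions, the no-noise capture corresponds to the nominal 50 dB S/N stated in the Procedure.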
RESULTS  Means and standard deviations for F0, jitter, and shimmer at each S/N level for each device are shown in Table 1. For F0, values remained stable regardless of the device or S/N condition. Jitter values in milliseconds were relatively stable at the 25 and 20 dB S/N conditions, and increased in the worst S/N condition (15 dB) across devices. Jitter values are also expressed in percent and displayed in Table 1. We converted the average jitter values from milliseconds to percent to provide information on jitter per recommendations from the National Center for Voice and Speech.21 Shimmer values followed the trend of increasing values as noise levels increased across devices. Jitter Significant main effects for device (Wilks' lambda = 0.42, F = 5.43, P = 0.03) and condition (Wilks' lambda = 0.07, F = 32.32, P < 0.001) were found; however, there was a significant interaction effect (Wilks' lambda = 0.09, F = 6.66, P = 0.04). Results from dependent t-tests comparing jitter values at different S/N levels per computer software system are displayed in Table 2. Those results revealed that: (a) for the MDVP 4300B unit, significant differences were noted between jitter values recorded with no additive noise (50 dB) and those at 15 dB S/N, between 25 dB and 20 dB S/N, between 25 dB and 15 dB S/N, and between 20 dB and 15 dB S/N; (b) for MDVP operating with Multi-Speech, three of the jitter comparisons were significant; those differences were noted between jitter values recorded with no additive environmental noise (50 dB) and those at 15 dB S/N, between 25 dB and 15 dB S/N, and between 20 dB and 15 dB S/N; (c) for the CSpeechSP program, four of the six comparisons were significant: those recorded with no additive noise (50 dB) versus 20 dB S/N, no additive noise (50 dB) versus 15 dB S/N, 25 versus 20 dB S/N, and 25 versus 15 dB S/N.
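The six pairwise S/N comparisons listed above, and a Bonferroni-adjusted per-test alpha for them, can be sketched as follows. This is a minimal sketch under an assumption the text does not state explicitly: that the family of comparisons for one device is the six pairs among the four S/N conditions, with the family-wise alpha of 0.05 divided evenly across them. The function name is hypothetical.

```python
# Minimal sketch of a Bonferroni adjustment for pairwise S/N comparisons.
# Assumption (not stated in the study): the family is the six pairs among four
# S/N conditions within one device, sharing a family-wise alpha of 0.05.
from itertools import combinations


def bonferroni_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Return the per-comparison alpha after a Bonferroni adjustment."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return family_alpha / n_comparisons


SNR_CONDITIONS_DB = (50, 25, 20, 15)  # no additive noise, then the three mixed levels

# Every unordered pair of S/N conditions.
pairs = list(combinations(SNR_CONDITIONS_DB, 2))
per_test_alpha = bonferroni_alpha(0.05, len(pairs))

if __name__ == "__main__":
    for a, b in pairs:
        print(f"{a} dB vs {b} dB S/N")
    print(f"{len(pairs)} comparisons, per-test alpha = {per_test_alpha:.4f}")
```

The same function would apply to the device-by-device comparisons at a fixed S/N level, where three devices give three pairs.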
In all cases, jitter values recorded with additive noise were higher than those estimated without the presence of additive environmental noise. As shown in Table 3, only one significant difference was found when making paired comparisons of the three computerized voice analysis systems at each of the S/N conditions. Jitter values recorded on MDVP for CSL 4300B were significantly higher than those recorded on CSpeechSP at 15 dB S/N (see Table 1 for mean values). Shimmer Main effects and the interaction effect were significant for shimmer values (device, Wilks' lambda = 0.11, F = 32.55, P < 0.001; condition, Wilks' lambda = 0.01, F = 227.55, P < 0.001; device by condition, Wilks' lambda = 0.08, F = 7.39, P = 0.04). Post hoc testing, comparing shimmer values recorded per device at each of the S/N conditions, revealed that all comparisons were significantly different (see Table 4). Significant differences were found on all comparisons between the varying S/N conditions, regardless of the computerized voice analysis system. The trend was that as noise floors increased, shimmer values increased, regardless of the analysis system. When making pair-wise comparisons among the devices with S/N level held constant, as shown in Table 5, significant differences were noted between the MDVP for the 4300B unit and MDVP for Multi-Speech at 25 dB and 15 dB S/N conditions. In both instances, shimmer values recorded from the 4300B unit were higher than those from MDVP for Multi-Speech, which can be seen in Table 1. Note that differences between these two devices at the 20 dB S/N level approached significance (P = 0.06). Likewise, values taken from voice analyses using MDVP for the 4300B unit were significantly higher than those recorded from CSpeechSP at the 25, 20, and 15 dB S/N levels.
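The conversion of jitter from milliseconds to percent mentioned in the Results can be sketched as follows. The study does not give its conversion formula, so this sketch assumes the common definition in which absolute jitter is expressed as a percentage of the mean glottal period, with the mean period taken as the reciprocal of mean F0. The function name and the 0.02 ms input are hypothetical illustration values, not data from Table 1.

```python
# Sketch of a jitter conversion from milliseconds to percent.
# Assumption (the study does not state its formula): percent jitter is absolute
# jitter divided by the mean period, where mean period = 1 / mean F0.

def jitter_ms_to_percent(jitter_ms: float, f0_hz: float) -> float:
    """Express absolute jitter (ms) as a percentage of the mean period."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    mean_period_ms = 1000.0 / f0_hz
    return 100.0 * jitter_ms / mean_period_ms


if __name__ == "__main__":
    # Hypothetical jitter value at the 235 Hz target F0 used for pitch matching.
    example = jitter_ms_to_percent(0.02, 235.0)
    print(f"0.02 ms at 235 Hz is about {example:.2f}% jitter")
```

Because the subjects matched a 235 Hz tone, the mean period is about 4.26 ms, so a given jitter value in milliseconds maps to nearly the same percent value across subjects under this assumption.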
DISCUSSION  As has been found previously,13,15 F0 was the most resistant measure to the effects of additive noise, while jitter and shimmer values were adversely affected by noise background across the three hardware/software configurations. The systems yielded comparable estimates of F0 and jitter at each of the different S/N levels with one exception. The MDVP software program using the 4300B unit had significantly higher jitter values than those recorded from CSpeechSP at the worst S/N condition (15 dB). Although shimmer values were affected by noise conditions on all three software systems, those values were significantly higher (or approached significance) when taken from MDVP using the 4300B unit as compared to MDVP for Multi-Speech and CSpeechSP. This was a surprising finding, since the 4300B unit uses an external analog-to-digital (A/D) converter module. Literature from Kay Elemetrics17 indicates that when signal conversion is done outside the computer, away from sources of electronic noise such as the motherboard and power supply, the signal may be less corrupted by noise than signals processed by internal sound cards. According to this manufacturer, plug-in cards are more susceptible to problems such as polarity, drift, power supply hum, electromagnetic noise, as well as fan noise. To our knowledge, these assumptions have not been empirically tested, and thus should be interpreted with extreme caution. In fact, location of the digitization board may have little, if any, effect on interference from noise. Our results seem to support this last notion. In this experiment, signals processed through MDVP using the 4300B unit were not less susceptible to noise than signals digitized through the sound card located in the computer. The increased availability and affordability of computers and Windows-based software programs make the job of analyzing voice disorders and tracking progress seductively quick and easy.
The numeric output of such software programs, however, is only as accurate (and clinically valuable) as the weakest link in the recording and signal processing ensemble. The results of this study contribute to clinical knowledge and add further support for at least a two-prong approach to voice description, that is, coupling computer-assisted analysis with clinical judgment. Findings from this investigation suggest that the type of system is an issue when recording in the presence of noise, especially for shimmer estimates because they are particularly sensitive to noise-induced error (see Table 5). We found that voice estimates from MDVP for Multi-Speech and CSpeechSP were similarly affected by increasing noise floors. Results from MDVP using the 4300B unit, however, proved to be most affected by the presence of environmental noise. At an S/N level of 50 dB, shimmer estimates appear to be valid for all three of the software/hardware configurations studied. At the S/N levels of 25 dB or less, however, noise-induced error was found for shimmer results taken from the 4300B unit. Noise induced errors pose a threat to the analysis of voice disorders. When types of noise (e.g., environmental) are added to the voice signal, estimates of statistical thresholds are susceptible to false-positives. It is possible for voices that are “normal” to be identified as “pathological” because of the negative (mainly additive) effects of the noise.15 Use of computer software to track a patient's progress requires careful consideration of the recording conditions each time a voice sample is collected. Because Windows-based applications utilize a wide range of multi-purpose sound cards, it is essential to examine the specifications of the sound card installed. In this study a Creative Labs Sound Blaster AWE64 16-bit card was used. A different sound card may generate similar results to ours, or introduce new concerns. 
The type of background environmental noise present during voice recording is another issue worthy of consideration. Because the type of noise is directly related to the frequency specificity and amplitude, it is possible that other results may be obtained when qualitatively different noise biases are present. In this study we investigated the effects on perturbation measurements in the presence of the turbulent noise of a common computer fan. However, impulse noises such as an elevator start-up, people talking, or footsteps in a hallway could have different effects on the computer analysis results.
CONCLUSION  This study underscores the historical concern for audio recordings made in the presence of noise. Our findings contribute to that concern by demonstrating the relation between additive noise and measures of fundamental frequency, jitter, and shimmer. Appropriate recording standards are needed so that valid and reliable results are obtained relative to voice sources, recording processes, and analysis systems.
Acknowledgements  The authors wish to thank E.J. McDonald, former Clinic Director in Audiology, University of Wyoming, for his assistance with monitoring sound levels. We also want to thank Brian Sinicki, graduate of Computer Science, University of Wyoming, for his technical assistance with hardware/software interface. Dr. Alan Moore, University of Wyoming, is acknowledged for his statistical advice. Thanks are extended to Lee Butero and Natasha Clatterbaugh, University of Northern Colorado, for their editorial contributions to the manuscript. Part of this project was funded by the Kahn Foundation, University of Wyoming.
References
1. Karnell MP. Fundamental frequency and perturbation measurement. Seminars Speech Lang. 1991;12:88–97.
2. Titze IR, Liang H. Comparison of F0 extraction methods for high-precision voice perturbation measurements. J Speech Hear Res. 1993;36:1120–1133.
3. Doherty ET, Shipp T. Tape recorder effects on jitter and shimmer extraction. J Speech Hear Res. 1988;31:485–490.
4. Gelfer PG, Fendel DM. Comparisons of jitter, shimmer, and signal-to-noise ratio from directly digitized versus taped voice samples. J Voice. 1995;9:378–382.
5. Perry CK, Ingrisano DR-S, Blair WB. The influence of recording systems on jitter and shimmer estimates. Am J Speech-Lang Pathol. 1996;5(2):86–90.
6. Titze IR, Horii Y, Scherer R. Some technical considerations in voice perturbation measurements. J Speech Hear Res. 1987;30:252–260.
7. Titze IR, Winholtz WS. Effect of microphone type and placement on voice perturbation measurements. J Speech Hear Res. 1993;36:1177–1190.
8. Karnell MP. Laryngeal perturbation analysis: minimum length of window analysis. J Speech Hear Res. 1991;34:544–548.
9. Bielamowicz S, Kreiman J, Gerratt BR, Dauer MS, Berke GS. Comparison of voice analysis systems for perturbation measurement. J Speech Hear Res. 1996;39:126–134.
10. Karnell MP, Hall KD, Landahl KL. Comparison of fundamental frequency and perturbation measurements among three analysis systems. J Voice. 1995;9:383–393.
11. Milenkovic P. Least mean square measures of voice perturbation. J Speech Hear Res. 1987;30:529–538.
12. Ingrisano DR-S, Perry CK, McDonald EJ. Establishing clinical standards in recording speech and voice: one approach. Rocky Mountain J Comm Dis. 1995;11(Fall):17–21.
13. Ingrisano DR-S, Perry CK, Jepson KR. Environmental noise: a threat to automatic voice analysis. Am J Speech-Lang Pathol. 1998;7(1):91–96.
14. Laver J, Hiller S, Beck JM. Acoustic waveform perturbations and voice disorders. J Voice. 1992;6:115–126.
15. Perry CK, Ingrisano DR-S, Palmer MA, McDonald EJ. Effects of environmental noise on computer-derived voice estimates from female speakers. J Voice. 2000;14:146–153.
16. Kay Elemetrics Corporation. Operations Manual: Multi-Dimensional Voice Program (MDVP) Model 4305. Lincoln Park, NJ: Kay Elemetrics Corporation; 1993.
17. Kay Elemetrics Corporation. Software Instruction Manual: Multispeech, Model 3700 CSL for Windows, Models 4100, 4300B, Version 2.3. Lincoln Park, NJ: Kay Elemetrics Corporation; 1999.
18. Milenkovic P. CSpeechSP User's Manual. Madison, WI: Paul Milenkovic; 1997.
19. Gelfand SA. Hearing: an introduction of psychological and physiological acoustics. New York, NY: Marcel Dekker, Inc; 1981.
20. Dunn OJ. Multiple comparisons among means. J Am Stat Assoc. 1961;56:52–64.
21. National Center for Voice and Speech. Workshop on Acoustic Voice Analysis: Summary Statement. Iowa City, IA: National Center for Voice and Speech; 1994.
∗ University of Wyoming, Division of Communicative Disorders, Laramie, Wyoming, USA † Speech Science Laboratory, University of Northern Colorado, Greeley, Colorado, USA Address correspondence and reprint requests to Cecyle Perry Carson, Associate Professor, University of Wyoming, Division of Communicative Disorders, P.O. Box 3311, Laramie, WY 82071-3311, USA
☆ Presented in part at the annual convention of the American Speech-Language-Hearing Association, Washington, DC, 2000. PII: S0892-1997(03)00031-6 doi:10.1016/S0892-1997(03)00031-6 © 2003 The Voice Foundation. Published by Elsevier Inc. All rights reserved.