The Vocal Tract Organ has had a number of iterations resulting from advances in available technology as well as requirements of perceptual experiments and performance paradigms. The objective of this paper is to relate the development history of the Vocal Tract Organ from the original vision to what it is today as a modern version of the Vox Humana pipe organ stop for application in voice production and perception research.
The latest Vocal Tract Organ is a polyphonic, eight-channel, eight-stop, one-manual instrument that enables tab-stop-selected 3-D printed vocal tracts to be used to create sound. This version includes eight stops (four for female vowel oral tracts and four for male vowel oral tracts). The stops are implemented using conventionally engraved pipe organ stop tabs labeled “Vox Humana Female” or “Vox Humana Male” followed by the 3-D printed vowel: “EE”, “AH”, “ER” or “UU.” This version is described alongside the development stages from which it emerged, covering all previous versions of the Vocal Tract Organ.
At the heart of the latest instrument is a Bela BeagleBone Black with a Bela cape audio expander board incorporating eight 16-bit audio outputs at a 44.1 kHz sampling rate (earlier versions based on the Arduino Mega board were limited to 8-bit audio at a 16.384 kHz sampling rate, which limited the overall output spectrum). The latest Vocal Tract Organ is programmed in the graphical audio programming language Pure Data, which is directly compatible with the Bela system. The Pure Data patch creates eight larynx outputs at the pitches set by the keys depressed on the keyboard, and these are routed to Vocal Tract Organ loudspeakers with 3-D printed vocal tracts attached.
The Bela system has enabled real-time synthesis of eight-note polyphonic sounds to eight separate 3-D printed vocal tracts, each selectable via an organ tab-stop switch. The instrument is cased in a purpose-designed and built prototype laser-cut enclosure that incorporates the eight tab stops, a MIDI keyboard input, a pipe organ style swell (volume) pedal connection, four stereo (eight-channel) audio amplifiers, and terminal connections for the eight loudspeakers.
The Vocal Tract Organ functions as a musical instrument for performance and as an instrument for vowel and pitch perception research. Implementing it with the Bela family of processors allows for a low audio latency of 1 ms and rapid prototyping, since it can be programmed directly with the high-level graphical audio programming language Pure Data (Pd).
Experimental investigations relating to speech and singing generally require some form of quantitative measurement, and in the case of singing, such investigations often relate to the pitching of individual notes perhaps in the context of different vowel sounds. Providing repeatable quantitative control over the acoustic properties of synthesized audio outputs is a research role being met by the Vocal Tract Organ.
The Vocal Tract Organ came into being with its original conception and subsequent incarnation as a prototype musical instrument for public performance, specifically to accompany a flash-mob opera-style performance by a soprano singing ‘O mio babbino caro’ by Puccini. This was the concluding climax of the first author's after-dinner presentation at a UK Royal Academy of Engineering soirée in the presence of a member of the British Royal Family to celebrate advances in university electronic engineering research in 2013. For this performance, a special Bach chorale-style musical arrangement was created.
Since then, a number of iterations to the design and implementation of subsequent Vocal Tract Organs have occurred as practical and musical experience has been gained with it as a musical performance instrument, as a tool for encouraging youngsters into engineering, and as a research tool for investigating pitch production and perception, where naturalness of pitch variation comparable to naturally uttered human vowels is a basic requirement that is not easy to satisfy. This paper describes the basic concepts behind the Vocal Tract Organ, its various implementation iterations, and its development from the initial prototype to the versions that followed.
METHODS AND DESIGN
The initial requirement that instigated the first Vocal Tract Organ was a need to create sounds from 3-D printed vocal tracts for research and performance purposes. Howard
that are physically controlled for sound output by squeezing a bellows under an arm. This has been part of the on-going inspiration for this work in the context of the nature of a direct human interface to control the voice output as well as for encouraging more young people into STEM studies and careers (eg,
). The original background and design philosophy behind the Vocal Tract Organ was to make use of 3-D printed vocal tracts for different vowels placed atop a 16-ohm, 60-W horn driver loudspeaker (Adastra model 952.210), a loudspeaker designed to screw into the end of a horn resonator and typically employed outdoors in public address systems as well as in vans used for advertising, canvassing, and selling ice cream. This is the means by which the larynx input signal created by the Vocal Tract Organ is input at the vocal fold end of the 3-D printed vocal tracts.
There is a direct link between the design concepts of the Vocal Tract Organ and the Vox Humana stop found in some pipe organs (eg, the Wanamaker Grand Court organ in Philadelphia, USA, which has 10 ranks of Vox Humana pipes); both are set up to output natural-sounding human vowel sounds playable from a keyboard. Howard
notes that in his experience of pipe organ Vox Humana stops 'typically a vox humana stop rarely sounds anything like a human voice and tends to be rather nasal and harsh sounding.' Renowned organ builder Audsley [
; Vol.1 p574], who was the Advisor for the design of the Wanamaker organ, notes of the vox humana pipe organ stop that “even the best results that have hitherto been obtained fall far short of what is to be desired. ... Of all stops of the organ, the Vox Humana is the one to which distance lends the greatest charm.” and that “such stops when heard in their immediate neighborhood are coarse and vulgar in the extreme.”
The Vocal Tract Organ is designed for direct human control in the context of musical performance as well as for research into pitch production and perception. In this way it differs considerably from systems created for electronic vocal performance, where output naturalness is an essential feature, such as Vocaloid
is VTO-V1 in Table 1. The subsequent changes and developments across the seven versions have been made to support perception experiments, to take advantage of some or all of the benefits of more appropriate technology, and/or to improve overall performance flexibility, output bandwidth, and capability. At present, the Vocal Tract Organ is mainly a research tool, and some core aspects of its planned functionality remain in development. In addition, it doubles as a performance musical instrument in its own right.
TABLE 1Version History and Specifications of Individual Vocal Tract Organs – All Versions are Written in Pure Data (Pd)
Bela Mini: potentiometer: 1 of 10 vowels & 1-note, 1-slider (f0: gain as (d(f0)/(dt))), sampling rate: 44.1 kHz
Instruments are either ‘Monophonic’ (one note playable at a time using a slider), denoted as ‘1-note’, or ‘Polyphonic’ (more than one note playable at a time using a keyboard), denoted as ‘6-note’ or ‘8-note’. Columns ‘Source’ and ‘Vowels’ indicate which of these potential outputs are implemented. Fundamental frequency is indicated as f0.
In terms of player interaction with the instrument, two control systems are catered for, as indicated in Table 1. The first makes use of a standard MIDI (Musical Instrument Digital Interface, the internationally agreed common standard for digital musical instruments) piano-style music keyboard; its operation is polyphonic (more than one note can be played at once), as indicated in Table 1 where the number of notes is shown as greater than one (six-note or eight-note). The second makes use of a slider control to vary fundamental frequency (f0); this is indicated as joystick or slider in Table 1.
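For the polyphonic MIDI path, each depressed key's note number must be converted to a fundamental frequency. A minimal sketch of the standard equal-tempered mapping follows; the function name is illustrative and the actual Pd patch is not reproduced here:

```python
def midi_to_f0(note: int) -> float:
    """Convert a MIDI note number to fundamental frequency in Hz,
    equal temperament, with A4 = MIDI note 69 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)
```

Each of the polyphonic voices in the patch applies this mapping independently, so up to six (or eight) such frequencies sound at once.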
Experience using a joystick and relatively short standard-length sliders has demonstrated that, when the fundamental frequency range spans an octave or more, control of subtle pitch changes is limited and vibrato does not sound very natural. The most expressive slider controller for f0 has been found to be an audio mixing desk volume slider. These sliders are physically longer than average, offering more physical travel per Hz of change; they are light to the touch (easy to make subtle gestures with); and they are logarithmic (perceptually appropriate for listener perception of f0). A further advantage of using a slider for the control of f0 is that it results in a natural-sounding pitch contour, which is put down to its movement, and therefore the f0 variation, being controlled by human musculature, essentially from the player's wrist and/or elbow, somewhat akin to the control of a violin bow. These slider movements directly control the variation of f0, and that variation is logarithmic because the slider itself is designed as a volume control with a logarithmic response. Most important for the naturalness of f0 variation in practice is that the slider has to be moving for there to be any sound output, since the gain is defined as the rate of change of f0 over time, denoted in Table 1 as (gain as (d(f0)/(dt))). Thus every note onset has an f0 variation associated with it that is never the same twice and is completely under human control. With this f0 interface it is simply not possible to hold a constant f0, a vital point since no human can do this either!
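The gain-as-(d(f0)/(dt)) behavior can be sketched as follows; the sensitivity and ceiling constants are illustrative assumptions, not values taken from the instrument:

```python
def slider_gain(f0_prev: float, f0_curr: float, dt: float,
                sensitivity: float = 0.01, max_gain: float = 1.0) -> float:
    """Output gain proportional to the rate of change of f0 between
    two successive slider readings dt seconds apart: a stationary
    slider (df0/dt = 0) produces silence."""
    rate = abs(f0_curr - f0_prev) / dt  # |df0/dt| in Hz per second
    return min(max_gain, sensitivity * rate)
```

A stationary slider gives zero gain, so every sounding note carries some human-made f0 movement, which is precisely the design intent described above.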
At the heart of the first Vocal Tract Organs is the graphical audio programming language Pure Data or Pd
). The first version, used for the flash-mob opera performance (see above), ran on a laptop connected to a multichannel audio output board. It was created as a proof-of-concept instrument to enable initial experimental audio work on creating a quasi-larynx audio input for 3-D printed vocal tracts. A number of developments have taken place since then, initially aimed at removing the requirement for a laptop so that the Organ could be stand-alone. This led to the adoption of Arduino technology, which is inexpensive and can operate as a stand-alone system from a 9-volt PP3 battery (versions two to four in Table 1). However, this had two disadvantages: (a) losing the advantages of Pure Data as a versatile audio programming language and (b) operating at a lower audio sampling rate of 16.384 kHz.
It was the later availability of the Bela family of audio boards that allowed later versions of the Vocal Tract Organ (versions five to seven in Table 1) to return to a Pure Data implementation. This has resulted in a series of stand-alone versions of the Vocal Tract Organ designed to incorporate additional features and/or to be implemented for different purposes; these are described below.
The synthesis of voiced sounds requires a suitable larynx output waveshape for which the LF model
that is commonly employed in speech synthesis systems is used. All versions of the Vocal Tract Organ have a common voice source that incorporates three excitation functions: a pulse train synthesized from 60 harmonics, a sawtooth wave synthesized from 60 harmonics, or a hand-drawn LF model stored waveshape. In practice, the LF model is preferred: (a) it is already applied in existing voice synthesis systems; (b) it sounded, as one would expect, the least unnaturally ‘buzzy’ compared with either the sawtooth wave or the pulse train; and (c) it is the sound source that best models the acoustic output from the larynx during voiced sounds. In addition, it turned out that minor differences in the hand-drawing of the LF model made little or no obvious perceptual difference to the resulting vowels. As an underlying design criterion, there was no requirement to create voiceless sounds, since the Vocal Tract Organ is a musical instrument first and foremost; all outputs were therefore to be pitched (quasi-sung), which requires a periodic excitation waveform.
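The two harmonically synthesized excitation functions can be sketched as follows: a band-limited sawtooth whose harmonic amplitudes fall as 1/k, and a pulse train with equal-amplitude harmonics. The table size is illustrative, and the hand-drawn LF model waveshape is not reproduced here:

```python
import math

def sawtooth_cycle(n_samples: int = 512, n_harmonics: int = 60) -> list:
    """One cycle of a band-limited sawtooth built from 60 harmonics;
    harmonic k has amplitude 1/k."""
    return [
        sum(math.sin(2 * math.pi * k * i / n_samples) / k
            for k in range(1, n_harmonics + 1))
        for i in range(n_samples)
    ]

def pulse_train_cycle(n_samples: int = 512, n_harmonics: int = 60) -> list:
    """One cycle of a band-limited pulse train: all 60 harmonics at
    equal amplitude, normalized so the peak at t = 0 equals 1."""
    return [
        sum(math.cos(2 * math.pi * k * i / n_samples)
            for k in range(1, n_harmonics + 1)) / n_harmonics
        for i in range(n_samples)
    ]
```

As in version 2, one stored cycle like this can then be read out repeatedly at the desired fundamental period.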
For the versions of the Vocal Tract Organ designed to operate with 3-D printed vocal tracts (versions one to five), the output was the LF model excitation function, with its fundamental frequency controlled via MIDI notes from either a keyboard or another MIDI device.
Version 1 of the Vocal Tract Organ, shown in Figure 1, was set up as a prototype for the after-dinner engineering event described above. Player control is via a standard MIDI (Musical Instrument Digital Interface) piano-style keyboard, and up to six notes can be played at once (six-note polyphonic). In addition to the keys for playing individual notes, the pitch bend and modulation wheels are available to modulate the fundamental frequencies of the notes being played in the normal way associated with electronic musical instruments, as defined in the MIDI specification.
The modulation wheel adds a vibrato to the note, and the pitch bend wheel does just that: it gives the player direct control over the pitch, enabling human-controlled vibrato to be used. The MIDI output from the keyboard is sent to a laptop running Pure Data (Pd), which generates an LF model output for each note played; these are connected to six separate audio outputs driving six separate loudspeakers, each with its own 3-D printed vocal tract.
Version 2 was designed to remove the reliance on the laptop used in version 1, with which there were occasional audio glitches that were unwelcome in performance. An Arduino Mega was used, which enabled the use of the Mozzi library
with an audio sampling rate of 16.384 kHz. The system enabled a maximum of six separate audio outputs, so a six-note polyphonic instrument could be realized. It was configured with a custom-built MIDI interface, with MIDI IN and MIDI THRU connected via two NAND gates to square up their waveforms before processing, and with MIDI IN opto-isolated as advised in the MIDI specification. Since the f0 values for the upper octave had to be specified, this was done in both equal temperament and just intonation (centered on C major), selectable in the code prior to compilation. The output waveform was generated by repeating, at the appropriate fundamental period, one cycle of the LF model, which was created using harmonic synthesis in Microsoft Excel; keyboard control was provided along with an output audio amplifier to drive the tract loudspeaker. Vibrato rate and depth were adjustable via two rotary potentiometers, and each of the six channels had a different multiplier to ensure that these values differed for each note played, thereby adding to output naturalness. Six LEDs were provided for debugging purposes to indicate whether each channel was currently playing a note (Figure 2).
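The two tunings can be sketched as follows. The 5-limit just-intonation ratios shown are the common textbook set for a major scale and are an assumption, not necessarily the exact values compiled into the Arduino code:

```python
# Just-intonation frequency ratios relative to the tonic of C major
# (common 5-limit set), versus 12-tone equal temperament.
JUST_RATIOS = {
    "C": 1.0, "D": 9/8, "E": 5/4, "F": 4/3,
    "G": 3/2, "A": 5/3, "B": 15/8, "C'": 2.0,
}

def just_f0(tonic_hz: float, degree: str) -> float:
    """f0 of a scale degree in just intonation centered on the tonic."""
    return tonic_hz * JUST_RATIOS[degree]

def equal_f0(tonic_hz: float, semitones: int) -> float:
    """f0 of a note n semitones above the tonic in equal temperament."""
    return tonic_hz * 2.0 ** (semitones / 12.0)
```

For example, a just fifth (3/2) sits slightly above the equal-tempered fifth (2^(7/12) ≈ 1.4983), which is why the two tables had to be selectable separately.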
Version 3 of the Vocal Tract Organ is shown in Figure 3. It is a single-channel instrument that offers control of the fundamental frequency by means of two gaming joysticks (shown in the lower part of the figure), which interface to the Arduino Mega via the bespoke interface board visible in Figure 3. Each joystick has two axes (up/down and left/right), and these are mapped to the following parameters:
Left joystick to control vibrato with vibrato rate on the left-right axis variable from 0 Hz to 7 Hz and vibrato depth or extent on the up-down axis variable from 0 to 1.5 semitones,
Right joystick to control f0 on the left-right axis variable from 100 Hz to 600 Hz and overall volume on the up-down axis variable from 0 to the maximum available before clipping.
Note that the volume control was set up such that when the up-down axis was fully depressed in the down position, the volume was set to zero; thus it was possible to silence the output when required in performance.
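The four-axis mapping above can be sketched as a simple linear scaling of normalized axis readings; the function names and the [0, 1] axis convention are illustrative assumptions:

```python
def map_axis(value: float, out_min: float, out_max: float) -> float:
    """Linearly map a joystick axis reading in [0, 1] to a parameter range."""
    return out_min + value * (out_max - out_min)

def joystick_controls(lx: float, ly: float, rx: float, ry: float) -> dict:
    """Version-3-style mapping: left stick -> vibrato rate/depth,
    right stick -> f0 and volume; ry fully down (0) silences the output."""
    return {
        "vibrato_rate_hz": map_axis(lx, 0.0, 7.0),
        "vibrato_depth_semitones": map_axis(ly, 0.0, 1.5),
        "f0_hz": map_axis(rx, 100.0, 600.0),
        "volume": map_axis(ry, 0.0, 1.0),
    }
```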
Version 4 takes a major step following the realization, from observing and chatting with users of the version 3 device, that all the control provided by the two joysticks with their four degrees of freedom could be taken over by a single degree-of-freedom controller for f0 alone. Vibrato is variation of f0 controlled by human musculature, and a slider potentiometer is of course also controlled by human musculature, albeit by different muscle groups; vibrato for sung pitches is simply f0 variation. The key step here is controlling the overall output amplitude as the rate of change of f0. Thus, making any sound at all requires variation of the slider position, and continuing that sound requires continuous movement of the slider; the corollary being that no human singer can typically sustain a note without some f0 modulation controlled by human musculature. This modification, which leaves a much simpler interface than that of version 3, is highly effective and natural to use and to listen to in practice.
For the version 5 instrument, completely different hardware is employed to improve the overall audio quality of the output sound by raising the audio sampling rate from 16.384 kHz (the maximum practically attainable with the Arduino Mega, as indicated above) to the common audio standard of 44.1 kHz, thereby providing an audio bandwidth that covers the full 20 Hz to 20 kHz range of human hearing. The heart of the system is the Bela Cape and BeagleBone Black system
Incorporated within the software for version 5 are a parallel-connected four-formant synthesizer and a voice source output based on the LF model; these outputs can be toggled (via the switch in the center) between a small internal loudspeaker and a Vocal Tract Organ loudspeaker connected to the two screw terminals on the left-hand side of the unit. The switch between the two screw terminals enables the user to select whether the source or the vowel outputs are available at the terminals for the external loudspeaker. The two short sliders control the overall volumes of the source and vowel outputs respectively. The long slider at the lower left of the unit controls the f0 of the source output, with the rate of change of f0 controlling its amplitude in the same manner as with version 4 (Figure 4), for use with a 3-D printed tract connected to the two terminals. The USB socket (at the top, with the black wire connected) is the MIDI input for a keyboard, and the unit is powered from a 9 V source (an external USB battery power pack is shown in the figure).
The 12-button keypad enables the user to select (Figure 5):
one of six 4-formant solo vowels from those in: ‘bead,’ ‘bed,’ ‘bad,’ ‘Bard,’ ‘booed,’ ‘bird’ that is electronically synthesized and output to either the internal or external loudspeaker terminals,
one of four vowels for the chords played on the MIDI keyboard from those in: ‘bead,’ ‘bed,’ ‘bird,’ ‘Bard,’ and
the f0 range for the long f0 slider from adult male (80 Hz to 440 Hz) to adult female (160 Hz to 880 Hz).
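The selectable f0 ranges can be sketched as follows. Mapping the slider position onto the range logarithmically matches the log-taper slider described earlier, though the exact mapping used in the instrument is not specified here and is an assumption:

```python
# f0 slider ranges selectable from the keypad (from the version 5 spec).
F0_RANGES = {
    "adult_male": (80.0, 440.0),
    "adult_female": (160.0, 880.0),
}

def slider_to_f0(position: float, voice: str) -> float:
    """Map a normalized slider position in [0, 1] onto the selected f0
    range logarithmically, giving perceptually even pitch spacing."""
    lo, hi = F0_RANGES[voice]
    return lo * (hi / lo) ** position
```

With this mapping the slider midpoint lands on the geometric mean of the range, so equal physical travel corresponds to an equal musical interval everywhere along the slider.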
Version 6 is configured as a pipe organ would be: playable from the keyboard only and having eight stops controlled by standard, specially engraved pipe organ tab stops, as shown in Figure 6. When no stop is selected (depressed), there is no sound output. Each stop controls the output to one 3-D printed vocal tract sitting atop a loudspeaker, as can be seen in Figure 7. In this case, the tracts are from Arai
, which is about a third of the size of the BeagleBone Black board, is the latest manifestation of the Vocal Tract Organ. It takes advantage of the ability to program these systems by modifying the code of other versions easily and directly within the Pure Data environment, which facilitates rapid prototyping of audio systems.
Version 7 is designed along the lines of the version 4 Arduino-based instrument, but here the vowel is synthesized electronically with four formants as a single vowel, in the same manner as with the version 5 instrument. The idea is that this is a small-footprint instrument for performance and research. The f0 is varied with the same long, light-feel slider employed in versions 4 and 5, with the output amplitude controlled by the rate of change of f0. The small potentiometer (to the left of the battery in the figure) is rotated to select the synthesized vowel from a set of 11 vowels, as in: ‘heed,’ ‘hid,’ ‘head,’ ‘hard,’ ‘hut,’ ‘had,’ ‘hot,’ ‘hoot,’ ‘hut,’ ‘hurt,’ and schwa (eg, the last vowel in ‘banana’).
Speaking voice synthesis is now a commonplace tool that we are becoming increasingly used to in everyday life. In terms of singing synthesis, the singing system ‘Vocaloid’ has been around for a while and has been met with public acclaim, particularly in Japan where there are a number of animated character ‘vocaloids’ associated with the sung output which have become a phenomenon.
Such systems enable electronic singing performance with subtle degrees of control over pitch and sound quality, much of which is carried out in practice by the system itself. More recent advances in the perceived naturalness of singing synthesizers are being achieved through the use of artificial intelligence (eg,
However, unlike the Vocal Tract Organ, such systems cannot be configured as real-time performance systems where the user has direct control over one or more parameters within the sung output. What seems basic to success here is the provision of controls that can take an input from human musculature, such as a finger, wrist, or elbow, and use it to control directly one acoustic aspect of the synthesis. The Vocal Tract Organ has been designed to enable this, specifically through the use of the long slider for f0, which gives the user, especially a musical user who is a singer, direct control over the variation of f0 as they would have over their own vocal folds. Thus subtleties relating to the control of f0 in performance can be captured directly and communicated to the listener(s) in a manner that is perceived as being natural. The realm of tuning systems is also very relevant here. An experienced a cappella (unaccompanied) singer who listens very carefully to their own tuning will tend to use just intonation, which minimizes beats between harmonics of different notes, and the slider controller is ideal for this (eg,
). However, just intonation, which is much more musically and emotionally ‘settled’ compared to today's equal tempered tuning where all semitones have the same musical interval, is something that is aimed for in a future version of the Vocal Tract Organ, with or without pitch drift
In addition, the use of 3-D printed vocal tracts from singers adds another natural aspect to the synthesis, although it is acknowledged that, until there is a suitable flexible 3-D printing material and provision to manipulate the tract as appropriate to singing performance, there is a considerable way to go before something approaching a complete virtual singer acceptable to all listeners is achieved (Figure 8).
The Vocal Tract Organ has undergone a number of developments since its inception as an after-dinner showpiece, and these have been described in detail. These developments have provided users with gestural interfaces that require controlled movement of their own musculature to vary the fundamental frequency in a natural manner, with loudness rendered as the rate of change of f0. This enhances the perceived naturalness of the output sound, since some muscular movement is required to create any output at all. As a research tool the Vocal Tract Organ is being used to explore pitching in the context of f0 variation as well as the effect of vowel on pitch perception. The next step is to implement a just intonation tuning option to enable it to be used to help a cappella choral singers achieve more musically consonant tuning.
This work was supported by an International Exchange Grant from the Royal Society in London to whom the author is both thankful and indebted. The author would like to thank members of the technical staff in the Department of Electronic Engineering at Royal Holloway, specifically Alex Clarke for printing and final preparation of the 3-D printed vocal tracts and Dave Chapman for electronic circuit design and build for Vocal Tract Organ versions 4-7 inclusive.
The vocal tract organ.
In: Innovation in Music 2013 - KES Transactions on Innovations in Music Conference. Vol 1. Shoreham-by-Sea: Future Technology Press; 2014:7-19.