Abstract

A simple acoustic model of the voice begins with an airstream from the lungs. This stream undergoes periodic modulation by the vocal folds before being filtered through the vocal tract, which functions as a variable resonator. This process is represented by the source-filter model, illustrated below.

```mermaid
graph LR
    A[Glottal Excitation] --> C(Sound Source)
    B[Frictional Excitation] --> C
    C --> D[Vocal Tract Filter]
    D --> E(Speech Output)

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px
    style D fill:#fbb,stroke:#333,stroke-width:2px
    style E fill:#ff9,stroke:#333,stroke-width:2px
```

It's also instructive to relate the resonant system (the vocal tract) to tongue position using the vowel quadrilateral below. Front vowels, produced with the tongue body forward in the mouth, are characterized by a wide frequency gap between F1 and F2. Back vowels, produced with the tongue body retracted, have F1 and F2 much closer together.
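
For a concrete sense of scale (these figures are from classic American English formant measurements, not from the assignment's audio): the front vowel [i] sits near F1 ≈ 270 Hz and F2 ≈ 2290 Hz, a gap of roughly 2000 Hz, while the back vowel [ɑ] sits near F1 ≈ 730 Hz and F2 ≈ 1090 Hz, a gap of only a few hundred hertz.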

Vowel quadrilateral by Truax [4]

Formant frequencies of vowels by Goldstein [5]

Linguistic representation of vowels and diphthongs by Truax [4]

MR image of heed by Goldstein [5]

For the assignment, I'll focus on the glottal excitation of voiced speech by synthesizing only pure vowels (monophthongs), since they have a single, fixed tongue position (see the MR image above). This is done by generating periodic impulse trains and passing them through an all-pole filter to create a set of resonant frequencies known as formants. Each vowel's formant structure will then be estimated from real speech samples using linear predictive coding (LPC). Finally, I'll analyze the quality of the synthesized speech and identify potential improvements.
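
As a rough illustration of that pipeline, here is a minimal synthesis sketch. The formant frequencies and bandwidths below are placeholder values for [ɑ] (the report estimates the real ones with LPC later), and the variable names are mine rather than from the linked repository:

```python
import numpy as np
from scipy.signal import lfilter

fs = 24000          # sample rate (Hz), matching the hod recordings
f0 = 90             # fundamental frequency (Hz); hypothetical male value
dur = 0.5           # seconds of output

# Glottal excitation: a periodic impulse train at (approximately) F0.
# int() rounds the period, so the realized F0 is slightly off 90 Hz.
n = np.arange(int(dur * fs))
period = int(fs / f0)
excitation = (n % period == 0).astype(float)

# Vocal tract: an all-pole filter with one resonance (pole pair) per formant.
# (frequency Hz, bandwidth Hz) pairs are illustrative values for [a],
# not numbers estimated from the assignment's audio.
formants = [(730, 80), (1090, 90), (2440, 120)]
a = np.array([1.0])
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)                  # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs                 # pole angle from formant freq
    a = np.convolve(a, [1, -2 * r * np.cos(theta), r**2])

vowel = lfilter([1.0], a, excitation)             # filter the impulse train
```

Each (frequency, bandwidth) pair maps to a complex-conjugate pole pair inside the unit circle, and convolving the second-order factors yields the full all-pole denominator.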

I'll be referencing the code I've written throughout this report. You can find all the code for the assignment on my GitHub.

Preliminary Analyses

First, we need to select a quasi-stationary segment of about 100 ms from each vowel for the LPC estimation. Inspecting the male and female hod audio samples, we can approximate this by taking the midpoint of each waveform.

Playing hod_m.wav at 24000 Hz with duration 0.22 seconds and 5380 samples [1]

Playing hod_f.wav at 24000 Hz with duration 0.24 seconds and 5763 samples [1]

For the quasi-stationary segments, I've chosen the middle of each signal, which is a relatively safe choice given the waveforms shown above: around the midpoint, the amplitude and the period of the waveform stay roughly constant.
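
A minimal sketch of this extraction step (the helper name and its details are mine, not taken from the repository):

```python
import numpy as np
from scipy.io import wavfile

def middle_segment(path, seg_dur=0.1):
    """Return a quasi-stationary segment of seg_dur seconds taken from
    the midpoint of a wav file."""
    fs, x = wavfile.read(path)
    x = x.astype(float)
    seg_len = int(seg_dur * fs)
    start = (len(x) - seg_len) // 2      # center the window on the midpoint
    return fs, x[start:start + seg_len]

fs, seg = middle_segment("hod_m.wav")    # at 24000 Hz, a 2400-sample segment
```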

Estimating F₀ by Inspection

I've selected the open back unrounded vowel [ɑ] from the hod audio samples. Visual inspection of the power spectra above shows a fundamental-frequency peak at approximately 90 Hz for the male speaker and 180 Hz for the female speaker. Note that I've purposely limited the x-axis to 0-300 Hz, since this range typically encompasses the fundamental frequencies of both male and female speakers, allowing clearer visualization of the F₀ peaks. These values align with the typical ranges of human speech: 90 to 155 Hz for adult males and 165 to 255 Hz for adult females.
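
The same estimate can be read off programmatically. This sketch reuses `seg` and `fs` from the extraction snippet above; the Hann window and the 50 Hz lower bound (to skip the DC component) are my assumptions:

```python
import numpy as np

# Power spectrum of the 100 ms segment; search for the F0 peak below 300 Hz.
spectrum = np.abs(np.fft.rfft(seg * np.hanning(len(seg)))) ** 2
freqs = np.fft.rfftfreq(len(seg), d=1 / fs)

band = (freqs > 50) & (freqs < 300)      # plausible F0 range for adult speech
f0_est = freqs[band][np.argmax(spectrum[band])]
print(f"Estimated F0: {f0_est:.0f} Hz")
```

With a 100 ms segment the FFT bins are 10 Hz apart, so the peak location is only accurate to about ±5 Hz, which is sufficient for this by-inspection estimate.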

100ms quasi-stationary signal of hod_m.wav [1]

100ms quasi-stationary signal of hod_f.wav [1]

Estimating Formants by Inspection

Recall that the frequency resolution and time resolution of a spectrogram are determined by the window size $N$ and the sampling rate $f_s$.

$$ f_{res} = \frac{f_s}{N} $$

$$ t_{res} = \frac{N}{f_s} $$
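
For example, at the recordings' 24000 Hz sample rate, a hypothetical window of $N = 512$ samples gives $f_{res} = 24000/512 \approx 46.9$ Hz and $t_{res} = 512/24000 \approx 21.3$ ms; doubling $N$ halves $f_{res}$ but doubles $t_{res}$, so the window size trades frequency detail against time detail.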