Joint Fundamental Frequency Estimation and Voice Activity Detection of Speech with the Magnitude and Phase Spectrogram

This is the website for a paper that is currently under review. A notebook for reproducing all data used here and bytecode for the proposed algorithm are available on GitHub. As soon as the paper has passed review, source code for MaPS will be made available as well.

The fundamental frequency of the human voice is an important feature for various speech processing applications such as speech enhancement, noise reduction, and speech compression algorithms. In speech recordings, only voiced frames have a valid fundamental frequency. A common problem in fundamental frequency estimation is voicing false positives, which label noisy or ambiguous frames as voiced even though no valid fundamental frequency can be estimated. This paper presents a combined fundamental frequency estimation and voice activity detection algorithm that probabilistically combines features from the magnitude spectrum and the phase spectrum and derives a per-frequency voice probability measure that avoids ambiguous estimates and false positives. The algorithm thus classifies fewer frames as voiced, but the remaining estimates remain reliable, even at high levels of noise. These characteristics are examined with synthetic tone complexes and a large corpus of speech and noise recordings.

Code

Source code for MaPS in Python, Matlab, and Julia can be downloaded by cloning this repository:

git clone https://github.com/bastibe/maps_reproducible.git

Currently, the repository only contains bytecode. Source code examples will be made available here as soon as the paper has passed review.

The code is released under the terms of the GNU LGPL 3 license, © 2018, Bastian Bechtold, Jade Hochschule. This means that the source code can be read, used, and modified freely, but our authorship of the code must be recognized, and any source distributed with our code must be licensed under an LGPL-compatible license. Additionally, we kindly request feedback on how the code is used.

Methods

MaPS consists of two complementary features: one correlates a comb-like template with the magnitude spectrum, the other compares a sawtooth-like template with a quantity derived from the phase spectrum. The results are combined into a probabilistic measure that serves both as a voicing detector and as a fundamental frequency estimator.

This combination resolves both the octave ambiguities in the magnitude spectrum and the loudness ambiguities in the phase spectrum, and results in a robust and precise pitch confidence that excludes not only unlikely pitches, but ambiguous estimates as well.

Voice in the Magnitude Spectrum

In the magnitude spectrum, speech forms a comb pattern with comb teeth at the fundamental frequency and its harmonics. MaPS correlates a number of comb templates at different fundamental frequencies with the signal magnitude spectrum:


At about 115 Hz, all the peaks in the template match up with the peaks in the spectrum, and the correlation reaches its maximum. This point corresponds to the true fundamental frequency of this spectrum.

However, any magnitude-spectrum-based measure is susceptible to octave errors, since a comb-like template correlates not only with the fundamental frequency, but also with its higher harmonics. MaPS mitigates this by introducing negative valleys between the positive comb teeth, but some ambiguity remains.
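
To make the idea concrete, the following Python sketch correlates simple cosine combs (positive teeth at the harmonics of each candidate frequency, negative valleys halfway between them) with the magnitude spectrum of a synthetic 115 Hz tone complex. The template shape, candidate grid, and FFT parameters are assumptions for this illustration, not the exact definitions used in MaPS.

# A minimal sketch of the magnitude feature, not the MaPS implementation.
import numpy as np

def comb_template(f0, freqs, max_freq=5000.0):
    """Comb with positive teeth at the harmonics of f0 and negative
    valleys between them, restricted to the range of speech harmonics."""
    template = np.cos(2 * np.pi * freqs / f0)
    template[freqs > max_freq] = 0.0
    return template

def magnitude_feature(magnitude, freqs, f0_candidates):
    """Normalized correlation of each comb template with the magnitude
    spectrum; the maximum marks the most likely fundamental frequency."""
    magnitude = magnitude / (np.linalg.norm(magnitude) + 1e-12)
    scores = np.empty(len(f0_candidates))
    for idx, f0 in enumerate(f0_candidates):
        template = comb_template(f0, freqs)
        scores[idx] = magnitude @ template / (np.linalg.norm(template) + 1e-12)
    return scores

# Example: one frame of a synthetic tone complex with f0 = 115 Hz
fs, n_fft = 48000, 2048
t = np.arange(n_fft) / fs
frame = sum(np.sin(2 * np.pi * 115 * k * t) / k for k in range(1, 11))
magnitude = np.abs(np.fft.rfft(np.hanning(n_fft) * frame))
freqs = np.fft.rfftfreq(n_fft, 1 / fs)
candidates = np.arange(60, 400, 1.0)  # typical speech f0 range in Hz
scores = magnitude_feature(magnitude, freqs, candidates)
print('best candidate:', candidates[np.argmax(scores)], 'Hz')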

Voice in the Phase Spectrum

For the phase spectrum feature, MaPS uses the instantaneous frequency deviation, which is the difference between the instantaneous frequency spectrum and the frequency f, or IF(f) − f. The instantaneous frequency is the time derivative of the phase spectrum. The instantaneous frequency deviation for the same speech signal as discussed in the introduction looks like this:

Thus, speech forms a sawtooth pattern with zeros at the fundamental frequency and its harmonics. MaPS compares a number of sawtooth templates at different fundamental frequencies with the instantaneous frequency deviation of the signal spectrum:


At about 115 Hz, all the zeros in the template match up with the zeros in the spectrum, and the difference reaches its minimum. This frequency corresponds to the fundamental frequency for this spectrum.

However, any phase-spectrum-based measure cannot distinguish quiet parts of the signal from salient speech, since even very quiet tones can have distinctive phase spectra.
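
The sketch below illustrates the phase feature under similarly assumed details: the instantaneous frequency is estimated from the phase difference of two spectra computed one sample apart, the sawtooth template is the signed distance of each frequency bin to its nearest harmonic of the candidate frequency, and the comparison is weighted by the magnitude spectrum. These discretization and weighting choices are ours, not necessarily those of MaPS.

# A minimal sketch of the phase feature, not the MaPS implementation.
import numpy as np

def if_deviation(signal, fs, n_fft=2048):
    """IF(f) - f, with the instantaneous frequency estimated from the
    phase advance over one sample."""
    window = np.hanning(n_fft)
    spec1 = np.fft.rfft(window * signal[:n_fft])
    spec2 = np.fft.rfft(window * signal[1:n_fft + 1])
    phase_advance = np.angle(spec2 * np.conj(spec1))  # radians per sample
    inst_freq = phase_advance * fs / (2 * np.pi)      # in Hz
    freqs = np.fft.rfftfreq(n_fft, 1 / fs)
    return inst_freq - freqs, freqs, np.abs(spec1)

def sawtooth_template(f0, freqs):
    """Expected IF deviation of a harmonic complex with fundamental f0:
    signed distance of each bin to its nearest harmonic of f0."""
    return f0 * np.round(freqs / f0) - freqs

def phase_feature(ifdev, freqs, f0_candidates, weights):
    """Magnitude-weighted mean absolute difference between the measured
    IF deviation and each sawtooth template; the minimum marks the f0."""
    scores = np.empty(len(f0_candidates))
    for idx, f0 in enumerate(f0_candidates):
        diff = np.abs(ifdev - sawtooth_template(f0, freqs))
        scores[idx] = np.average(diff, weights=weights)
    return scores

# Example: the same synthetic 115 Hz tone complex as above
fs, n_fft = 48000, 2048
t = np.arange(n_fft + 1) / fs
signal = sum(np.sin(2 * np.pi * 115 * k * t) / k for k in range(1, 11))
ifdev, freqs, magnitude = if_deviation(signal, fs, n_fft)
candidates = np.arange(60, 400, 1.0)
scores = phase_feature(ifdev, freqs, candidates, weights=magnitude)
print('best candidate:', candidates[np.argmin(scores)], 'Hz')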

Combination of Features

MaPS combines the magnitude feature and the phase feature in a Bayesian maximum a posteriori fundamental frequency estimator and voicing detector, which we call the pitch confidence. Since the error modes of the two features never overlap, the pitch confidence can reduce their ambiguities and produce a highly reliable and precise measure for fundamental frequency estimation.
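
A schematic sketch of such a combination is shown below. The likelihood mappings, the flat prior over candidates, and the voicing threshold are placeholders chosen for illustration; the actual probabilistic model of MaPS is defined in the paper.

# A schematic sketch of the combination step, not the model from the paper.
import numpy as np

def pitch_confidence(mag_scores, phase_scores, f0_candidates, threshold=3.0):
    """Combine both feature score vectors into a posterior over the f0
    candidates and derive a voicing decision from how strongly the
    posterior is peaked compared with a flat posterior."""
    mag_like = np.clip(mag_scores, 1e-12, None)        # higher is better
    phase_like = np.exp(-phase_scores / (np.median(phase_scores) + 1e-12))  # lower is better
    posterior = mag_like * phase_like                  # flat prior over candidates
    posterior /= posterior.sum()
    best = int(np.argmax(posterior))
    peakedness = posterior[best] * len(f0_candidates)  # 1.0 means a completely flat posterior
    return f0_candidates[best], posterior[best], peakedness > threshold

# Example, using the magnitude and phase score vectors from the sketches above:
# f0, confidence, voiced = pitch_confidence(magnitude_scores, phase_scores, candidates)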

The results of this estimation on a few speech signals from the PTDB-TUG database and noises from the QUT-NOISE corpus can be seen in the following figure:

This figure shows how MaPS accurately recognizes the fundamental frequency track in noisy speech recordings. In general, MaPS prefers to reject ambiguous frames over giving uncertain estimates, and thus accepts some false negatives rather than producing too many false positives. For many applications, this is advantageous, since it is often more important to obtain reliable results than to get plentiful estimates.

Evaluation

The fundamental frequency estimation performance of MaPS was evaluated using a large database of f₀-annotated speech recordings from the PTDB-TUG corpus, and acoustic noise recordings from the QUT-NOISE corpus.

Additionally, the same samples were processed with the well-known fundamental frequency estimation algorithms PEFAC, RAPT, and YIN. The error measures for this evaluation are:

Gross Pitch Error
Percentage of frames that are correctly classified as voiced, but whose pitch estimate deviates by more than 20 % from the true pitch.
Fine Pitch Error
Mean error of the pitch estimates that deviate by less than 20 % from the true pitch.
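
For reference, the two measures could be computed along the following lines, assuming per-frame arrays of true and estimated fundamental frequencies in Hz, with zero marking unvoiced frames; the exact conventions of the evaluation are given in the paper.

# A sketch of the two error measures under the assumptions stated above.
import numpy as np

def pitch_errors(f0_true, f0_est, threshold=0.2):
    """Gross pitch error (percentage of voiced frames with estimates off by
    more than 20 %) and fine pitch error (mean relative error of the rest)."""
    voiced = (f0_true > 0) & (f0_est > 0)  # frames voiced in both reference and estimate
    rel_error = np.abs(f0_est[voiced] - f0_true[voiced]) / f0_true[voiced]
    gross = rel_error > threshold
    gpe = 100 * gross.sum() / max(voiced.sum(), 1)
    fpe = 100 * rel_error[~gross].mean() if (~gross).any() else 0.0
    return gpe, fpe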

The high precision of MaPS is in part due to its comparatively conservative voicing detector, which refuses to guess for ambiguous fundamentals, for example during phoneme transitions or noisy fricatives. The pitch confidence is thus a true probability of being correct, and not just a maximum likelihood measure.

For further information, a full definition of the algorithm, and a more thorough evaluation, please read the paper at (link will be inserted here once published).