Fundamental Frequency Ground Truth for Speech Corpora from Multi-Algorithm Consensus

Companion website for a manuscript submitted to INTERSPEECH 2019

The fundamental frequency of the human voice is an essential feature for various speech processing tasks such as speech recognition, speaker identification, and speech compression. Consequently, a large number of fundamental frequency estimation algorithms have been developed. To evaluate the performance of these algorithms, their estimates are often compared against a known ground truth fundamental frequency, typically derived from laryngograph recordings. However, laryngograph recordings are not available for all kinds of speech corpora, and they can indicate tonality where the acoustic speech signal carries none. Alternatively, fundamental frequency estimates of speech in noise are compared against a reference algorithm's estimates of the corresponding clean speech. While this works for arbitrary speech recordings, it is highly dependent on the choice of reference algorithm. We therefore propose a new method for deriving a fundamental frequency ground truth from the consensus of a number of state-of-the-art fundamental frequency estimation algorithms. This consensus can be calculated for any speech corpus, is more robust than any single algorithm's estimate, and better reflects the acoustic tonality of the speech.
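
As a rough illustration of the general idea (not the specific consensus rule used in the manuscript), the sketch below marks a frame as voiced when a majority of algorithms report a pitch there, and takes the median of the reported values as the consensus frequency. The function name and the majority/median rule are assumptions for this example.

```python
import warnings

import numpy as np

def consensus_f0(estimates, min_votes=None):
    """Derive a consensus F0 track from several algorithms' estimates.

    estimates: array of shape (n_algorithms, n_frames), with NaN
    marking unvoiced frames. Returns one F0 value per frame, or NaN
    where too few algorithms agree that the frame is voiced.
    """
    estimates = np.asarray(estimates, dtype=float)
    if min_votes is None:
        min_votes = estimates.shape[0] // 2 + 1  # simple majority

    voiced_votes = np.sum(~np.isnan(estimates), axis=0)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)  # all-NaN frames
        median_f0 = np.nanmedian(estimates, axis=0)
    return np.where(voiced_votes >= min_votes, median_f0, np.nan)

# Three algorithms, four frames (NaN = unvoiced):
tracks = [[120.0, 121.0, np.nan, np.nan],
          [119.0, 122.0, 118.0, np.nan],
          [121.0, np.nan, np.nan, np.nan]]
print(consensus_f0(tracks))  # [120.  121.5    nan    nan]
```

Using the median rather than the mean keeps a single outlier estimate from dragging the consensus away from the values the other algorithms agree on.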

This website contains the new consensus ground truth data as structured JBOF datasets, as well as scripts for importing the original datasets as JBOF datasets, and the imported datasets themselves, for the following five speech corpora: FDA [1, 2], KEELE [3], PTDB-TUG [4], MOCHA-TIMIT [5], and TIMIT [6].

The import scripts are: (1) a shell script for downloading each corpus from its original website, and (2) a Python script for reading the downloaded data and importing it into a JBOF dataset. JBOF datasets are a simple file/directory structure for binary data and metadata that is easily accessible both from a file explorer and from a programming language.
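
Because the layout is plain files and directories, such a dataset can be read without any special tooling. The following sketch walks a JBOF-style dataset using only the standard library and NumPy; the layout it assumes (one sub-directory per item, containing a JSON metadata file and `.npy` arrays) and the array name `pitch` in the usage example are assumptions for illustration, with the JBOF source code linked below as the authoritative reference.

```python
import json
from pathlib import Path

import numpy as np

def read_jbof_style_dataset(root):
    """Iterate over a JBOF-style dataset: one sub-directory per item,
    each holding JSON metadata and .npy binary arrays (assumed layout).
    Yields (item name, metadata dict, dict of arrays) tuples.
    """
    for item_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        metadata = {}
        for meta_file in item_dir.glob('*.json'):
            metadata.update(json.loads(meta_file.read_text()))
        arrays = {f.stem: np.load(f) for f in item_dir.glob('*.npy')}
        yield item_dir.name, metadata, arrays

# Hypothetical usage, assuming an imported corpus in ./KEELE with an
# array named 'pitch' holding the consensus track for each item:
for name, metadata, arrays in read_jbof_style_dataset('KEELE'):
    print(name, metadata, arrays.get('pitch'))
```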

Source code for reading JBOF datasets in Python or Matlab is provided under a free license on GitHub.

No complete dataset can be provided for the TIMIT corpus, since its license does not allow redistribution. The full datasets are currently hosted on an ownCloud instance, with the password "consensus". A better hosting solution is actively being worked on.

References:

  1. P. C. Bagshaw, S. Hiller, and M. A. Jack, “Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching,” in EUROSPEECH, 1993.
  2. P. C. Bagshaw, “Automatic prosodic analysis for computer aided pronunciation teaching,” Ph.D. dissertation, University of Edinburgh, Edinburgh, UK, 1994.
  3. F. Plante, G. F. Meyer, and W. A. Ainsworth, “A Pitch Extraction Reference Database,” in Fourth European Conference on Speech Communication and Technology, Madrid, Spain, 1995, pp. 837–840.
  4. G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, “A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario,” in INTERSPEECH, 2011, pp. 1509–1512.
  5. A. Wrench, “MOCHA MultiCHannel Articulatory database: English,” Nov. 1999. [Online]. Available: http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html
  6. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” 1993. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93S1