11-756 / 18799D Design and Implementation of ASR Systems

11-756/18799D ASR: Assignment 1, Data Capture and Feature Computation

This homework consists of two parts. In the first you will write a program to capture speech and endpoint it. In the second you will compute features from the captured data.

Part 1

Write a program to capture speech data. It must include the following:

Actual speech capture. The captured speech signals must have 16-bit resolution and be captured at a sampling rate of 16000 samples per second. If you're one of the unfortunates stuck with working on a Mac, you may use a sampling rate of 44100 instead.
Endpointing, with hit-to-talk. Recording must begin at a keyboard hit, and stop automatically when end of speech is detected. You may use one of the endpointing schemes mentioned in Lecture 2 to find the trailing endpoint, or any other method you may come up with.
The endpointed segment must be written to file in MS WAV or raw PCM format.
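
The endpointing and file-writing requirements above can be sketched as follows. This is a minimal illustration, not one of the schemes from Lecture 2: it assumes a simple energy threshold over a noise floor estimated from the first frame after the key press. The names `find_trailing_endpoint` and `write_wav`, the 20 ms frame, the 15 dB threshold and the 400 ms hangover are all hypothetical choices. It works on an already captured buffer of 16-bit samples; in a real program the same logic would run frame by frame as audio arrives from the capture library.

```python
import math
import struct
import wave

def frame_energy_db(frame):
    """Log energy (dB) of one frame of 16-bit integer samples."""
    energy = sum(s * s for s in frame) / max(len(frame), 1)
    return 10.0 * math.log10(energy + 1.0)  # +1 avoids log(0) on digital silence

def find_trailing_endpoint(samples, rate=16000, frame_ms=20,
                           threshold_db=15.0, hangover_ms=400):
    """Return the sample index at which speech is judged to have ended.

    Assumption: the first frame after the key press is background noise and
    sets the noise floor. Speech is declared over once hangover_ms worth of
    consecutive frames stay below floor + threshold_db after speech was seen.
    """
    frame_len = rate * frame_ms // 1000
    hangover_frames = hangover_ms // frame_ms
    floor_db = None
    seen_speech = False
    quiet_run = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        level = frame_energy_db(samples[start:start + frame_len])
        if floor_db is None:
            floor_db = level          # first frame defines the noise floor
            continue
        if level > floor_db + threshold_db:
            seen_speech = True
            quiet_run = 0
        elif seen_speech:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                # Index just past the last frame that was above threshold.
                return start + frame_len - quiet_run * frame_len
    return len(samples)               # no endpoint found: keep everything

def write_wav(path, samples, rate=16000):
    """Write mono 16-bit PCM samples to a WAV file (or file-like object)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)           # 2 bytes = 16-bit resolution
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

A typical use would be `end = find_trailing_endpoint(buffer)` followed by `write_wav("digit.wav", buffer[:end])`, where `buffer` and the file name are placeholders. A fixed threshold over a one-frame noise estimate is fragile in noisy rooms, so an adaptive scheme may work better in practice.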

Suggestion: You can use PortAudio for the audio capture. PortAudio is a well-established cross-platform audio capture package.

Part 2

Write a routine for computing MFCCs from audio.

Record multiple instances of digits
- Zero, One, Two etc.
- 16 kHz sampling, 16-bit PCM
- Compute log spectra and cepstra
  - Use 40 Mel spectral filters. They must cover the frequencies between 50 Hz and 7000 Hz (you may use a different setting if you choose).
  - No. of features = 13 for cepstra (use first 13 DCT coefficients)
- Visualize both spectrographically (easy using MATLAB)
  - Note similarity in different instances of the same word
- Modify number of filters to 30 and 25 (over the same frequency range).
  - Patterns will remain, but be more blurry
- Record data with noise
  - Degradation due to noise may be less severe on 25-filter outputs
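
The filterbank and cepstrum steps in the list above can be sketched as follows. This is only an illustration under stated assumptions: the mel mapping `2595 * log10(1 + f / 700)` is one common convention (the assignment does not fix one), the 512-point FFT size is a hypothetical choice, and the function names are made up for this example. Framing, windowing and the FFT itself are assumed to have been done already, so each frame arrives as a power spectrum with `fft_size // 2 + 1` bins.

```python
import math

def hz_to_mel(hz):
    # One common mel-scale formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters=40, fft_size=512, rate=16000,
                   low_hz=50.0, high_hz=7000.0):
    """Triangular filters whose edges are equally spaced on the mel scale.

    Returns num_filters rows, each holding fft_size // 2 + 1 weights.
    """
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    # num_filters + 2 edge points: filter m spans edges m-1, m, m+1.
    edges_hz = [mel_to_hz(low_mel + (high_mel - low_mel) * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * rate / fft_size for k in range(fft_size // 2 + 1)]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= center:
                row.append((f - left) / (center - left))    # rising slope
            elif center < f < right:
                row.append((right - f) / (right - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_mel_spectrum(power_spectrum, bank, floor=1e-10):
    """Log of the filterbank energies for one frame."""
    return [math.log(max(sum(w * p for w, p in zip(row, power_spectrum)), floor))
            for row in bank]

def dct2(values, num_coeffs=13):
    """Unnormalized DCT-II, keeping only the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]
```

For one frame, `dct2(log_mel_spectrum(frame_power, mel_filterbank()))` would give 13 cepstral coefficients, where `frame_power` is a placeholder for the frame's power spectrum. Calling `mel_filterbank(num_filters=30)` or `mel_filterbank(num_filters=25)` covers the same frequency range with fewer, wider filters.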

Some suggestions

You are allowed to use code from the web.

The "wav2feat" code in CMU Sphinx is good.
Dan Ellis has nice MATLAB code on his website.

However, we recommend writing your own code if you can.

Regardless of what you use, the feature computation code must be integrated with the audio capture routine.

Assume a keyboard hit (kbhit) starts the recording. The recording stops via automatic endpointing.

How to visualize the spectrogram represented by cepstra

The Mel-log spectrum can be directly visualized as a matrix.

The cepstrum, however, is a dimensionality-reduced and transformed version of the log spectrum, and is not visually meaningful. The truncated cepstrum can nevertheless be converted back to a log spectrum by zero-padding it to 64 or 128 points and computing an inverse DCT (if you used a DCT to derive cepstra from log spectra). The IDCT-derived log spectrum is what the cepstrum really represents.
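
The zero-padding and inverse DCT described above can be sketched as follows. The sketch assumes the cepstra came from an unnormalized DCT-II; if your forward transform uses a different normalization, the inverse must match it. The function names are hypothetical.

```python
import math

def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]

def idct2(coeffs, out_len=64):
    """Zero-pad coeffs to out_len and apply the matching inverse DCT.

    When out_len equals the length of the original log spectrum and no
    coefficients were dropped, this inverts dct2 exactly. With a different
    out_len the result is a smoothed, resampled log spectrum scaled by
    (original length / out_len), which leaves the visual pattern unchanged.
    """
    if out_len < len(coeffs):
        raise ValueError("out_len must be at least len(coeffs)")
    padded = list(coeffs) + [0.0] * (out_len - len(coeffs))
    return [(padded[0] + 2.0 * sum(padded[k] * math.cos(math.pi * k * (i + 0.5) / out_len)
                                   for k in range(1, out_len))) / out_len
            for i in range(out_len)]
```

Applying `idct2` to each frame's 13 cepstra and stacking the results as columns gives a matrix that can be displayed in the same way as the Mel log spectrum.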

Due: Monday, 3 Feb 2014.