INTRO TO MACHINE LEARNING PROJECTS: SPEECH RECOGNITION

Speech Recognition

Background and summary: Real-world speech recognition is a common machine learning problem. In this project you will take the audio from a massive dataset of videos of individuals speaking to a camera and try to correctly predict the words being spoken. Alternatively, or in addition, you could take the project in a different direction and focus on lip reading: take the video frames as input and predict the words being spoken without using the audio.

Goal: Transcribe the audio from a collection of videos into text by predicting the words being spoken. Alternatively, you could remove the audio from the videos and try to learn what is being said by lip reading.

Input data: The data for this project is a collection of very short videos in which various people look into a camera and say a short sentence (see Example.mp4). The training data comes with transcripts. The test data consists of similar videos, but without transcripts.
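Because the training videos come with transcripts, you can hold some of them out and score your predictions against the reference text. This description does not specify an evaluation metric; word error rate (WER), the word-level edit distance divided by the number of reference words, is a common choice for speech recognition and lip reading, so the minimal sketch below assumes it. The function name `word_error_rate` is hypothetical and the comparison is case-sensitive.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: WER is used as the evaluation metric; the project does not specify one.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev holds the previous DP row: edit distances between the first
    # (i - 1) reference words and each prefix hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution (cost 1) or match (cost 0)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Here the hypothesis drops one of six reference words, so the call returns about 0.167. Since the comparison is case-sensitive, you may want to normalize case and punctuation in both strings before scoring.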

To get the data, please contact TA Yolanda Gao at:
yanggao@andrew.cmu.edu
Data Description: Lip Reading Datasets
Relevant papers:
Lip Reading in the Wild
Lip Reading Sentences in the Wild
Lip Reading in Profile