Speech Recognition |
Background and summary: Real-world speech recognition is a common machine learning problem. In this project you will take
audio from a massive dataset of videos of individuals speaking to a camera and try to correctly predict the words
being spoken. Alternatively, or in addition, you could take the project in a different direction and focus on
lip reading instead, taking the video frames as input and predicting the words being spoken with no audio.
Goal: Translate the audio from a collection of videos into text by predicting what word is being spoken. Alternatively, you could remove the audio from the videos and try to learn what is being said by lip reading.
Input data: The data used for this project is a collection of very short videos in which various people look into a camera and say a short sentence (example: Example.mp4). The training data comes with transcripts. The test data features similar videos; however, they do not have transcripts. To get the data, please contact TA Yolanda Gao at yanggao@andrew.cmu.edu.
Data Description: Lip Reading Datasets
Relevant papers: "Lip Reading in the Wild"; "Lip Reading Sentences in the Wild"; "Lip Reading in Profile" |