16-892
Administrivia
Instructor: | Deva Ramanan (deva@cs.cmu.edu, but please use Slack when possible) |
Slack: | sign-up link |
Lectures: | Mon & Wed, 3:30-4:50pm, NSH 1305 |
Course overview
This course will discuss recent foundation models proposed in the literature, with a focus on vision-language models. Topics include large language models, vision-language models, and vision-audio models. As time allows, the course will also discuss applications of such models to visual, audio, and video content generation. The course will be a mix of lectures (many of them from experts in the area) and paper readings. Students will be expected to present 1-2 lectures and will be required to email a draft of their slide deck at least 24 hours before their presentation (Monday presenters, please email the slide deck on Friday).
Prerequisites
16-820 (or 16-720).
Recommended course materials
We will make heavy use of recent research papers, linked above.
Grading
Grading will be based on the lecture presentation and course participation.
Acknowledgements
I gladly acknowledge a host of other instructors for making their teaching materials available online.