16-892
Seminar on Multimodal Foundation Models
Fall 2024



[Schedule] [Papers]

Administrivia

Instructor: Deva Ramanan (deva@cs.cmu.edu, but please use Slack when possible)
Slack: sign-up link
Lectures: Mon & Wed, 3:30-4:50pm, NSH 1305

Course overview

This course will discuss recent foundation models proposed in the literature, with a focus on vision-language models. Topics include large language models, vision-language models, and vision-audio models. As time allows, this course will also discuss the application of such models to visual, audio, and video content generation. The course will be a mix of lectures (many of them from experts in the area) and paper readings. Students will be expected to present 1-2 lectures, and will be required to email a draft of their slide deck at least 24 hours before their presentation (Monday presenters, please email the slide deck on Friday).

Prerequisites

16-820 (or 16-720).

Recommended course materials

We will make heavy use of recent research papers, linked above.

Grading

Grading will be based on the lecture presentation and course participation.

Acknowledgements

I gladly acknowledge a host of other instructors for making their teaching materials available online.