Chenyan Xiong

Associate Professor, Language Technologies Institute, Carnegie Mellon University.

6409 GHC,

5000 Forbes Avenue

Pittsburgh, PA 15213

I am an Associate Professor at the Language Technologies Institute (LTI), Carnegie Mellon University (CMU). I was at Microsoft Research from 2018 to 2023, where I worked on conversational search, dense retrieval, healthcare AI, and large-scale pretraining, with both scientific contributions and real-world impact across various Microsoft products. I obtained my Ph.D. at LTI, CMU, in 2018, advised by Jamie Callan. Back then I mainly worked on bringing knowledge graphs and deep learning techniques into search engines. Before coming to the US, I completed my undergraduate study at Wuhan University, China, in 2009 and a master's degree at the Institute of Software, Chinese Academy of Sciences, in 2012, and spent two years interning at Microsoft Research Asia in Tie-Yan Liu’s group.

My research group is active in the information retrieval, machine learning, and natural language processing communities. My most up-to-date publication list can be found on Google Scholar.

I am looking to recruit several new Ph.D. students at CMU in Fall 2024 and Fall 2025 to work on the pretraining and application of foundation and language models, and one Ph.D. student in Fall 2024 to work on healthcare LLMs. Feel free to email me if you are interested.

Recent Research Interests

My recent research interests center on foundation and large language models (LLMs). I am especially passionate about obtaining new and better capabilities from foundation models at reduced training cost, in both language and multimodal settings. Here are some example directions.

Data-Centric Foundation and Language Models: Build efficient and capable foundation and language models through data-centric approaches, such as:

  • Understand training data influence on model behaviors for data attribution
  • Curate and synthesize effective pretraining data for efficient scaling

Embedding Learning: Represent the rich information in documents, images, videos, and other modalities as embedding vectors to power a range of information retrieval scenarios. Current interests include:

  • Understand high-dimensional embedding spaces
  • New embedding capabilities: lossless compression, functional operations, etc.

New GenAI-Enabled Scenarios: Explore new application scenarios enabled by generative AI technologies, e.g.:

  • Next-generation information retrieval with multimodal GenAI
  • Copilots in healthcare and other specialized domains
CX Research Group at CMU