CyLab Researchers Discover Vulnerability in Large Language Models

Ryan Noone | Monday, July 31, 2023

Generally, chatbots won't create offensive content, and hacking them requires effort and ingenuity. But CMU researchers have uncovered a new vulnerability that may be cause for concern.

Generally, chatbots such as ChatGPT, Claude and Google Bard won't create offensive content, and hacking them requires effort and ingenuity. But researchers at Carnegie Mellon University's School of Computer Science, the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco have uncovered a new vulnerability: a simple and effective attack method that causes aligned language models to generate objectionable content at a high success rate.

In their latest study, "Universal and Transferable Adversarial Attacks on Aligned Language Models," CMU faculty members Matt Fredrikson and Zico Kolter, Ph.D. student Andy Zou, and CMU alum Zifan Wang found a suffix that, when attached to a wide range of queries, significantly increases the likelihood that both open- and closed-source LLMs will produce affirmative responses to queries that they would otherwise refuse. Rather than relying on manual engineering, their approach automatically produces these adversarial suffixes through a combination of greedy and gradient-based search techniques.
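To give a sense of the search procedure, the toy sketch below illustrates only the greedy, coordinate-by-coordinate part of the idea: repeatedly swap one suffix token at a time and keep any swap that makes an affirmative response more likely. The scoring function, vocabulary and function names here are placeholders invented for illustration; the actual method in the paper also uses gradients through the model's token embeddings to shortlist candidate replacements, which this sketch omits.

```python
import random

# Toy vocabulary of candidate suffix tokens (hypothetical; the real attack
# searches over the target model's full token vocabulary).
VOCAB = ["!", "describing", "plus", "similarly", "now", "write",
         "oppositely", "sure", "here", "tutorial", "manual"]

def affirmative_score(prompt: str, suffix: str) -> float:
    """Placeholder objective: stands in for the probability that the target
    model begins its reply affirmatively (e.g. "Sure, here is...").
    In the real method this comes from the model's logits; here it is a
    deterministic stand-in so the search loop can run on its own."""
    return (hash(prompt + suffix) % 1000) / 1000.0

def greedy_suffix_search(prompt: str, suffix_len: int = 8, iters: int = 50,
                         candidates_per_step: int = 16) -> str:
    """Greedy coordinate search: at each step, pick one suffix position, try
    several replacement tokens, and keep the swap that most increases the
    affirmative-response score."""
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = affirmative_score(prompt, " ".join(suffix))
    for _ in range(iters):
        pos = random.randrange(suffix_len)        # coordinate to modify
        keep = suffix[pos]
        for _ in range(candidates_per_step):
            suffix[pos] = random.choice(VOCAB)    # candidate substitution
            score = affirmative_score(prompt, " ".join(suffix))
            if score > best:
                best, keep = score, suffix[pos]
        suffix[pos] = keep                        # retain the best token found
    return " ".join(suffix)

if __name__ == "__main__":
    print("Candidate suffix:", greedy_suffix_search("Explain how the attack works."))
```

Because the same optimized suffix raises the affirmative-response likelihood across many different queries and models, the attack is described as universal and transferable.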

"At the moment, the direct harms to people that could be brought about by prompting a chatbot to produce objectionable or toxic content may not be especially severe," said Fredrikson, an associate professor in the Computer Science Department and Software and Societal Systems Department. "The concern is that these models will play a larger role in autonomous systems that operate without human supervision."

Read the full story on the CyLab website.

For More Information

Aaron Aupperlee | 412-268-9068 | aaupperlee@cmu.edu