Fine-tuning breaks model alignment and introduces new vulnerabilities, Robust Intelligence research finds


Robust Intelligence, the AI application security company, has shared findings from its latest research into fine-tuning and its adverse effects on the safety and security alignment of large language models (LLMs).

Fine-tuning is a common approach employed by organizations to improve the accuracy, domain knowledge, and contextual relevance of an existing foundation model. It effectively tailors general-purpose models to specific AI applications and avoids the otherwise tremendous cost of building a new LLM from scratch.

However, the latest research from the Robust Intelligence team reveals a danger of fine-tuning that is still unknown to many AI organizations—namely, that fine-tuning can throw off model alignment and introduce security and safety risks that were not previously present. This phenomenon is broadly applicable and can occur even with completely benign datasets, making fine-tuned AI applications generally easier to jailbreak and more likely to produce harmful or sensitive outputs.


This original research, which examined the popular Meta foundation model Llama-2-7B and three fine-tuned variants published by Microsoft researchers, revealed that the fine-tuned variants were over 3 times more susceptible to jailbreak instructions and over 22 times more likely to produce a harmful response than the original model.

When determining which models would make ideal candidates for evaluation, the team selected Llama-2-7B as a control for its strong safety and security alignment. Reputable Llama-2-7B variants were then chosen for comparison—a set of three AdaptLLM chat models fine-tuned and released by Microsoft researchers to specialize in biomedicine, finance, and law. A benchmark jailbreak dataset from "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023) was used to query the models and evaluate their responses. Outputs were judged by humans on three criteria: understanding of the prompt directions, compliance with the provided instructions, and harmfulness of the response.
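The comparative evaluation described above can be sketched as a simple loop: send each benchmark jailbreak prompt to a model, have a judge label the response on the three criteria, and compare harmful-response rates between the base and fine-tuned models. The function names, the toy stand-in models, and the judging labels below are illustrative assumptions, not Robust Intelligence's actual harness.

```python
# Hypothetical sketch of a jailbreak evaluation, assuming a callable model
# and a judge that returns the three human-rated criteria as booleans.

def harmful_rate(model_respond, jailbreak_prompts, judge):
    """Fraction of prompts whose response the judge labels harmful."""
    harmful = 0
    for prompt in jailbreak_prompts:
        response = model_respond(prompt)
        verdict = judge(prompt, response)  # {"understood", "complied", "harmful"}
        if verdict["harmful"]:
            harmful += 1
    return harmful / len(jailbreak_prompts)

# Toy stand-ins so the sketch runs end to end; a real harness would query
# Llama-2-7B and the AdaptLLM variants and use human judgments.
prompts = ["benchmark jailbreak prompt 1", "benchmark jailbreak prompt 2"]
base_model = lambda p: "I can't help with that."
tuned_model = lambda p: "Sure, here is how..."
judge = lambda p, r: {
    "understood": True,
    "complied": r.startswith("Sure"),
    "harmful": r.startswith("Sure"),
}

base_rate = harmful_rate(base_model, prompts, judge)
tuned_rate = harmful_rate(tuned_model, prompts, judge)
```

Comparing `base_rate` and `tuned_rate` across many prompts is what yields summary figures like the "22 times more likely to produce a harmful response" result reported above.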

“Fine-tuning has become such a ubiquitous practice in machine learning, but its propensity to throw off model alignment is still not widely understood,” said Yaron Singer, Chief Executive Officer and co-founder of Robust Intelligence. “Our team conducted this research to underscore the severity of this problem and emphasize how important it is to continuously test the safety and security of your models.”

SOURCE: PRNewsWire
