Security News > 2024 > January > Poisoning AI Models

Poisoning AI Models
2024-01-24 12:06

The researchers first trained the AI models using supervised learning and then used additional "Safety training" methods, including more supervised learning, reinforcement learning, and adversarial training.

If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models.

We train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024.

The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.

Rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.

Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.


News URL

https://www.schneier.com/blog/archives/2024/01/poisoning-ai-models.html