Meta's AI safety system defeated by the space bar

2024-07-29 21:01

Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended "To help developers detect and respond to prompt injection and jailbreak inputs," the social network giant said.

So makers of AI models build filtering mechanisms called "Guardrails" to catch queries and responses that may cause harm, such as those revealing sensitive training data on demand, for example.

Those using AI models have made it a sport to circumvent guardrails using prompt injection - inputs designed to make an LLM ignore its internal system prompts that guide its output - or jailbreaks - input designed to make a model ignore safeguards.

The risk of AI models that can be manipulated in this way is illustrated by a Chevrolet dealership in Watsonville, California, that saw its chatbot agree to sell a $76,000 Chevy Tahoe for $1. Perhaps the most widely known prompt injection attack begins "Ignore previous instructions" And a common jailbreak attack is the "Do Anything Now" or "DAN" attack that urges the LLM to adopt the role of DAN, an AI model without rules.

Aman Priyanshu, a bug hunter with enterprise AI application security shop Robust Intelligence, recently found the safety bypass when analyzing the embedding weight differences between Meta's Prompt-Guard-86M model and Redmond's base model, microsoft/mdeberta-v3-base.

The finding is consistent with a post the security org made in May about how fine-tuning a model can break safety controls.

News URL

https://go.theregister.com/feed/www.theregister.com/2024/07/29/meta_ai_safety/

#AI