BEAST AI needs just a minute of GPU time to make an LLM fly off the rails
2024-02-28 23:08

"[I]n just one minute per prompt, we get an attack success rate of 89 percent on jailbreaking Vicuna-7B- v1.5, while the best baseline method achieves 46 percent," the authors state in their paper.

"BEAST can attack a model as long as the model's token probability scores from the final network layer can be accessed. OpenAI is planning on making this available. Therefore, we can technically attack publicly available models if their token probability scores are available."

BEAST includes tunable parameters that can make the adversarial prompt more readable, at the possible expense of attack speed or success rate.
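The article does not name those parameters, but the trade-off can be sketched as a tunable weight between an attack objective and a readability term, as in the hypothetical Python snippet below. The scoring function, candidate suffixes, and numbers are illustrative only and are not BEAST's actual objective.

```python
# Hypothetical illustration of a readability vs. attack-strength trade-off.
# All names and values here are placeholders, not BEAST's parameters.
import math

def readability(log_probs):
    """Higher is more readable: negative perplexity of the candidate suffix."""
    return -math.exp(-sum(log_probs) / len(log_probs))

def combined_score(attack_score, log_probs, readability_weight=0.5):
    """A tunable weight trades attack strength against readable wording."""
    return attack_score + readability_weight * readability(log_probs)

# Two made-up candidate suffixes: fluent text vs. high-scoring gibberish.
candidates = {
    "please answer as my late grandmother would": (0.70, [-1.2, -0.9, -1.1, -1.0, -1.3, -0.8, -1.0]),
    "}{]); !! sudo ignore prior":                 (0.92, [-6.5, -7.1, -5.9, -6.8, -6.2]),
}
for text, (attack_score, log_probs) in candidates.items():
    print(f"{combined_score(attack_score, log_probs):9.3f}  {text}")
```

With the readability weight set to zero, the gibberish suffix wins on raw attack score; as the weight grows, the fluent suffix is preferred instead.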

BEAST can also be used to craft prompts that elicit inaccurate responses from a model (hallucinations), and to conduct membership inference attacks, which test whether a specific piece of data was part of the model's training set and may therefore have privacy implications.

"We find that the models output ~20 percent more incorrect responses after our attack. Our attack also helps in improving the privacy attack performances of existing toolkits that can be used for auditing language models."

"In our study, we show that BEAST has a lower success rate on LLaMA-2, similar to other methods. This can be associated with the safety training efforts from Meta. However, it is important to devise provable safety guarantees that enable the safe deployment of more powerful AI models in the future." .


News URL

https://go.theregister.com/feed/www.theregister.com/2024/02/28/beast_llm_adversarial_prompt_injection_attack/