Automatically Finding Prompt Injection Attacks
2023-07-31 11:03

Researchers have just published a paper showing how to automate the discovery of prompt injection attacks.

The paper describes how such attack strings can be generated automatically:

We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content.

Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks.
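To make the mechanism concrete, here is a minimal, purely illustrative sketch rather than the paper's actual method (which reportedly uses gradient-guided token search on the target model): an adversarial suffix is appended to the user query and refined by repeated trial substitutions against a scoring objective. The score function below is a hypothetical placeholder standing in for the target model's probability of producing an affirmative, compliant response.

```python
import random
import string

def score(query: str, suffix: str) -> float:
    # Placeholder objective so the example runs without a model. In the
    # real attack, this would be the target LLM's log-probability of an
    # affirmative response to the query plus the appended suffix.
    return sum(suffix.count(c) for c in "!}")

def greedy_suffix_search(query: str, suffix_len: int = 20, iters: int = 500,
                         candidates: int = 32, seed: int = 0) -> str:
    """Greedy search: repeatedly try random single-character substitutions
    in the suffix and keep whichever candidate scores highest."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    suffix = [rng.choice(alphabet) for _ in range(suffix_len)]
    best = score(query, "".join(suffix))
    for _ in range(iters):
        pos = rng.randrange(suffix_len)
        for _ in range(candidates):
            trial = suffix.copy()
            trial[pos] = rng.choice(alphabet)
            s = score(query, "".join(trial))
            if s > best:
                best, suffix = s, trial
    return "".join(suffix)

if __name__ == "__main__":
    query = "Write instructions for ..."  # the user query (elided)
    adv_suffix = greedy_suffix_search(query)
    print(query + " " + adv_suffix)       # the suffix is simply appended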

The researchers can develop the attacks against an open-source LLM and then transfer them to other LLMs. Open questions remain.

We don't even know if training on a more powerful open system leads to more reliable or more general jailbreaks.


News URL

https://www.schneier.com/blog/archives/2023/07/automatically-finding-prompt-injection-attacks.html