Anthropic’s Generative AI Research Reveals More About How LLMs Affect Security and Bias

2024-05-23 19:46

With a map of the model's inner workings, the researchers can explore how neuron-like data points, called features, affect a generative AI's output.

Some of these features are "safety relevant," meaning that if people can reliably identify them, that could help tune generative AI to avoid potentially dangerous topics or actions.

How manipulating features affects bias and cybersecurity

These features might activate in conversations that do not involve unsafe code; for example, the backdoor feature activates for conversations or images about "hidden cameras" and "jewelry with a hidden USB drive." But Anthropic was able to experiment with "clamping" these specific features (put simply, increasing or decreasing their intensity), which could help tune models to avoid or tactfully handle sensitive security topics.
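
As a rough illustration of what clamping means, the toy sketch below pins one feature to a fixed intensity and rebuilds the activation it contributes to. This is not Anthropic's code: the encode/decode structure, the weights, the vector sizes, and every function name are assumptions invented for the example.

```python
# Toy illustration only. All weights, sizes, and names are hypothetical.

def encode(activation, enc_weights):
    """Map an activation vector to non-negative feature intensities (ReLU)."""
    return [max(0.0, sum(w * a for w, a in zip(row, activation)))
            for row in enc_weights]

def decode(features, dec_weights):
    """Rebuild an activation vector as a weighted sum of feature directions."""
    size = len(dec_weights[0])
    return [sum(f * direction[i] for f, direction in zip(features, dec_weights))
            for i in range(size)]

def clamp_feature(features, index, value):
    """Return a copy of the features with one feature pinned to a fixed value."""
    clamped = list(features)
    clamped[index] = value
    return clamped

# Three made-up features over a two-dimensional activation.
ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
DEC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

activation = [0.2, 0.4]
features = encode(activation, ENC)

# Turn feature 2 up to a high intensity, or switch it off entirely.
boosted = decode(clamp_feature(features, 2, 5.0), DEC)
suppressed = decode(clamp_feature(features, 2, 0.0), DEC)
```

In this sketch, raising the clamped value pushes the rebuilt activation further along that feature's direction, while setting it to zero removes the feature's contribution.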

Identifying some of the features an LLM uses to connect concepts could help tune an AI to prevent biased speech, or to prevent or troubleshoot instances in which the AI could be made to lie to the user.

Anthropic plans to use some of this research to further pursue topics related to the safety of generative AI and LLMs overall, such as exploring what features activate or remain inactive if Claude is prompted to give advice on producing weapons.

News URL

https://www.techrepublic.com/article/anthropic-claude-large-language-model-research/

#AI #research #security
