OpenAI, Anthropic Research Reveals More About How LLMs Affect Security and Bias
With Anthropic's map, the researchers can explore how neuron-like patterns of activation, called features, affect a generative AI's output.
The researchers go into detail in their paper on scaling and evaluating sparse autoencoders; put very simply, the goal is to make features more understandable to humans, and therefore more steerable.
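To give a concrete, if simplified, picture of what such a sparse autoencoder does, the sketch below trains a toy autoencoder to reconstruct a model's activations while an L1 penalty pushes most feature activations toward zero. The dimensions, sparsity weight, and training loop are illustrative assumptions, not values from either lab's paper.

```python
import torch

# Toy sparse autoencoder: projects model activations into a wider,
# mostly-zero "feature" space and reconstructs them. All sizes and the
# sparsity weight below are illustrative assumptions.
d_model, d_features, sparsity_weight = 512, 4096, 1e-3

encoder = torch.nn.Linear(d_model, d_features)
decoder = torch.nn.Linear(d_features, d_model)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

for step in range(1000):
    acts = torch.randn(64, d_model)        # stand-in for captured LLM activations
    features = torch.relu(encoder(acts))   # sparse, non-negative feature activations
    reconstruction = decoder(features)

    # Reconstruction error keeps the features faithful to the model's activations;
    # the L1 penalty keeps each input explained by only a few active features.
    loss = torch.nn.functional.mse_loss(reconstruction, acts) \
           + sparsity_weight * features.abs().mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The ReLU and the L1 term together are what make the learned features sparse, which is what makes them plausible candidates for human-readable labels.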
How manipulating features affects bias and cybersecurity
These features might activate in conversations that do not involve unsafe code; for example, the backdoor feature activates for conversations or images about "hidden cameras" and "jewelry with a hidden USB drive." But Anthropic was able to experiment with "clamping" these specific features (put simply, increasing or decreasing their intensity), which could help tune models to avoid or tactfully handle sensitive security topics.
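To make "clamping" concrete, the sketch below encodes an activation vector into features, pins one chosen feature to a fixed value (zero to suppress it, a large value to amplify it), and decodes back. The `clamp_feature` helper, the feature index, and the layer shapes are hypothetical illustrations, not Anthropic's implementation.

```python
import torch

d_model, d_features = 512, 4096
encoder = torch.nn.Linear(d_model, d_features)   # illustrative, untrained weights
decoder = torch.nn.Linear(d_features, d_model)

@torch.no_grad()  # no gradients are needed when only steering
def clamp_feature(acts: torch.Tensor, feature_idx: int, value: float) -> torch.Tensor:
    """Hypothetical clamping helper: pin one feature's activation to a fixed
    value (0.0 suppresses it, a large value amplifies it), then decode back
    into the model's activation space."""
    features = torch.relu(encoder(acts))
    features[..., feature_idx] = value
    return decoder(features)

acts = torch.randn(1, d_model)   # stand-in for an LLM's internal activations
suppressed = clamp_feature(acts, feature_idx=123, value=0.0)   # feature switched off
amplified = clamp_feature(acts, feature_idx=123, value=10.0)   # feature turned up
```

In an actual steering setup, the decoded vector would be patched back into the model's forward pass so that later layers compute on the edited activations; here it is simply returned for inspection.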
Identifying some of the features an LLM uses to connect concepts could help tune an AI to avoid biased speech, or to prevent or troubleshoot instances in which the AI could be made to lie to the user.
Anthropic plans to use some of this research to further pursue topics related to the safety of generative AI and LLMs overall, such as exploring what features activate or remain inactive if Claude is prompted to give advice on producing weapons.
News URL
https://www.techrepublic.com/article/anthropic-claude-openai-large-language-model-research/