Security News > 2022 > October > Adversarial ML Attack that Secretly Gives a Language Model a Point of View

"Machine learning security is extraordinarily difficult because the attacks are so varied-and it seems that each new one is weirder than the next. Here's the latest: a training-time attack that forces the model to exhibit a point of view: Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures."
Abstract: We investigate a new threat to neural sequence-to-sequence models: training-time attacks that cause models to "Spin" their outputs so as to support an adversary-chosen sentiment or point of view-but only when the input contains adversary-chosen trigger words.
Model spinning introduces a "Meta-backdoor" into a model.
An adversary can create customized language models that produce desired spins for chosen triggers, then deploy these models to generate disinformation, or else inject them into ML training pipelines, transferring malicious functionality to downstream models trained by victims.
It stacks an adversarial meta-task onto a seq2seq model, backpropagates the desired meta-task output to points in the word-embedding space we call "Pseudo-words," and uses pseudo-words to shift the entire output distribution of the seq2seq model.
We evaluate this attack on language generation, summarization, and translation models with different triggers and meta-tasks such as sentiment, toxicity, and entailment.