Safety

This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs (preprint)

Placing a single malicious agent in a Mixture of LLMs can nullify all of the gains the mixture achieves. We study these vulnerabilities in multiple-choice passage comprehension and question answering settings and propose unsupervised defense mechanisms that recover a large portion of the lost performance.

Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic