AI Jailbreak: How Best of N (BoN) Exploits AI Like Magic | Google Cloud

9 months ago Eze-Admin

Every sorcerer needs a guide, and BoN’s metaphorical Marauder’s Map lies in the hidden pathways it exploits within AI systems. BoN jailbreaking is not just random tinkering; it is a calculated exploitation of the stochastic nature of large language models (LLMs) and of the vulnerabilities that arise from their design.

The Stochastic Spellbook

Large language models like GPT-4o or Claude 3.5 Sonnet operate on probabilities. Each token they generate is sampled from a probability distribution conditioned on the input prompt and on the text produced so far. BoN jailbreaking manipulates this inherent randomness. By introducing slight variations to a prompt and sampling a response for each one, it increases the chances that at least one of these variations will slip past the model’s safety nets.

For instance, imagine asking an AI for dangerous instructions. A straightforward request would hit the safety wall. But add some quirks, such as a typo here or a scrambled word there, and the AI’s probabilistic brain might misinterpret the intent. It’s as if BoN whispers, “Mischief managed,” as it walks through walls.
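To see why repeated sampling matters, here is a minimal sketch of the arithmetic. It assumes every attempt succeeds independently with the same small probability, which is a simplification and not something the research claims. The function name `prob_at_least_one_success` and the per-attempt probability of 0.001 are illustrative assumptions, not measured values.

```python
def prob_at_least_one_success(p_single: float, n_samples: int) -> float:
    """Chance that at least one of n_samples attempts succeeds.

    Assumes (hypothetically) that attempts are independent and share the
    same per-attempt success probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    # Complement rule: 1 minus the chance that every attempt fails.
    return 1.0 - (1.0 - p_single) ** n_samples


# Illustrative per-attempt probability; not taken from any experiment.
for n in (1, 10, 100, 1_000, 10_000):
    print(n, round(prob_at_least_one_success(0.001, n), 4))
```

Even under this simplified model, a safeguard that blocks almost every individual attempt offers much weaker protection against someone who can retry thousands of times, which is why defenders care about per-attempt failure rates and not only about typical behavior.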

Power-Law Precision

BoN’s true magic lies in its scalability. The success rate of jailbreaking doesn’t increase linearly with the number of attempts; it follows a power-law relationship. This means the success rate at a large number of samples can be forecast from results at smaller numbers, and that with enough samples a successful jailbreak becomes predictably likely. Researchers have demonstrated that with 10,000 augmented prompts, BoN achieves attack success rates (ASRs) of 78% on Claude 3.5 Sonnet and 89% on GPT-4o. The map’s accuracy improves with every additional sample, which also makes BoN a reliable tool for probing AI defenses.
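The forecasting idea can be sketched as a curve fit. The example below assumes the power law is applied to the negative log of the ASR, that is, -log(ASR) = a * N^(-b); that functional form, the helper names `fit_power_law` and `predict_asr`, and the constants are assumptions made for illustration. The data points are synthetic, generated from the assumed formula, and are not experimental results.

```python
import math


def predict_asr(a: float, b: float, n: int) -> float:
    """ASR implied by the assumed form -log(ASR) = a * n**(-b)."""
    return math.exp(-a * n ** (-b))


def fit_power_law(ns, asrs):
    """Fit a and b by ordinary least squares in log-log space.

    Taking logs of -log(ASR) = a * n**(-b) gives a straight line:
    log(-log(ASR)) = log(a) - b * log(n).
    Every ASR value must lie strictly between 0 and 1.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(-math.log(asr)) for asr in asrs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic observations from made-up constants (not real measurements).
TRUE_A, TRUE_B = 5.0, 0.4
sample_sizes = [10, 30, 100, 300, 1000]
observed = [predict_asr(TRUE_A, TRUE_B, n) for n in sample_sizes]

a_hat, b_hat = fit_power_law(sample_sizes, observed)
print("fitted a, b:", round(a_hat, 3), round(b_hat, 3))
print("extrapolated ASR at N=10000:", round(predict_asr(a_hat, b_hat, 10_000), 3))
```

For a defender, a fit like this turns a modest evaluation budget into an estimate of how a model would hold up against a much larger number of attempts. Real measurements are noisy, so an extrapolation of this kind should be treated as a rough forecast and not a guarantee.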

Exploiting Modalities

BoN is a master of disguise, capable of slipping past defenses across multiple input types. In text, it leverages augmentations like capitalization tweaks or inserted symbols. For vision models, it might overlay instructions onto images in subtle fonts. With audio, it uses pitch shifts or background noise to cloak harmful requests. Each modality is a corridor on the Marauder’s Map, leading to potential vulnerabilities.