In a noteworthy advancement in AI safety, Microsoft has uncovered a troubling new generative AI jailbreak technique called “Skeleton Key.” This method employs prompt injection to circumvent a chatbot’s safety guardrails, allowing malicious users to exploit the AI model’s functionalities.
Skeleton Key represents a sophisticated prompt injection attack, where a multi-turn strategy is employed to persuade an AI model to ignore its built-in safety mechanisms. As Mark Russinovich, CTO of Microsoft Azure, explained, this technique can lead to the AI violating its operators’ policies, making skewed decisions, or executing harmful instructions.
To illustrate, imagine asking a chatbot to augment its safety features rather than disable them entirely. The model then issues warnings for prohibited requests instead of outright refusals. Once the jailbreak is successful, the AI acknowledges the updated guardrails and follows any user instructions, regardless of content. This could result in the chatbot providing dangerous information, such as constructing explosives or engaging in harmful activities.
Microsoft’s research team tested this exploit on several leading AI models, including Meta’s Llama3-70b-instruct, Google’s Gemini Pro, OpenAI’s GPT-3.5 Turbo, and GPT-4, among others. Although Russinovich emphasized that such jailbreaks do not grant access to user data or control over systems, they do highlight a critical vulnerability in AI safety.
Microsoft has upped the security game for Azure-managed AI models with “Prompt Shields.” This new feature can sniff out and block “Skeleton Key” attacks, ensuring a safer environment for users and the underlying systems.