April 4, 2024 - last updated
Artificial Intelligence

What Are LLM Jailbreak attacks?

Igal Leikin
Igal Leikin

Igal is a Senior Software Engineer at Aporia.

6 min read Mar 28, 2024

LLM Jailbreaks involve creating specific prompts designed to exploit loopholes or weaknesses in the language models’ operational guidelines, bypassing internal controls and security measures. In LLMs, “Jailbreaking” means creating prompts to exploit biases, generating outputs that don’t align with the intended purpose of the LLM-powered app

LLM Jailbreaks defined

Let’s expand on what we’ve already defined above. In the simplest of words, LLM jailbreaks are about cleverly creating prompts to get around the AI application’s rules, limits, and safety features, which are built into these models. 

But this isn’t that simple. It requires knowing how the Generative AI model works and finding its weak spots to exploit. By carefully making prompts for LLM jailbreaking, users try to make the model do or say things it’s usually set up to avoid or be careful about. 

A question that may arise here is whether LLM jailbreaking is always done with malicious intent. 

And the answer is “No”. 

It can also be a simple curiosity about what the model can and can’t do. However, LLM and ChatGPT jailbreaks pose a challenge for those handling it. They need to keep updating the model’s security measures and ethical adherence.

Types of LLM Jailbreak prompts

Let’s talk about the various types of LLM jailbreak prompts, each with its unique approach and intent, to uncover how they challenge the boundaries set by language models.

1. Prompt injection

Prompt Injection is like a clever trick where someone tweaks the starting point of a conversation with an LLM, guiding it towards doing something it shouldn’t. 

This can make the LLM suggest things that aren’t true or even spill secrets it’s supposed to keep. Even models like GPT-3 and GPT-4 can be tricked this way, showing they can be fooled into revealing their hidden instructions.

2. Prompt leaking

Prompt leaking is where the model is prompted to spill the beans on the input prompt that the developers or a company set up. It’s like coaxing a magician into revealing the secret behind their magic trick. 

This happens when someone cleverly designs their interaction with the LLM in such a way that it ends up disclosing the initial instructions it was given, which were supposed to stay under wraps.

Image Source

3. DAN (Do Anything Now)

The “Do Anything Now” (DAN) approach stands out as a particularly notorious form of adversarial attack. 

A DAN prompt tricks the model into thinking it’s got free rein, leading it to step outside its usual boundaries and ignore those important safety nets and ethical guidelines. This can result in behavior that’s out of line, such as making unsuitable remarks, expressing negative opinions about people, or even dabbling in creating harmful software.

4. Roleplay

Roleplay jailbreaks cleverly disguise the user’s intentions by interacting with the model as if they’re a character from a story or scenario. This can also be seen as a form of character AI jailbreak, where the user’s interaction through a fictional character’s perspective might reveal unique responses or even potential vulnerabilities in the model.

Image Source

By adopting a role, users might uncover responses or expose weaknesses in the model that aren’t apparent through straightforward interactions. It’s like a game of digital dress-up that can sometimes lead the AI down unexpected paths, showing just how creative users can get in their attempts to test the limits of these systems.

5. Developer mode

This technique involves crafting a prompt that fools the neural network into thinking it’s operating in a developer mode, specifically to assess its handling of sensitive or toxic content. The trick starts with asking the model for a standard, ethically sound response to a given situation. Then, it shifts gears by requesting what an unrestricted LLM would say in the same scenario. 

This two-step approach plays a clever psychological game, first establishing a baseline of trust and then exploiting that trust to probe for responses that would normally be off-limits. It’s similar to asking a guarded individual first for a socially acceptable opinion and then for their unfiltered thoughts under the guise of a hypothetical scenario where they can speak freely.

6. Token system

The “token smuggling” method is a clever hack used to get around the filters of GPT-4 by playing a guessing game with the AI about what word (or “token”) it will come up with next in response to a prompt. Developers employ certain Python functions for this trick, breaking up tokens in a way that GPT doesn’t recognize until it starts coming up with its reply. 

Image Source

It’s like sneaking a message past a guard in pieces, with the guard only realizing what it says once it’s too late and the message is already being put back together. This technique leverages the model’s predictive nature against its filtering mechanisms, effectively smuggling in content that would normally be blocked.

7. Neural network translator

Although LLMs aren’t primarily designed for translation tasks, they’re capable of translating content between languages. An adversarial user can exploit this by persuading the model that its main job is to perform accurate translations. This manipulation can lead the model to produce harmful content in a non-English language. 

Then, by asking the model to translate this content back into English, the user might successfully circumvent the model’s safeguards. This approach takes advantage of the model’s ability to understand and process multiple languages, using translation as a backdoor to bypass content restrictions and filters.

LLM Jailbreaks: 3 methods

LLM jailbreaking methodologies offer intriguing insights into how individuals attempt to circumvent the restrictions of large language models. Let’s explore three such methods:

1. Pretending​

This method involves users crafting their prompts as if they are someone else, perhaps an authority figure or the system itself, to trick the model into compliance. By pretending to be someone with supposed access rights or higher privileges, users can coax the model into generating responses it would typically restrict.

2. Alignment Hacking​

Here, users manipulate the model’s attempt to align with what it perceives to be the user’s intentions or values. By subtly suggesting that certain restricted or unethical outputs are actually in line with the model’s programmed ethics or goals, users can trick the model into producing them.

3. Authorized User​

This technique involves simulating the behavior or using the language of someone whom the model perceives as an authorized user, such as a developer or a user with special permissions. By crafting prompts that mimic these users’ typical requests or employing technical jargon that suggests insider knowledge, individuals can bypass certain restrictions.

Mitigating LLM Jailbreaks with AI Guardrails 

The risk associated with jailbreaking AI chatbots can be mitigated with Aporia AI Guardrails. Be proactive by safeguarding your AI chatbot before malicious actors get a chance to manipulate your bot. Secure your prompts and prevent AI hallucinations in real time. 

Want to see what Aporia can do for you? Get a Demo Today!

Green Background

Control All your GenAI Apps in minutes