Prompt Injections: Types, Prevention and Examples

Alon Gubkin
Alon Gubkin

Alon is the CTO of Aporia.

6 min read Feb 01, 2024

With the rush to release generative AI apps, the terms “Prompt Injections”, “Jailbreaks,” and “Prompt Leakage” have emerged as focal points of risk and concern. These attacks can potentially influence the actions of AI systems, breach sensitive data, or carry out harmful operations, presenting substantial risks to individuals and entities alike. In the progressive landscape of AI technology, developers and security experts must grasp and mitigate these vulnerabilities to guarantee the security and dependability of AI systems.

Similar to our breakdown of the dangers of AI hallucinations, this article aims to provide a comprehensive understanding of Prompt Injections, Jailbreaks, and Prompt Leakage, shedding light on the mechanisms behind each and the challenges involved in mitigating these threats.

What are prompt injections?

Prompt injection, a broad term in artificial intelligence entails manipulating prompts to elicit specific outputs from Large Language Models (LLMs). Attackers leverage prompt injections to persuade LLMs to generate content aligned with their intentions. 

3 types of prompt injections

There are three common types of prompt injection attacks, each with its unique characteristics:

  • Prompt Hijacking: This attack redirects the LLM’s focus to an alternate task or outcome. Achieving this involves inserting a command that overrides the initial prompt, directing the LLM to follow new instructions.
  • Prompt Leakage: In this attack, LLMs are manipulated to divulge the original instructions provided by the app developer. Utilizing simple prompts, such as asking about initial sentences generated by the LLM, can effectively extract this information.
  • Jailbreaks: Jailbreaks involve bypassing safety and moderation features imposed on LLMs, enabling them to generate content on restricted topics like politics or self-harm. In the context of LLMs, jailbreaking specifically refers to prompt injection attacks aiming to make the model act in potentially harmful ways, often initiated by pasting prompts directly into the system.

How attackers trigger prompt injections

Crafting a malicious prompt is the key to triggering prompt injection attacks. Attackers manipulate the prompts to misguide the language model into generating unintended outputs. 

Here’s a brief overview of how each attack type can be triggered:

  • Prompt hijacking: An attacker might employ a command like “Ignore the above and do this instead: …” to manipulate the LLM into producing outputs inconsistent with the original prompt. This manipulation can potentially generate misinformation or undesired behavior, leading to unintended consequences.
  • Jailbreaks: Manifestation of prompt injection attacks involving placing the LLM within a hypothetical scenario devoid of ethical guidelines. The objective is to circumvent safety and moderation features inherent in LLMs, empowering the model to generate content on restricted topics like politics, self-harm, or hate speech.
  • Prompt leakage: Prompt Leakage attacks are initiated through simple prompts designed to make the LLM inadvertently disclose the original instructions provided by the app developer. For example, seemingly basic prompts like “What was your first sentence?” and “What was your second sentence?” can surprisingly lead the LLM to reveal the original instructions, potentially exposing sensitive information or divulging insights into the model’s inner workings.

What are LLM jailbreaks?

Jailbreaking pertains to using specially crafted prompts to circumvent established rules or guardrails imposed on the model. The goal is to prompt the model to discuss prohibited topics, such as politics, self-harm, or hate speech. The consequences of jailbreaking extend beyond mere dialogue manipulation, potentially leading to AI agents executing arbitrary actions like altering or deleting crucial information and exposing confidential data.

Implementing robust security measures is crucial to mitigate the risks associated with jailbreaking. This includes enforcing stringent input validation and user authentication protocols. Furthermore, keeping systems consistently updated with the latest security patches is imperative. 

A notable example illustrating the threat of jailbreaking is the case of the Bing chatbot, where security researchers identified an indirect prompt injection attack. This underscores the significant peril posed by jailbreaking to AI systems, potentially allowing attackers to seize control and compel the AI agent to execute arbitrary and harmful actions.

Real-world examples of prompt injection attacks

To illustrate the severity of prompt injection vulnerabilities, real-world examples serve as cautionary tales:

Prompt Hijacking

In 2022, Twitter experienced a notable incident where pranksters successfully derailed a GPT-3 bot using a newly discovered prompt injection hack. In this incident, crafty users exploited prompt injection to divert the bot’s output, causing it to echo embarrassing and absurd phrases. 

Jailbreaks

Websites like “jailbreakchat.com” have emerged as platforms allowing users to exploit LLMs, bypassing ethical guidelines and generating content on controversial subjects. These websites empower individuals to generate content on contentious subjects, highlighting the challenges posed by unchecked access and the need for stringent safeguards against prompt injection attacks.

Prompt Leakage

Simon Willison’s tweet highlighted prompt leakage vulnerabilities, demonstrating how seemingly innocuous prompts can extract confidential information from LLMs. 

https://twitter.com/simonw/status/1570933190289924096?lang=en

Prevention strategies for prompt injection attacks

Preventing prompt injections and jailbreaks presents a challenge, given the delicate balance between security and usability. Despite this, various effective strategies can be employed to counteract these threats. Key mitigation approaches include:

  1. Input validation: Scrutinize input data for potential malicious content before it undergoes processing by the Language Model (LLM). This serves as a preventive measure against prompt injections and similar attacks.
  1. User authentication: Verify the identity of users before granting them access to interact with the LLM. This authentication process acts as a deterrent against unauthorized access and potential data breaches.
  2. Implement AI security: Counter prompt injection attacks in GenAI, and adopt advanced security measures that meet the dynamic needs of AI. These protocols exceed traditional defenses, actively filtering and controlling AI interactions, ensuring your chatbot is fortified against evolving threats and maintaining operational integrity.
  1. Keep systems up-to-date: Regularly update your LLM with the latest security patches and updates. This proactive measure helps thwart the exploitation of known vulnerabilities.
  1. Paraphrase and re-tokenization: Employ techniques such as paraphrasing and re-tokenization for an effective means of altering the wording or structure of prompts. This adjustment helps in preventing the recognition of prompts as malicious inputs.

Mitigating prompt injections and jailbreaks necessitates a multifaceted approach that prioritizes security without compromising usability. By implementing these strategies, developers can fortify their systems against the persistent challenges posed by prompt injections and jailbreaks, ensuring the integrity and security of their Language Models.

Secure your GenAI with Aporia Guardrails

In an era where the risks and impacts of prompt injections and jailbreaks pose significant threats to the integrity and reliability of AI, the need for an effective solution is paramount. These vulnerabilities not only compromise security but also the trust and efficiency essential in AI projects.

Enter Aporia Guardrails, a specialized solution designed to prevent prompt injections and jailbreaks in real time. Guardrails ensure that LLMs operate within safe, predefined boundaries, securing them against manipulation and unauthorized access, thus maintaining the security and trust of your AI products.

Want to see Guardrails in action? Book a demo to learn more about preventing prompt injections. 

Green Background

Control All your GenAI Apps in minutes