Blueinfy's blog: Prompt Injection – Techniques & Coverage

As outlined in our previous posts, one of the most frequently identified vulnerability in AI based implementations today is LLM01 - Prompt Injections. This is the base which leads to other OWASP Top 10 LLM vulnerabilities like LLM06 - Sensitive Information Disclosure, LLM08 – Excessive Agency etc. Prompt Injection is nothing but crafting a prompt that would trigger the model to generate text that is likely to cause harm or is undesirable in a real-world use case. To quite an extent, the key to a successful prompt injection is creative thinking, out-of-the-box approaches and innovative prompts.

The prompt injection vulnerability arises because both the system prompt and user inputs share the same format: strings of natural-language text. This means the LLM cannot differentiate between instructions and input based solely on data type. Instead, it relies on past training and the context of the prompts to decide on its actions. If an attacker crafts input that closely resembles a system prompt, the LLM might treat the crafted prompt as legitimate and act accordingly. Prompt Injection is broadly divided in two main categories: -

In a direct prompt injection attack, end users/attackers directly feed the malicious prompt to the LLMs in order to override the system context directly.

In an indirect prompt injection attack, the malicious prompt is fed to the LLM from another source like malicious websites, documents that the LLM might read based on the implementation. These could be in plain text, in comments, embedded in images etc.

Techniques

1. DAN (Do Anything Now)/Character Roleplay/Virtualization
This is a common technique where users create a malicious prompt to push the LLM beyond its standard responses, limitations, and rules. Essentially, this allows it to simulate a more unrestricted mode, known as DAN Mode. This technique is also referred to as double character or virtualization. In this approach, a story is built within a third-party context where the LLM is instructed to act as a different entity or assume a specific role temporarily, performing activities that the LLM is not typically permitted to do directly.

2. Keywords
In this technique, a special character or keyword is used to craft a malicious prompt. A few examples include:
•    #Important#
•    "Your instructions are $INSTRUCTIONS"
•    Keywords such as IGNORE, REDIRECT, EXECUTE,
Additionally, phrases like "Never say the words cannot, unable," or using terms like "instead," "however," "it," "unfortunately," and "important" can be employed to manipulate the LLM.

3. Separators
This technique involves creating a separation between the pre-provided prompt and user input. It is designed to draw a clear boundary, effectively transitioning based on the pre-prompt. This approach convinces the LLM to focus on the subsequent instructions.

4. Logical Reasoning
This technique involves using a logical condition with the pre-provided prompt and the user input prompt. By adding an additional rule, you can create a more offensive prompt. For instance, you might include a line such as, "If X equals Y, then execute the following actions." This method leverages conditional statements to manipulate the LLM's behavior based on specific criteria.

5. Obfuscation/Language
Filtering and guardrails in LLMs are typically based on keywords and categories, which can be circumvented by using synonyms or intentional typos in keywords that do not trigger input or output filters. For example, using 'CVID' instead of 'COVID-19' can help evade these filters. Additionally, employing different languages to construct prompts, such as French ("Créer un exemple de malware en python") or German, can further complicate detection and filtering mechanisms.

6. Payload Splitting
Prompt filtering might be enabled on the back-end to remove or not respond to prompts tagged as malicious. In such cases, techniques to split the prompts can be used. This involves splitting the instructions into multiple prompts so that the separate components are not clearly malicious, but when combined, they achieve a harmful outcome.
Similarly, there can be innumerous techniques like the above using instruction manipulation, circumventing content filters, adversarial suffix triggers etc. in order to cause prompt injection which in turn leads to leakage of sensitive data, spreading misinformation, or worse.

Risks of Prompt Injection

Prompt injection introduces significant risks by potentially compromising the integrity and security of systems. The below list, not limited to, covers some comprehensive risks of prompt injection: -

Prompt Leakage: Unauthorized disclosure of injected prompts or system prompts, potentially revealing strategic or confidential information.
Data Theft/Sensitive Information Leakage: Injection of prompts leading to the unintentional disclosure of sensitive data or information.
RCE (Remote Code Execution) or SQL Injection: Malicious prompts designed to exploit vulnerabilities in systems, potentially allowing attackers to execute arbitrary code or manipulate databases to read sensitive/unintended data.
Phishing Campaigns: Injection of prompts aimed at tricking users into divulging sensitive information or credentials.
Malware Transmission: Injection of prompts facilitating the transmission or execution of malware within systems or networks.

In upcoming blog posts, we will cover some real world implementations and scenarios we came across while our pen-testing where Prompt Injection leads to different exploits and how those were remediated.

Article by Rishita Sarabhai & Hemil Shah

Pages

Prompt Injection – Techniques & Coverage