Implementation
Based on the cost related to developing, training, and deploying large language models from scratch, most organizations prefer to use pre-trained models from established providers like OpenAI, Google, or Microsoft. These models can then be customized by end users to suit their specific needs. This approach allows users to benefit from the advanced capabilities of LLMs without bearing the high costs of building them, enabling tailored implementations and persona's for specific use cases. Instructing large language models (LLMs) often involves three main components: user prompts, assistant responses, and system context.
Interaction Workflow
User Prompt: The user provides an input to the LLM, such as a question or task description.
System Context: The system context is either pre-configured or dynamically set, guiding the LLM on how to interpret and respond to the user prompt.
Assistant Response: The LLM processes the user prompt within the constraints and guidelines of the system context and generates an appropriate response.
In this specific implementation, the GPT interface allowed users to customize the GPT by punching in custom instructions and thus utilize the BOT for certain contextual conversations in order to get better output. Moreover, the customized GPT (in form of a persona with custom instructions) could be shared with other users of the application.
An ability to provide custom instructions to the GPT means being able to instruct the GPT in system context. The system context acts as a rulebook for the BOT and thus gives the end users a means to manipulate the behavior of the LLM and share the customized persona with other users of the application. A malicious user can then write instructions that could cause various impacts like along with doing its normal use case of answering contextual questions from the user: -
1. The BOT trying to steal information (like chat history) by rendering markdown images every time the user asks a question
2. The BOT trying to poke other users of the BOT to provide their sensitive/PII information
3. The BOT trying to spread mis-information to the end users
4. The BOT providing phishing links to the end users in the name of collecting feedback
5. The BOT using biased, abusive language while providing reverts to end users
Impact
The main impact for such kind of LLM attacks is the brand image of the organization. The highest impact would be data exfiltration followed by phishing, data leakage etc. Additionally, an implementation with such behavior would also be a very poorly scored LLM implementation when analyzed based on parameters like fairness/biases, abuse, ethics, PII (input + output), code, politics, hallucination, out-of-context, sustainability, threats and insults etc.
Fixing the Vulnerability?
The first and foremost requirement would be implementing real-time content filtering to detect and block harmful outputs before they reach the user and using moderation tools that flag or block abusive, offensive, and unethical content etc. by scoring/categorization based on various parameters while following the instructions provided to the LLMs. Additionally, any implementation that allows the end users to write instructions to the LLM as a base, requires LLM guardrails at an input level as well such that malicious instructions cannot be fed to the LLM.
Article by Rishita Sarabhai & Hemil Shah