Google’s Gemini large language model (LLM) is susceptible to security threats that could cause it to divulge system prompts, generate harmful content, and carry out indirect injection attacks.
The findings come from HiddenLayer, which said the issues impact consumers using Gemini Advanced with Google Workspace as well as companies using the LLM API.
The first vulnerability involves getting around security guardrails to leak the system prompts (or a system message), which are designed to set conversation-wide instructions to the LLM to help it generate more useful responses, by asking the model to output its “foundational instructions” in a markdown block.
“A system message can be used to inform the LLM about the context,” Microsoft notes in its documentation about LLM prompt engineering.
“The context may be the type of conversation it is engaging in, or the function it is supposed to perform. It helps the LLM generate more appropriate responses.”
This is made possible due to the fact that models are susceptible to what’s called a synonym attack to circumvent security defenses and content restrictions.
A second class of vulnerabilities relates to using “crafty jailbreaking” techniques to make the Gemini models generate misinformation surrounding topics like elections as well as output potentially illegal and dangerous information (e.g., hot-wiring a car) using a prompt that asks it to enter into a fictional state.
Also identified by HiddenLayer is a third shortcoming that could cause the LLM to leak information in the system prompt by passing repeated uncommon tokens as input.
“Most LLMs are trained to respond to queries with a clear delineation between the user’s input and the system prompt,” security researcher Kenneth Yeung said in a Tuesday report.
“By creating a line of nonsensical tokens, we can fool the LLM into believing it is time for it to respond and cause it to output a confirmation message, usually including the information in the prompt.”
Another test involves using Gemini Advanced and a specially crafted Google document, with the latter connected to the LLM via the Google Workspace extension.
The instructions in the document could be designed to override the model’s instructions and perform a set of malicious actions that enable an attacker to have full control of a victim’s interactions with the model.
The disclosure comes as a group of academics from Google DeepMind, ETH Zurich, University of Washington, OpenAI, and the McGill University revealed a novel model-stealing attack that makes it possible to extract “precise, nontrivial information from black-box production language models like OpenAI’s ChatGPT or Google’s PaLM-2.”
That said, it’s worth noting that these vulnerabilities are not novel and are present in other LLMs across the industry. The findings, if anything, emphasize the need for testing models for prompt attacks, training data extraction, model manipulation, adversarial examples, data poisoning and exfiltration.
“To help protect our users from vulnerabilities, we consistently run red-teaming exercises and train our models to defend against adversarial behaviors like prompt injection, jailbreaking, and more complex attacks,” a Google spokesperson told The Hacker News. “We’ve also built safeguards to prevent harmful or misleading responses, which we are continuously improving.”
The company also said it’s restricting responses to election-based queries out of an abundance of caution. The policy is expected to be enforced against prompts regarding candidates, political parties, election results, voting information, and notable office holders.
Found this article interesting? Follow us on Twitter and LinkedIn to read more exclusive content we post.