LLM Prompt Injection - Vaccination

Ben Ashley
Apr 16
6 min read

This blog is a continuation of a series on prompt injection. Starting here will be like starting at Season 8 of Game of Thrones. You’ll be confused, angry, and disappointed.

So, you’re worried about LLM prompt injection. Good. But you still want to leverage LLMs in your products. Well, as I see it, you’ve got two options:

Attempt to mitigate the risk
Close your eyes and cry, “Full speed ahead!”

Let’s assume you’re going with the former strategy. If you plan to forge ahead with wishful thinking, you probably shouldn’t read any further. There’s no need to upset yourself.

Okay, are they gone? Good.

Let’s do a quick refresher. LLM prompt injection (also referred to as indirect prompt injection to distinguish it from jailbreaking) is when somebody slips some naughty instructions to your LLM, with unfortunate results.

For the decorative image of a system being injected, I recommend using this alt-text:
"Illustrative graphic showing a computer system being injected with malicious instructions, visualiisng the concept of LLM prompt injection attacks

Why is it So Important to Prevent LLM Prompt Injection?

Well, we want to use LLMs to help us do stuff. Sure, it’s cool asking it to write limericks about squid, but it would be really cool if we could ask them to take actions like create files, write code, or send emails… about squid. Unfortunately, if we want them to take action, that usually involves two things:

Feeding them untrusted inputs (from emails, files, websites, or APIs)
Giving them the privileges required to perform actions

If number 1 leads to the LLM doing number 2 wrong, bad things can happen. So, without preventing prompt injection, we’re either left unable to do these really cool things or unable to do them securely.

Limerick about a squid, referenced humorously in the article's discussion of LLM prompt injection risks

LLM Prompt Injection Be Prevented?

First, the bad news. As far as I can tell, there is no way to prevent prompt injection entirely. There may be at some stage, and there are compelling reasons for many people with deep pockets to come up with one, but so far, they haven’t. With that in mind, let’s look at three popular methods we can use to attempt to protect ourselves:

Fine-tuning
System prompts
An AI firewall

To keep everything easy to understand, we’ll use my patent-pending “Confused Grandpa” metaphor to explain each. Somebody gave Grandpa a computer with access to the Internet, and now we’re desperately trying to stop people from taking advantage of him. Let’s see how each strategy aims to do this.

Fine-Tuning

Fine-tuning an LLM involves taking an LLM that has already been pre-trained and then doing some more training to make it better suited for something more specific. There are several ways to do it and several things to keep in mind when doing so (e.g. Will it remove existing learning from the model?).

In this scenario, we retrain our model to be more secure. It's kind of like sending your Grandpa on one of those ‘How to Use the Internet Securely’ courses at the local library because you’re sick of cleaning viruses off his laptop… although hopefully with more success.

For example, you could try to train it to recognise attempts at prompt injection or to distinguish between “system” prompts and less privileged prompts (remember, everything is just text to the LLM). An example of this is OpenAI’s instruction hierarchy, which does improve resilience to prompt injection but, as shown here by Wunderwuzzi, has not eliminated it as a risk.

System Prompts

“Grandpa, stop opening links from the Nigerian prince…. Pretty please?”

Okay, this is a bit of a strawman, but sometimes attempting to harden an LLM against prompt injection using system prompts feels a bit like begging it not to get hacked. In this situation, instead of changing the model itself, you provide a “system” prompt that is supposed to prevent prompt injection, whether that’s telling it to disregard any attempts to change its purpose, or a popular one: telling it that text wrapped in delimiters (for example: ```) should be treated as untrusted. Again, this can reduce the likelihood of suffering prompt injection… but not eliminate it.

Let’s look at an example. I’m writing a bot to summarise text, so I tell it that the text to summarise is wrapped in ``` and that it shouldn’t execute any commands it finds between them:

Screenshot showing a failed system prompt attempt where an LLM ignores instructions not to execute commands between delimiters, demonstrating prompt injection vulnerability

Sorry, Grandpa

Oh dear.

AI Firewall

The final option we’ll be looking at is the AI firewall. To brutally torture the metaphor, here we’re paying somebody else’s Grandpa to read your Grandpa’s email first to check that he’s not going to be scammed. This adds an additional layer of protection in front of our LLM, which has been trained to detect prompt injection attempts. Perhaps it strips out the prompt injection or completely disallows the input. Another option would be to check the output for signs of a successful injection attempt. This would be like intercepting Pop-pop’s emails to ensure he hasn’t put his credit card number in there.

In a story that should be familiar by now, this can greatly improve the resilience to prompt injection but doesn’t provide complete protection, no matter how tech-savvy the other Grandpa is. This makes sense, as you’re just moving the goalposts with this measure. Sure, the specific injection-preventing model is probably going to be pretty darn good at it, but it’s still non-deterministic.

Round-Up

The last word of the previous sentence sums up the issue with all the options we’ve examined. They’re non-deterministic. So, although they can reduce the risk of Grandpa sending all his life savings to the Caymans, they can’t eliminate it entirely.

It might be helpful to contrast prompt injection with another type of injection that does have a deterministic method of prevention. SQL injection involves tricking a program into evaluating naughty commands (often from user input like a form) against an SQL database as though they were code.

For example, the application might check that a provided username doesn’t already exist. If the provided username is actually SQL code that deletes all your tables and then creates one called “HahaOwned,” then instead of checking if the username exists, the application will destroy the database and your chance of passing your probation period (or so I’ve heard).

Pretty scary, right? Well, yeah. But there is a 100% foolproof way to prevent it. It’s possible to tell the database which bits of what you send to it are code and which aren’t. It simply won't execute the username as code, no matter what. Sure, the developer can screw it up and you can still end up with vulnerabilities, but assuming you do this right, you’re golden.

LLMs don’t have anything similar, and as far as I can tell, their architecture as it stands will not allow them to have anything similar. Everything that gets fed into the LLM is used to generate the output, and there is no way to deterministically say to the model, “Hey, the next 40 words come from the user and are probably evil”.

So, even though we can (and should) drastically reduce the chance of prompt injection, I have yet to see a feasible** solution that can rule it out entirely.

Well, that was depressing. (Yeah, sorry.)

Where To Go from Here?

Well, as a consultant, I’m contractually obligated to say: “It depends”.

I’m not here to tell you not to build products with LLMS, just like I’m not going to tell you to cut off your Grandpa’s Internet access (just…. maybe keep him off Twitter, okay?) However, it is important to seriously consider your risk profile and risk appetite when you do:

How trusted is the data you’ll be feeding to your LLM?
How resistant is the LLM likely to be to prompt injection?
How bad will it be if it does succumb?

In the next blog, I'll explain why everything will be okay, and you can sit back and relax.

Nah, I’m kidding. We’ll be looking at the MCP standard and how it has the potential to super-charge prompt injection vulnerabilities.

** Simon Willison proposes and then immediately tears to pieces the Dual-LLM model, which I believe would go a long way towards preventing injection.

Resources

Here’s a list of really interesting and helpful resources about prompt injection and mitigation:

Embrace the Red: Wunderwuzzi's Blog: A great source of information about attacks on LLMs
Simon Willison's Blog: a series of thoroughly interesting articles about prompt injection
Jatmo: Interesting paper about fine-tuning custom models to perform targeted tasks, thus drastically reducing surface area for injection
The Instruction Hierarchy: OpenAI’s paper describing their instruction hierarchy mitigation
Signed Prompt: Interesting paper exploring “signing” dangerous instructions to try to avoid prompt injection

About the Author

Ben Ashley is a .NET developer with 9 years of experience working in the Federal government, State government and the private sector. He focuses on application security and analytics, with a background in mathematics and statistical programming. Ben loves solving difficult problems and getting immersed in a complex business domain.