of automating a significant number of tasks. Since the release of ChatGPT in 2022, we have seen more and more AI products on the market built on LLMs. However, there is still a lot of room for improvement in how we use them. Improving your prompt with an LLM prompt improver and taking advantage of cached tokens are, for example, two simple techniques that can vastly improve the performance of your LLM application.
In this article, I’ll discuss several specific techniques you can apply to the way you create and structure your prompts to reduce latency and cost while also increasing the quality of your responses. The goal is to present these techniques so you can immediately implement them in your own LLM application.

Why you should optimize your prompt
In many cases, you have a prompt that works with a given LLM and yields adequate results. However, you probably haven’t spent much time optimizing it, which leaves a lot of potential on the table.
I argue that with the specific techniques I’ll present in this article, you can both improve the quality of your responses and reduce costs with very little effort. Just because a prompt and an LLM work doesn’t mean the combination is performing optimally, and small, cheap changes often yield great improvements.
Specific techniques to optimize
In this section, I’ll cover the specific techniques you can utilize to optimize your prompts.
Always keep static content early
The first technique I’ll cover is to always keep static content early in your prompt. By static content, I mean content that remains the same across multiple API calls.
The reason you should keep static content early is that all the big LLM providers, such as Anthropic, Google, and OpenAI, support cached tokens. Cached tokens are input tokens that have already been processed in a previous API request and can therefore be processed more cheaply and quickly. Pricing varies from provider to provider, but cached input tokens usually cost around 10% of normal input tokens.
That means that if you send the same prompt twice in a row, the input tokens of the second request cost only around one tenth of what they cost in the first request. This works because the provider caches the processing of those input tokens, which makes your new request both cheaper and faster to process.
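To make this concrete with illustrative numbers (actual prices depend on your provider and model): say normal input tokens cost $3 per million and cached input tokens cost $0.30 per million. A request with 10,000 input tokens would normally cost $0.03 on the input side; if 8,000 of those tokens are cache hits, you pay 2,000 × $3/M + 8,000 × $0.30/M ≈ $0.0084, roughly a 70% reduction.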
In practice, you take advantage of this caching by keeping all variable content at the end of the prompt, so that the static prefix stays identical across requests.
For example, if you have a long system prompt with a question that varies from request to request, you should do something like this:
prompt = f"""
{long static system prompt}
{user prompt}
"""
For example:
prompt = f"""
You are a document expert ...
You should always reply in this format ...
If a user asks about ... you should answer ...
{user_question}
"""
Here, the static content of the prompt comes first, and the variable content (the user question) is placed last.
In some scenarios, you also want to feed in document contents. If you’re processing a lot of different documents, the document content is itself variable, so it should go after the static system prompt, together with the other variable content:
# if processing different documents
prompt = f"""
{static_system_prompt}
{variable_prompt_instruction_1}
{document_content}
{variable_prompt_instruction_2}
{user_question}
"""
However, suppose you’re processing the same documents multiple times. In that case, you can make sure the tokens of the document are also cached by ensuring no variables are put into the prompt beforehand:
# if processing the same documents multiple times,
# keep the document content before any variable instructions
prompt = f"""
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""
Note that prompt caching usually only kicks in if the first 1024 tokens are identical between two requests. If the static prefix of your prompt (the static system prompt, plus any static document content) is shorter than 1024 tokens, you won’t get any cached tokens at all.
# do NOT do this: variable content first prevents any use of cached tokens
prompt = f"""
{variable_content}
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""
To summarize: build your prompts with the most static content first (the content that varies the least from request to request) and the most dynamic content last (the content that varies the most from request to request).
- If you have a long system and user prompt without any variables, keep that first and add the variables at the end of the prompt
- If you are fetching text from documents and processing the same document multiple times, keep the document content before any variable instructions so that it is cached as well
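If you want to verify that your prompt structure actually produces cache hits, you can inspect the usage data the API returns. Below is a minimal sketch using the OpenAI Python SDK (the model name, system prompt, and questions are placeholder assumptions); other providers expose similar usage fields:
from openai import OpenAI

client = OpenAI()

# Imagine this static system prompt is longer than 1024 tokens
static_system_prompt = "You are a document expert ..."

def ask(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, use whichever you prefer
        messages=[
            {"role": "system", "content": static_system_prompt},
            {"role": "user", "content": user_question},
        ],
    )
    # Number of input tokens that were served from the cache
    print(response.usage.prompt_tokens_details.cached_tokens)
    return response.choices[0].message.content

ask("What is the refund policy described in the document?")
# The second call reuses the cached static prefix, so cached_tokens should now be > 0
ask("Who are the parties to the contract?")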
Question at the end
Another technique you should utilize to improve LLM performance is to always put the user question at the end of your prompt. Ideally, you organize it so that the system prompt contains all the general instructions, and the user prompt consists only of the user question, as below:
system_prompt = "You are a document expert ..."  # all general, static instructions
user_prompt = f"{user_question}"
In Anthropic’s prompt engineering docs, they state that placing the user question at the end, after long documents and other context, can improve response quality by up to 30%, especially when you are working with long contexts. Putting the question at the end makes it clearer to the model which task it is trying to achieve and will, in many cases, lead to better results.
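As a concrete illustration, here is a minimal sketch using the Anthropic Python SDK, with the long document placed first and the question at the very end of the user turn (the model name, document variable, and question are placeholder assumptions):
import anthropic

client = anthropic.Anthropic()

document_content = "..."  # long document text fetched elsewhere
user_question = "What are the termination clauses?"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name
    max_tokens=1024,
    system="You are a document expert ...",  # general, static instructions
    messages=[
        {
            "role": "user",
            # Long context first, the actual question at the very end
            "content": f"<document>\n{document_content}\n</document>\n\n{user_question}",
        }
    ],
)
print(response.content[0].text)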
Using a prompt optimizer
A lot of the time, prompts written by humans end up messy and inconsistent, with redundant content and little structure. Thus, you should always feed your prompt through a prompt optimizer.
The simplest prompt optimizer is an LLM itself: prompt it with something like “Improve this prompt: {prompt}”, and it will give you a more structured prompt with less redundant content, and so on.
An even better approach, however, is to use a specific prompt optimizer, such as one you can find in OpenAI’s or Anthropic’s consoles. These optimizers are LLMs specifically prompted and created to optimize your prompts, and will usually yield better results. Furthermore, you should make sure to include:
- Details about the task you’re trying to achieve
- Examples of tasks the prompt succeeded at, and the input and output
- Examples of tasks the prompt failed at, with the input and output
Providing this additional information will usually yield way better results, and you’ll end up with a much better prompt. In many cases, you’ll only spend around 10-15 minutes and end up with a way more performant prompt. This makes using a prompt optimizer one of the lowest effort approaches to improving LLM performance.
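If you just want the simple “LLM as prompt optimizer” version, a minimal sketch could look like the following (the meta-prompt wording, model name, and example task are assumptions to adapt to your own use case):
from openai import OpenAI

client = OpenAI()

current_prompt = "You are a document expert ..."  # the prompt you want to improve

meta_prompt = (
    "Improve the following prompt. Make it clearer and better structured, "
    "and remove redundant content.\n\n"
    "The task: answer user questions about long legal documents.\n\n"
    "Example where the prompt succeeded (input -> output):\n"
    "'Summarize section 3' -> a correct, well-formatted summary\n\n"
    "Example where the prompt failed (input -> output):\n"
    "'List all deadlines' -> missed two deadlines mentioned in the appendix\n\n"
    f"Prompt to improve:\n{current_prompt}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": meta_prompt}],
)
print(response.choices[0].message.content)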
Benchmark LLMs
The LLM you use will also significantly impact the performance of your LLM application. Different LLMs are good at different tasks, so you need to try out the different LLMs on your specific application area. I recommend at least setting up access to the biggest LLM providers like Google Gemini, OpenAI, and Anthropic. Setting this up is quite simple, and switching your LLM provider takes a matter of minutes if you already have credentials set up. Furthermore, you can consider testing open-source LLMs as well, though they usually require more effort.
You then need to set up a benchmark specific to the task you’re trying to achieve and see which LLM works best. Additionally, you should re-run the benchmark regularly, since the big LLM providers occasionally upgrade their models without necessarily releasing a new version. You should, of course, also be ready to try out new models from the large LLM providers as they come out.
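A benchmark can be as simple as a list of test questions with expected answers and a loop over the models you want to compare. Here is a minimal sketch (the wrapper functions, test cases, and keyword-based scoring are assumptions, swap in whatever metric fits your task):
# `models` maps a model name to any function that takes a prompt string and
# returns the model's answer as a string (e.g. thin wrappers around the
# OpenAI and Anthropic calls shown earlier).
def run_benchmark(models: dict, test_cases: list[tuple[str, str]]) -> dict:
    scores = {}
    for name, ask in models.items():
        correct = 0
        for question, expected_keyword in test_cases:
            answer = ask(question)
            # Naive scoring: does the expected keyword appear in the answer?
            if expected_keyword.lower() in answer.lower():
                correct += 1
        scores[name] = correct / len(test_cases)
    return scores

test_cases = [
    ("Which section of the contract covers refunds?", "section 4"),
    ("Who signed the agreement?", "Jane Doe"),
]

# scores = run_benchmark({"gpt-4o-mini": ask_openai, "claude-sonnet": ask_anthropic}, test_cases)
# print(scores)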
Conclusion
In this article, I’ve covered four different techniques you can utilize to improve the performance of your LLM application: utilizing cached tokens, putting the question at the end of the prompt, using prompt optimizers, and creating specific LLM benchmarks. These are all relatively simple to set up and can lead to a significant performance increase. Many similar, simple techniques exist, and you should always be on the lookout for them. They are usually described in blog posts, and Anthropic’s blog is one of the resources that has helped me improve LLM performance the most.