Do you have web applications where customers can interact with an LLM?
I recently watched an episode of The Routing Loop on AWS's Twitch channel where they discussed how to deploy a WAF to protect LLMs. On the surface, that doesn't sound like a cost optimization conversation, but I was happy to find the episode included an angle all about exactly that.
I want to take the time to share some of what I learned. I highly recommend giving it a listen!
I didn’t know this going in, but apparently as of 2022, 47% of internet traffic is bot traffic. That’s a wild statistic.
How does that impact you and your organization? Bots scrape websites, interact with web applications, and do all sorts of nefarious and non-nefarious things. But in the process, they’re sending and receiving data.
Why does that matter?
LLMs use tokens as a unit of consumption. Each token is worth roughly 4 characters, or a short word. While many are fairly efficient and cost friendly as you can see below and at the provided link, these costs add up at scale:
If bots are accessing a web facing LLM, you might be paying extra money for bot traffic to interact with your LLM.
Let’s say you’re using the Claude Instant model, the 2nd cheapest one here.
Suppose you had an average request of 1000 tokens and your average output was 150 tokens.
Quick math: ((1000 * $0.0008)/1000) + ((150 * $0.0024)/1000) = $0.00116 per request.
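That arithmetic generalizes to any per-1,000-token pricing, so here's a short sketch of it in Python (the prices are the Claude Instant figures quoted above, used for illustration, not current list prices):

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one LLM request, given prices quoted per 1,000 tokens."""
    return (input_tokens * input_price_per_1k / 1000
            + output_tokens * output_price_per_1k / 1000)

# The example from the text: 1,000 input tokens, 150 output tokens on Claude Instant.
claude_instant = cost_per_request(1000, 150, 0.0008, 0.0024)
print(f"${claude_instant:.5f} per request")  # $0.00116 per request
```

Swap in another model's input/output prices and the same function gives you its per-request cost.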
First off, that’s awesome value for money overall considering what you’re getting. However, we’re interested in cost optimization, which means not being content with letting dollars slide!
For every 1000 requests to the LLM, you’d pay $1.16. That’s excellent.
I recently worked with a customer whose online retail store gets over 400k visits monthly. If they had an internet-facing LLM, and 10% of those visitors (not sure what a normal capture rate is, to be honest) interacted with it, that would be 40k requests, assuming each user or bot entered just a single prompt.
(40,000 * $1.16)/1000 comes out to $46.40/month, which is still small. Even if users interacted a few times each, we're still only talking roughly $93-139/month, which is very acceptable.
However, play with these numbers more and you can see how it could get more expensive.
Prompts might have on average a larger number of input tokens, a larger number of output tokens, or simply happen more frequently. You might also use another, more expensive model.
Claude 2.0/2.1 is 10x as expensive as Claude Instant. So take that earlier monthly figure and 10x it: about $464.00/month.
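Scaling a per-request cost up to a monthly bill is the same multiplication each time, so here's a hedged sketch using the hypothetical retail-store numbers above (400k visits, 10% capture, one prompt per session):

```python
def monthly_llm_cost(monthly_visits: int, capture_rate: float,
                     prompts_per_session: float, cost_per_request: float) -> float:
    """Monthly LLM spend: visits that reach the LLM, times prompts, times unit cost."""
    requests = monthly_visits * capture_rate * prompts_per_session
    return requests * cost_per_request

# Claude Instant at ~$0.00116/request:
print(round(monthly_llm_cost(400_000, 0.10, 1, 0.00116), 2))  # 46.4
# A model at roughly 10x the unit cost (e.g. Claude 2.x per the text):
print(round(monthly_llm_cost(400_000, 0.10, 1, 0.0116), 2))   # 464.0
```

Bumping `prompts_per_session` or the capture rate shows how quickly the bill moves.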
Or, let’s take a peek at Jurassic-2 Mid:
Using the same math:
((1000 * $0.0125)/1000) + ((150 * $0.0125)/1000) = $0.014375 per request. Every 1000 requests would then cost about $14.38.
With that same store, they'd be paying roughly $575 monthly under the earlier assumptions. Like I said, you could play with the math and multiply that monthly cost further if the inputs and outputs increased. It all depends on the use case.
But even if you were spending $575.00 a month, wouldn't it be nice to claw back the hundreds of dollars attributable to bot traffic? Yes it would.
Applying a WAF at either the Load Balancer or the CloudFront distribution is one way of cutting down on bot traffic. Since bots account for roughly 47% of traffic, any reduction there brings costs down.
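To put a rough ceiling on the opportunity: if bots really are ~47% of traffic and a WAF filtered all of them, the saving is simply the bot share of the LLM bill. A sketch, using the hypothetical 40k-request store from earlier:

```python
def bot_savings(monthly_requests: int, bot_share: float, cost_per_request: float) -> float:
    """Monthly spend attributable to bot traffic, i.e. the upper bound on WAF savings."""
    return monthly_requests * bot_share * cost_per_request

# 40k requests/month, 47% bots, at Claude Instant vs. Jurassic-2 Mid unit costs.
print(round(bot_savings(40_000, 0.47, 0.00116), 2))    # 21.81 per month
print(round(bot_savings(40_000, 0.47, 0.014375), 2))   # 270.25 per month
```

Real-world blocking won't catch every bot, so treat these as best-case figures.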
AWS WAF with Bot Control is a tool for doing exactly that. You can use it to block pervasive bots, scrapers, crawlers, and other unwanted bot attention. This has an impact beyond LLMs, too: if you're using containers, FaaS, or other infrastructure that's activated by an incoming session, Bot Control could likewise reduce bot traffic and your bill.
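For a concrete sense of what enabling this looks like, here's a minimal boto3-style sketch of the Bot Control rule entry for a web ACL. The managed rule group identifier `AWSManagedRulesBotControlRuleSet` is AWS's; the rule name, priority, and metric name are illustrative values of my own:

```python
# Bot Control managed rule group entry, as it would appear in the Rules list of a
# wafv2 create_web_acl / update_web_acl call. Values other than the AWS vendor
# and rule-group names are illustrative, not required.
bot_control_rule = {
    "Name": "bot-control",   # arbitrary rule name of your choosing
    "Priority": 0,           # evaluate before your other rules
    "Statement": {
        "ManagedRuleGroupStatement": {
            "VendorName": "AWS",
            "Name": "AWSManagedRulesBotControlRuleSet",
        }
    },
    # Managed rule groups take an OverrideAction rather than an Action.
    "OverrideAction": {"None": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "bot-control",
    },
}

print(bot_control_rule["Statement"]["ManagedRuleGroupStatement"]["Name"])
```

You'd pass this inside `Rules=[...]` on the web ACL, with `Scope='CLOUDFRONT'` for a CloudFront distribution or `Scope='REGIONAL'` for a Load Balancer.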
Fortunately, WAF itself is an affordable service, so there's not much additional cost to run it.
During the episode, they cited a statistic that every $1.00 spent on bot prevention saves $270.00 in GenAI costs. I can't confirm that, but if it's true, it's a compelling reason to consider it.
Looking forward, I'd be willing to bet that many more LLMs will be set up as customer-facing and open to the public, and that bots will continue to grow, develop, and become more sophisticated over time.
With that in mind, it’s always good to look for ways to not only reduce security risks but also save some money while doing so. The intersection of both is awesome when it happens! Take advantage of it in your designs.