Reduce Cloud AI Sticker Shock with LLM Optimizations


As businesses turn to large language models (LLMs) to power productivity-boosting AI chatbots and content generation, the biggest hurdle is cost. According to a recent industry survey, enterprise cloud costs rose 30% in 2024, and 72% of IT and financial leaders said cloud costs were becoming unmanageable, primarily due to AI.1

Pennies per query for API calls to cloud-hosted LLMs can quickly turn into tens of thousands of dollars per month if left unchecked. This puts businesses in a double-bind: they can’t risk falling behind competitors who embrace AI, but it may be too costly to embrace AI themselves. The good news is that in the last few years, AI developers have discovered multiple techniques for optimizing LLMs to run leaner, faster, and cheaper. Understanding the value and importance of these techniques can help businesses curb costs and stay in the AI game for good.

Method 1: Cache Repetitive Queries

For any AI chatbot in any industry, there is a subset of commonly asked questions, and their responses don’t need to be regenerated time and time again. By caching these queries and their responses, and running incoming prompts through a filter that checks for semantically similar phrasing, AI services can return a cached response rather than calling the LLM. This saves significant budget and time, since cached replies are instant. Using this technique, also known as semantic caching, researchers were able to reduce LLM API calls by up to 68.8% across various query categories.2

Method 2: Deploy Two-Tier Architecture

Two-tier architecture, also known as a multi-LLM strategy, pairs an expensive, multi-billion-parameter model with a cheaper one that has significantly fewer parameters. When users send queries, the system classifies each as simple or complex, routing simple queries to the cheaper LLM and complex queries to the more expensive one. The reasoning is straightforward: not every user question requires the heavyweight inference of a hundred-billion-parameter LLM.
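The routing logic can be sketched in a few lines. The keyword-and-length heuristic below is an illustrative assumption; real deployments often use a small classifier model or the cheap LLM itself to grade query complexity.

```python
def classify(query: str) -> str:
    # Toy complexity heuristic: long queries or analytical phrasing
    # go to the "complex" tier; everything else stays "simple".
    complex_markers = ("why", "compare", "explain", "analyze", "step by step")
    if len(query.split()) > 20 or any(m in query.lower() for m in complex_markers):
        return "complex"
    return "simple"

def route(query: str, cheap_llm, expensive_llm) -> str:
    # Send the query to whichever model tier the classifier picked.
    model = expensive_llm if classify(query) == "complex" else cheap_llm
    return model(query)
```

Here `cheap_llm` and `expensive_llm` are any callables that take a prompt and return a completion, so the router stays agnostic to which providers sit behind each tier.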

Method 3: Shrink Model Bits with Quantization

Data is stored in bits, and the more bits you have, the more detail the data can hold. Most LLMs store parameters as 32- or 16-bit numbers, but these can be reduced to 8- or 4-bit numbers in a process known as quantization. This can shrink the model’s size by as much as eight times, although there is a tradeoff in accuracy because the parameter data is less precise. In many cases, this tradeoff is acceptable, especially when organizations combine quantization with other techniques such as retrieval-augmented generation (RAG) to ensure the most important questions pull answers verbatim from a proprietary knowledge base.
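To make the precision tradeoff concrete, here is a minimal sketch of symmetric linear quantization, the simplest of the schemes real quantization libraries build on: each float weight is mapped onto the int8 range [-127, 127] with a single scale factor, and dequantizing recovers the value only approximately.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    # Symmetric linear quantization: one scale maps floats onto [-127, 127].
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    # Recover approximate floats; the rounding error is the accuracy cost.
    return [v * scale for v in quantized]
```

Each weight now occupies 8 bits instead of 32, a 4x reduction, and every dequantized value lands within one scale step of the original, which is exactly the kind of small, bounded error that shows up as the few-percent accuracy gap discussed below.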

Just how much accuracy is lost depends on the use case. Red Hat recently tested quantization using various sizes of Llama 3.1 and found that quantized versions achieved a 96% recovery rate (that is, the rate at which the quantized model reached the same answer or solution as the non-quantized model) for academic tasks like grade school math and reasoning.3 For generating functional code, which is becoming a key focus of many tech companies, a 4-bit quantized model achieved a recovery rate of 98.9%.3 While those look like A+ numbers, keep in mind this still means an additional 1.1% to 4% chance of generating an inaccurate answer, multiplied by however many thousand queries an LLM might process per day.

Method 4: Pre-Compute, Pre-Fetch Answers

Similar to semantic caching, pre-computing responses to anticipated queries can help organizations balance workloads and navigate demand spikes while providing some users with instant responses. For example, generating daily reports in productivity software or forecasts in weather apps tends to be a consistent, predictable occurrence. Organizations can also generate responses leading up to specific events, such as new product launches, political rallies or elections, or anything in sports, media, and entertainment. By pre-computing and caching these responses, teams can leave their LLMs open for other tasks once the event finally happens.
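A sketch of the pattern, assuming a simple dictionary cache and a list of anticipated queries: a batch job runs the expensive LLM calls ahead of demand (for example, overnight), and the live path only falls back to the LLM for queries that weren’t anticipated.

```python
def precompute(anticipated_queries: list[str], llm, cache: dict) -> None:
    # Batch job: run anticipated queries through the LLM during off-peak hours.
    for query in anticipated_queries:
        cache[query] = llm(query)

def answer(query: str, llm, cache: dict) -> str:
    # Live path: serve from cache instantly, fall back to the LLM otherwise.
    return cache[query] if query in cache else llm(query)
```

In practice the cache would live in a shared store such as Redis rather than an in-process dict, but the division of labor is the same: the costly calls happen on your schedule, not the user’s.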

Method 5: Train Student Models with Teacher Models

In the two-tier architecture described previously, an expensive LLM runs alongside a cheaper model, and user queries are routed between them based on complexity. In the teacher-student approach, also known as knowledge distillation, organizations instead use the expensive LLM to generate synthetic training data that the smaller model trains on. The training data should be domain-specific, focused on Q&A pairs, explanations, and classifications across key topics that end users are likely to query. The result is a student LLM that’s smaller and cheaper to run but captures most of the teacher model’s accuracy on select subject matter.
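The data-generation half of this pipeline can be sketched as follows. The prompt wording and the `{"prompt": ..., "completion": ...}` record shape are illustrative assumptions; the teacher is any callable that returns a completion, and the resulting list would be handed to whatever fine-tuning job trains the student model.

```python
def build_distillation_set(topics: list[str], teacher) -> list[dict[str, str]]:
    # Use the expensive "teacher" model to synthesize domain-specific
    # Q&A pairs for each key topic end users are likely to ask about.
    dataset = []
    for topic in topics:
        question = teacher(f"Write a customer question about {topic}.")
        answer = teacher(f"Answer concisely: {question}")
        dataset.append({"prompt": question, "completion": answer})
    return dataset  # feed this to the student model's fine-tuning job
```

Because the teacher is only invoked during this offline generation step, its high per-query cost is paid once per training example rather than on every user request.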

The Best Approach Is Multiple Approaches

There’s no rule that says you can’t apply all these techniques together. Savvy AI developers are already layering their optimizations, with one team reportedly reducing their LLM API expenses by 80-90%.4 Every dollar saved can be channeled back into services, innovation, and the bottom line, while disciplined FinOps keeps businesses running smoothly as they experiment with and scale up AI.

  1. GenAI and AI Drive Cloud Expenses 30% Higher and 72% Say Spending is Unmanageable, Tangoe, October 2024.
  2. Sajal Regmi and Chetan Phakami Pun, GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching, arXiv, December 2024.
  3. Eldar Kurtić et al., We ran over half a million evaluations on quantized LLMs—here’s what we found, Red Hat, October 2024.
  4. Adnan Masood, Deploying LLMs in Production: Lessons from the Trenches, Medium, July 2025.