Case Study
Roko Labs cut token usage and operational costs by 66% on a RAG-based enterprise AI system at Fitch Ratings. The work targeted document chunking, retrieval precision, and prompt construction. Output quality held.
Client: Fitch Ratings
Project Duration: <1 month
The Challenge
Fitch used a Retrieval-Augmented Generation (RAG) system to generate insights from a large corpus of financial articles and documents. The system worked, but high LLM token usage and rapidly rising costs made it inefficient to operate at scale.
A closer technical evaluation revealed that the issue was not the underlying language model, but how data was being processed, retrieved, and passed into it. The RAG pipeline relied on inefficient document chunking and limited retrieval precision, resulting in large volumes of irrelevant or low‑value context being included in each query.
This led to several downstream issues:
• Excessive token usage caused by oversized prompts
• Retrieval of low-relevance content that reduced output quality
• Rising operational costs without meaningful improvements in results
• An inefficient RAG pipeline design that did not prioritize retrieval precision
As a result, the system became more expensive to run and less effective for a production-grade enterprise AI environment.
One key insight from the engagement: “If your retrieval is wrong, everything after it becomes expensive.”
Our Vision
The goal of the engagement was to demonstrate that we could reduce LLM operating costs and improve RAG efficiency without compromising output quality. A technical evaluation confirmed that inefficient retrieval was the primary cost driver, as unnecessary context was consistently being passed into the model.
To solve this, the approach focused on optimizing how documents were segmented, retrieved, and assembled within the RAG pipeline, placing a strong emphasis on precision in retrieval.
The document chunking strategy was redesigned to ensure content was broken into smaller, semantically coherent chunks, improving retrieval accuracy and relevance. Retrieval logic was then refined to prioritize high-quality, contextually relevant matches over volume, ensuring only the most relevant information was selected for each query.
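The case study does not publish the chunking code, so the following is only a minimal sketch of what "smaller, semantically coherent chunks" could look like. It assumes that paragraph and sentence boundaries are a reasonable stand-in for semantic coherence; the function name `chunk_text` and the `max_words` budget are hypothetical and not taken from the engagement.

```python
import re


def chunk_text(text: str, max_words: int = 120) -> list[str]:
    """Split text into chunks that never cross a paragraph boundary and
    pack whole sentences up to roughly max_words words per chunk.

    Assumption: word count is used as a rough proxy for token count.
    A single sentence longer than max_words becomes its own oversized chunk.
    """
    chunks: list[str] = []
    # Paragraphs are assumed to be separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        current: list[str] = []
        count = 0
        for sentence in sentences:
            n = len(sentence.split())
            # Close the current chunk if this sentence would exceed the budget.
            if current and count + n > max_words:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(sentence)
            count += n
        if current:
            chunks.append(" ".join(current))
    return chunks
```

Keeping sentences whole and stopping at paragraph breaks means each chunk tends to cover one idea, which is what lets the retriever return a small, relevant passage instead of a large mixed one.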
In parallel, prompt construction was optimized to eliminate redundant, low-value content. This further reduced token usage while maintaining consistent output quality. Together, these improvements aligned the system with modern enterprise RAG best practices, ensuring the LLM received focused, high-signal inputs instead of large, unfocused context.
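One way to picture the prompt optimization described above is a builder that drops duplicate chunks and stops adding context once a budget is reached. This is an illustrative sketch, not the production prompt logic: `build_prompt`, the `max_context_words` parameter, and the prompt template are all hypothetical, and word count again stands in for a real tokenizer.

```python
def build_prompt(question: str, chunks: list[str], max_context_words: int = 300) -> str:
    """Assemble a prompt from retrieved chunks without repeats and within a budget.

    Assumption: chunks arrive ordered from most to least relevant.
    """
    seen: set[str] = set()
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        # Normalize case and whitespace so trivially different copies match.
        key = " ".join(chunk.lower().split())
        if key in seen:
            continue
        n = len(chunk.split())
        # Stop at the first chunk that would overflow the context budget.
        if used + n > max_context_words:
            break
        seen.add(key)
        selected.append(chunk)
        used += n
    context = "\n\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Because the input is assumed to be sorted by relevance, cutting off at the budget removes the least useful context first.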
By restructuring the flow of data through the RAG pipeline, the system became more efficient, more cost-effective, and better aligned with scalable enterprise AI design principles.
Solution
We implemented targeted optimizations across the RAG pipeline, including:
• Redesigned document chunking strategy to improve semantic relevance
• Refined retrieval logic to increase precision
• Reduced unnecessary context passed to the LLM
• Optimized prompt construction for more efficient model usage
These changes improved how information was selected, structured, and delivered to the model, reducing waste while maintaining high-quality outputs.
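To make "precision over volume" concrete, the sketch below filters candidates by a minimum similarity score before applying a top-k cut, so weak matches are never passed to the model just to fill a quota. It assumes chunks have already been embedded as plain lists of floats; `retrieve`, `top_k`, and `min_score` are hypothetical names and values, and the actual retrieval stack used in the engagement is not described in the source.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: list[float],
    indexed_chunks: list[tuple[str, list[float]]],
    top_k: int = 3,
    min_score: float = 0.75,
) -> list[tuple[float, str]]:
    """Return at most top_k (score, text) pairs scoring at least min_score."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed_chunks]
    # Drop low-relevance matches first, then keep the best of what remains.
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

With a threshold in place, a query with only one strong match returns one chunk rather than padding the prompt with marginal ones.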

Results: Significant Cost Reduction and Improved RAG Efficiency
The optimized RAG system cut LLM token usage by 66%. Operational costs dropped with it. Output quality held.
Retrieval precision and reduced context per call drove the improvement, with no changes to the model itself. The new chunking and retrieval strategy feeds the LLM more relevant inputs, producing more consistent results across queries.
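The link between the token reduction and the cost reduction can be checked with simple arithmetic. The sketch below assumes cost is billed in proportion to tokens at a single flat price; the token counts and price in the usage note are made-up illustrative values, not figures from the engagement.

```python
def token_reduction(tokens_before: int, tokens_after: int) -> float:
    """Fraction of tokens saved, e.g. 0.66 for a 66% reduction."""
    return 1 - tokens_after / tokens_before


def call_cost(tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of one call under flat per-token pricing (an assumption)."""
    return tokens / 1000 * price_per_1k_tokens
```

For example, with hypothetical values, a call that drops from 12,000 to 4,080 tokens gives a `token_reduction` of about 0.66, and at any flat price the `call_cost` falls by the same proportion. If input and output tokens are priced differently, the cost saving would differ from the token saving.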
The optimized system delivered:
• 66% reduction in token usage and LLM costs
• Improved efficiency across the RAG pipeline
• Greater precision in the retrieved context
• Better alignment between retrieval and generation
• Improved scalability for enterprise AI workloads
Retrieval, chunking, and prompt construction are where RAG cost lives. The model is rarely the problem.