Case Study
Reducing LLM Costs by 66% Through RAG Pipeline Optimization
Our team at Roko Labs helped optimize a RAG-based enterprise AI system at Fitch Ratings by improving document chunking, retrieval precision, and prompt construction within the RAG pipeline. These changes reduced unnecessary context passed to the LLM, resulting in a 66% reduction in token usage and operational costs while maintaining output quality and improving overall system efficiency.

Client
Fitch Ratings
https://www.fitchratings.com/
Industry
Financial Services
Services
AI Due Diligence
Project Duration
<1 month

The Challenge
Fitch was leveraging a Retrieval-Augmented Generation (RAG) system to generate insights from a large corpus of financial articles and documents. While the enterprise AI system was functional, it suffered from high LLM token usage and rapidly increasing costs, making it inefficient to operate at scale. A closer technical evaluation revealed that the issue was not the underlying language model, but how data was being processed, retrieved, and passed into it. The RAG pipeline relied on inefficient document chunking and limited retrieval precision, which resulted in large volumes of irrelevant or low-value context being included in each query. This led to several downstream issues:

• Excessive token usage caused by oversized prompts
• Retrieval of low-relevance content that reduced output quality
• Rising operational costs without meaningful improvements in results
• An inefficient RAG pipeline design that did not prioritize retrieval precision

As a result, the system became more expensive to run and less effective for a production-grade enterprise AI environment. As one key insight from the engagement highlighted, “If your retrieval is wrong, everything after it becomes expensive.”
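To make the cost dynamic concrete, the sketch below shows how per-query prompt size scales with retrieval depth and chunk size. All figures here are illustrative assumptions, not measurements from this system:

```python
# Illustrative sketch: per-query prompt size as a function of how many
# chunks are retrieved and how large each chunk is. Every number below is
# assumed for illustration; none are measurements from the Fitch system.

def prompt_tokens(num_chunks: int, tokens_per_chunk: int,
                  instruction_tokens: int = 400) -> int:
    """Fixed instruction overhead plus the retrieved context."""
    return instruction_tokens + num_chunks * tokens_per_chunk

# Oversized, low-precision retrieval: many large, loosely relevant chunks.
bloated = prompt_tokens(num_chunks=15, tokens_per_chunk=800)   # 12,400 tokens
# Precise retrieval: a few small, semantically coherent chunks.
focused = prompt_tokens(num_chunks=5, tokens_per_chunk=300)    #  1,900 tokens

print(f"bloated prompt: {bloated:,} tokens per query")
print(f"focused prompt: {focused:,} tokens per query")
```

Because the context portion dominates the prompt, every extra or oversized chunk is paid for on every single query, which is why imprecise retrieval compounds into rapidly rising costs.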
Our Vision
The goal of the engagement was to demonstrate that we could reduce LLM operating costs and improve RAG efficiency without compromising output quality. A technical evaluation confirmed that inefficient retrieval was the primary cost driver, as unnecessary context was consistently being passed into the model. To solve this, the approach focused on optimizing how documents were segmented, retrieved, and assembled within the RAG pipeline, placing a strong emphasis on retrieval precision.

The document chunking strategy was redesigned so that content was broken into smaller, semantically coherent chunks, improving retrieval accuracy and relevance. Retrieval logic was then refined to prioritize high-quality, contextually relevant matches over volume, ensuring only the most relevant information was selected for each query. In parallel, prompt construction was optimized to eliminate redundant, low-value content, further reducing token usage while maintaining consistent output quality.

Together, these improvements aligned the system with modern enterprise RAG best practices, ensuring the LLM received focused, high-signal inputs instead of large, unfocused datasets. By restructuring the flow of data through the RAG pipeline, the system became more efficient, more cost-effective, and better aligned with scalable enterprise AI design principles.
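As a minimal sketch of the chunking idea, here is a simplified paragraph-based splitter that packs content into chunks under a token budget so no chunk straddles unrelated passages. The budget and the token heuristic are assumptions for illustration, not the production implementation:

```python
# Simplified sketch of semantically coherent chunking: split on paragraph
# boundaries, then pack whole paragraphs into chunks under a token budget.
# The 300-token budget and the word-based heuristic are assumptions.

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~0.75 words per token for English prose.
    return max(1, int(len(text.split()) / 0.75))

def chunk_document(text: str, max_tokens: int = 300) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = approx_tokens(para)
        # Start a new chunk rather than splitting mid-paragraph; an
        # oversized paragraph simply becomes its own chunk.
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Smaller, boundary-respecting chunks mean each embedding represents one idea, so the retriever can match queries precisely instead of dragging in whole documents for a single relevant sentence.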
Solution
We implemented targeted optimizations across the RAG pipeline, including:

• Redesigned document chunking strategy to improve semantic relevance
• Refined retrieval logic to increase retrieval precision
• Reduced unnecessary context passed to the LLM
• Optimized prompt construction for more efficient model usage

These changes improved how information was selected, structured, and delivered to the model, reducing waste while maintaining high-quality outputs.
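As an illustration of the retrieval-side changes, the sketch below filters retrieved chunks by relevance and packs only the best matches into a fixed context budget before building a lean prompt. The similarity threshold, budget, and the (score, text) interface are assumptions for the sketch, not details of the deployed system:

```python
# Sketch of precision-first retrieval and prompt assembly: keep only chunks
# above a similarity threshold, then fill a fixed context budget with the
# highest-scoring ones instead of forwarding everything the retriever finds.
# The threshold, budget, and data shapes are illustrative assumptions.

def select_context(scored_chunks: list[tuple[float, str]],
                   min_score: float = 0.75,
                   max_context_tokens: int = 1500) -> list[str]:
    # Drop low-relevance matches outright; volume does not help the model.
    relevant = [(s, c) for s, c in scored_chunks if s >= min_score]
    relevant.sort(key=lambda sc: sc[0], reverse=True)

    selected: list[str] = []
    used = 0
    for _, chunk in relevant:
        cost = max(1, len(chunk.split()))  # crude token proxy
        if used + cost > max_context_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

def build_prompt(question: str, context: list[str]) -> str:
    # Lean prompt: instructions stated once, then only high-signal context.
    joined = "\n\n".join(context)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{joined}\n\nQuestion: {question}")
```

The key design choice is that the context budget is enforced before the LLM call, so prompt size is bounded by construction rather than by whatever the retriever happens to return.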

Results: Significant Cost Reduction and Improved RAG Efficiency
The optimized RAG system delivered substantial improvements in both cost efficiency and performance. LLM token usage was reduced by approximately 66%, significantly lowering the cost of operating the enterprise AI system while maintaining output quality. By improving retrieval precision and reducing unnecessary context, the system became more efficient without requiring changes to the model itself. The improved chunking and retrieval strategy ensured that the LLM consistently received more relevant inputs, contributing to more accurate, reliable, and repeatable results. Overall, the optimized system achieved:

• 66% reduction in token usage and LLM costs
• Improved efficiency across the RAG pipeline
• Greater precision in retrieved context
• Better alignment between retrieval and generation
• Improved scalability for enterprise AI workloads

These results demonstrate that optimizing retrieval, chunking, and prompt construction can have a major impact on both cost and performance in enterprise RAG systems.
Improved efficiency
& precision of the RAG pipeline.
66%
reduction in token usage and cost.
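Reading the headline number back as simple arithmetic, the sketch below shows how a ~66% per-query token reduction compounds at scale. The per-query token counts, query volume, and price are hypothetical placeholders, not Fitch's actual figures:

```python
# Hypothetical scale arithmetic: a ~66% per-query token reduction applied
# to a monthly query volume. All figures are illustrative assumptions.

tokens_before = 6400        # assumed avg prompt tokens per query, before
tokens_after = 2200         # assumed avg prompt tokens per query, after
queries_per_month = 100_000  # assumed query volume
price_per_1k_tokens = 0.01   # assumed input price, USD per 1K tokens

def monthly_cost(tokens_per_query: int) -> float:
    return tokens_per_query / 1000 * price_per_1k_tokens * queries_per_month

reduction = 1 - tokens_after / tokens_before
print(f"token reduction: {reduction:.0%}")  # ~66%
print(f"monthly cost: ${monthly_cost(tokens_before):,.0f} "
      f"-> ${monthly_cost(tokens_after):,.0f}")
```

Because token spend scales linearly with query volume, a per-query reduction of this size translates directly into the same proportional cut in the monthly LLM bill.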