Case Study
Reducing LLM Costs by 66% Through RAG Pipeline Optimization
Our team at Roko Labs helped optimize a RAG-based enterprise AI system at Fitch Ratings by improving document chunking, retrieval precision, and prompt construction within the RAG pipeline. These changes reduced unnecessary context passed to the LLM, resulting in a 66% reduction in token usage and operational costs while maintaining output quality and improving overall system efficiency.

Client
Fitch Ratings
https://www.fitchratings.com/
Industry
Financial Services
Services
AI Due Diligence
Project Duration
<1 month

The Challenge
Fitch was leveraging a Retrieval-Augmented Generation (RAG) system to generate insights from a large corpus of financial articles and documents. While the enterprise AI system was functional, it suffered from high LLM token usage and rapidly increasing costs, making it inefficient to operate at scale. A closer technical evaluation revealed that the issue was not the underlying language model, but how data was being processed, retrieved, and passed into it. The RAG pipeline relied on inefficient document chunking and limited retrieval precision, which resulted in large volumes of irrelevant or low-value context being included in each query. This led to several downstream issues:

• Excessive token usage caused by oversized prompts
• Retrieval of low-relevance content that reduced output quality
• Rising operational costs without meaningful improvements in results
• An inefficient RAG pipeline design that did not prioritize retrieval precision

As a result, the system became more expensive to run and less effective for a production-grade enterprise AI environment. As one key insight from the engagement highlighted, “If your retrieval is wrong, everything after it becomes expensive.”
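To make the cost dynamic concrete, the sketch below shows how per-query prompt size scales with retrieval depth and chunk size. All figures here are illustrative assumptions, not measurements from this system:

```python
# Illustrative sketch: per-query prompt size as a function of how many
# chunks are retrieved and how large each chunk is. Every number below is
# assumed for illustration; none are measurements from the Fitch system.

def prompt_tokens(num_chunks: int, tokens_per_chunk: int,
                  instruction_tokens: int = 400) -> int:
    """Fixed instruction overhead plus the retrieved context."""
    return instruction_tokens + num_chunks * tokens_per_chunk

# Oversized, low-precision retrieval: many large, loosely relevant chunks.
bloated = prompt_tokens(num_chunks=15, tokens_per_chunk=800)   # 12,400 tokens
# Precise retrieval: a few small, semantically coherent chunks.
focused = prompt_tokens(num_chunks=5, tokens_per_chunk=300)    #  1,900 tokens

print(f"bloated prompt: {bloated:,} tokens per query")
print(f"focused prompt: {focused:,} tokens per query")
```

Because the context portion dominates the prompt, every extra or oversized chunk is paid for on every single query, which is why imprecise retrieval compounds into rapidly rising costs.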
Our Vision
The goal of the engagement was to demonstrate that we could reduce LLM operating costs and improve RAG efficiency without compromising output quality. A technical evaluation confirmed that inefficient retrieval was the primary cost driver, as unnecessary context was consistently being passed into the model. To solve this, the approach focused on optimizing how documents were segmented, retrieved, and assembled within the RAG pipeline, placing a strong emphasis on retrieval precision.

The document chunking strategy was redesigned so that content was broken into smaller, semantically coherent chunks, improving retrieval accuracy and relevance. Retrieval logic was then refined to prioritize high-quality, contextually relevant matches over volume, ensuring only the most relevant information was selected for each query. In parallel, prompt construction was optimized to eliminate redundant, low-value content, further reducing token usage while maintaining consistent output quality.

Together, these improvements aligned the system with modern enterprise RAG best practices, ensuring the LLM received focused, high-signal inputs instead of large, unfocused datasets. By restructuring the flow of data through the RAG pipeline, the system became more efficient, more cost-effective, and better aligned with scalable enterprise AI design principles.
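As a minimal sketch of the chunking idea, here is a simplified paragraph-based splitter that packs content into chunks under a token budget so no chunk straddles unrelated passages. The budget and the token heuristic are assumptions for illustration, not the production implementation:

```python
# Simplified sketch of semantically coherent chunking: split on paragraph
# boundaries, then pack whole paragraphs into chunks under a token budget.
# The 300-token budget and the word-based heuristic are assumptions.

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~0.75 words per token for English prose.
    return max(1, int(len(text.split()) / 0.75))

def chunk_document(text: str, max_tokens: int = 300) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = approx_tokens(para)
        # Start a new chunk rather than splitting mid-paragraph; an
        # oversized paragraph simply becomes its own chunk.
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Smaller, boundary-respecting chunks mean each embedding represents one idea, so the retriever can match queries precisely instead of dragging in whole documents for a single relevant sentence.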
Solution
We implemented targeted optimizations across the RAG pipeline, including:

• Redesigned document chunking strategy to improve semantic relevance
• Refined retrieval logic to increase retrieval precision
• Reduced unnecessary context passed to the LLM
• Optimized prompt construction for more efficient model usage

These changes improved how information was selected, structured, and delivered to the model, reducing waste while maintaining high-quality outputs.
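As an illustration of the retrieval-side changes, the sketch below filters retrieved chunks by relevance and packs only the best matches into a fixed context budget before building a lean prompt. The similarity threshold, budget, and the (score, text) interface are assumptions for the sketch, not details of the deployed system:

```python
# Sketch of precision-first retrieval and prompt assembly: keep only chunks
# above a similarity threshold, then fill a fixed context budget with the
# highest-scoring ones instead of forwarding everything the retriever finds.
# The threshold, budget, and data shapes are illustrative assumptions.

def select_context(scored_chunks: list[tuple[float, str]],
                   min_score: float = 0.75,
                   max_context_tokens: int = 1500) -> list[str]:
    # Drop low-relevance matches outright; volume does not help the model.
    relevant = [(s, c) for s, c in scored_chunks if s >= min_score]
    relevant.sort(key=lambda sc: sc[0], reverse=True)

    selected: list[str] = []
    used = 0
    for _, chunk in relevant:
        cost = max(1, len(chunk.split()))  # crude token proxy
        if used + cost > max_context_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected

def build_prompt(question: str, context: list[str]) -> str:
    # Lean prompt: instructions stated once, then only high-signal context.
    joined = "\n\n".join(context)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{joined}\n\nQuestion: {question}")
```

The key design choice is that the context budget is enforced before the LLM call, so prompt size is bounded by construction rather than by whatever the retriever happens to return.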

Results: Significant Cost Reduction and Improved RAG Efficiency
The optimized RAG system delivered substantial improvements in both cost efficiency and performance. LLM token usage was reduced by approximately 66%, significantly lowering the cost of operating the enterprise AI system while maintaining output quality. By improving retrieval precision and reducing unnecessary context, the system became more efficient without requiring changes to the model itself. The improved chunking and retrieval strategy ensured that the LLM consistently received more relevant inputs, contributing to more accurate, reliable, and repeatable results. Overall, the optimized system achieved:

• 66% reduction in token usage and LLM costs
• Improved efficiency across the RAG pipeline
• Greater precision in retrieved context
• Better alignment between retrieval and generation
• Improved scalability for enterprise AI workloads

These results demonstrate that optimizing retrieval, chunking, and prompt construction can have a major impact on both cost and performance in enterprise RAG systems.
Improved efficiency
& precision of the RAG pipeline.
66%
reduction in token usage and cost.
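Reading the headline number back as simple arithmetic, the sketch below shows how a ~66% per-query token reduction compounds at scale. The per-query token counts, query volume, and price are hypothetical placeholders, not Fitch's actual figures:

```python
# Hypothetical scale arithmetic: a ~66% per-query token reduction applied
# to a monthly query volume. All figures are illustrative assumptions.

tokens_before = 6400        # assumed avg prompt tokens per query, before
tokens_after = 2200         # assumed avg prompt tokens per query, after
queries_per_month = 100_000  # assumed query volume
price_per_1k_tokens = 0.01   # assumed input price, USD per 1K tokens

def monthly_cost(tokens_per_query: int) -> float:
    return tokens_per_query / 1000 * price_per_1k_tokens * queries_per_month

reduction = 1 - tokens_after / tokens_before
print(f"token reduction: {reduction:.0%}")  # ~66%
print(f"monthly cost: ${monthly_cost(tokens_before):,.0f} "
      f"-> ${monthly_cost(tokens_after):,.0f}")
```

Because token spend scales linearly with query volume, a per-query reduction of this size translates directly into the same proportional cut in the monthly LLM bill.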