
My Experience Building a RAG System

28-August-2025

#ai



About a month ago, I gave an internal presentation to my team about agentic AI. The presentation, which was general and introductory in nature, aimed to get the team to a point where we could discuss the merits of the technology and evaluate how it could theoretically help us in any of our projects.

Very quickly, the juices started flowing. Beyond the commonplaces that discussions about AI always elicit--for example, a future without jobs, or whether AI is just a very loquacious parrot--the team decided that we needed to build a proof of concept (POC). Among the ideas, the most interesting was a Retrieval-Augmented Generation (RAG) system.

RAG is just a fancy way of searching for information. In addition to searching for the documents where the information you are looking for may be, you also use AI to answer the question, drawing both on what the AI model "knows" and on the documents (more like document pieces) retrieved during the search. I see two benefits here: (1) you can use more human-like queries, including extra information or context, and (2) if the AI response already provides the answer you are looking for, you avoid having to go into the documents themselves and read through them to find it.
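
As a mental model, the whole loop fits in a few lines of Python. The function names below are placeholders for the steps I just described, not code from our POC:

    def answer_with_rag(question: str) -> str:
        query_vector = embed(question)                 # vectorize the query
        chunks = search_similar_chunks(query_vector)   # retrieve relevant document pieces
        prompt = build_prompt(question, chunks)        # combine the query with the retrieved context
        return ask_llm(prompt)                         # let the model answer using both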

For this POC, we used a large set of internal Intel Wiki documents. The first step was to retrieve and vectorize the data: the documents were split into chunks of 200 words with a 10% overlap between them, and the vectors were created using the nomic-embed-text-v1.5 model, which uses Matryoshka Representation Learning (MRL). Given the nature of the project, the vectors were just dumped into a MongoDB database, although a vector database would ideally be a better fit.
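
For illustration, here is roughly what that ingestion step looks like in Python. The library choices (sentence-transformers for the embedding model, pymongo for storage) and the connection details are my assumptions for this sketch, not necessarily what the POC used:

    from pymongo import MongoClient
    from sentence_transformers import SentenceTransformer

    CHUNK_WORDS = 200
    OVERLAP_WORDS = 20  # 10% of 200

    def chunk_text(text):
        """Split a document into 200-word chunks with a 20-word overlap."""
        words = text.split()
        step = CHUNK_WORDS - OVERLAP_WORDS
        return [" ".join(words[i:i + CHUNK_WORDS]) for i in range(0, len(words), step)]

    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
    chunks_collection = MongoClient("mongodb://localhost:27017")["rag_poc"]["chunks"]

    def ingest(doc_id, text):
        chunks = chunk_text(text)
        # The nomic model expects a task prefix; "search_document: " marks corpus text
        vectors = model.encode(["search_document: " + c for c in chunks])
        chunks_collection.insert_many(
            [{"doc_id": doc_id, "text": c, "vector": v.tolist()} for c, v in zip(chunks, vectors)]
        )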

To find the best chunks for a particular query, the query is vectorized (using the same model) and compared with the stored chunks using cosine similarity. In my case, I used the top 40 chunks (or vectors). Finally, the query--along with the top chunks--is sent to the LLM as a user prompt with the following system prompt:

'You are a Reasoning-Augmented Generation (RAG) Agent. Your job is to use the context I pass you, along with other information you know, to answer the questions I ask you. If the answer is not in the context, try to answer it with what you know only, but please tell me that you did not use the context when you do not use it. Each text chunk in the context is prefixed by a document id like this [1234]. Along with the answer, I want you to also tell me (at the end, when you have finished your answer) the document ids of the chunks you have used to answer the question if you have use any. I want this part of the answer to be exactly like this: "Wiki Documents Used:"'.
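
For completeness, here is a rough sketch of the query side, reusing the embedding model and MongoDB collection from the ingestion snippet above. The OpenAI-compatible client and the model name are placeholders for whatever LLM endpoint you have available, not necessarily what we ran against:

    import numpy as np
    from openai import OpenAI

    TOP_K = 40
    client = OpenAI()  # or point base_url at a local / internal endpoint

    def retrieve(question):
        """Embed the query and return the TOP_K most similar stored chunks."""
        query_vec = model.encode("search_query: " + question)
        docs = list(chunks_collection.find({}))
        vectors = np.array([d["vector"] for d in docs], dtype=float)
        # Cosine similarity = dot product of L2-normalized vectors
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        query_vec = query_vec / np.linalg.norm(query_vec)
        scores = vectors @ query_vec
        return [docs[i] for i in np.argsort(scores)[::-1][:TOP_K]]

    def answer(question, system_prompt):
        chunks = retrieve(question)
        # Prefix every chunk with its document id so the model can cite it
        context = "\n\n".join("[{}] {}".format(c["doc_id"], c["text"]) for c in chunks)
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Context:\n" + context + "\n\nQuestion: " + question},
            ],
        )
        return response.choices[0].message.content

With a real vector database, the brute-force scan over every stored chunk would become an index lookup, but for a POC-sized corpus the scan is perfectly serviceable.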

Honestly, I wasn't expecting this to work as well as it did. I thought we would need to do more iterations and play around with the different parameters for both the vectorization process and the LLM. But I was wrong. It worked out of the box, providing useful answers drawn from many Wiki documents and helping us find information quickly.

We also did another iteration of this POC where, instead of Wiki documents, we used business process data in the form of graphs (encoded in JSON). Again, I was expecting this experiment to fail miserably, since the data wasn't free-form text this time but structured. We did the same thing (200-word chunks, 10% overlap, etc.) and, to our surprise, the agent could answer questions! This meant that, without being given any context about the format/structure of the graph data, the model was able to understand how the different parts of the graphs were connected and how information flowed through them.

