Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) is a technique that lets an AI look up relevant information from an outside source — like a company's documents or a database — and use what it finds to answer your question, instead of relying only on what it memorized during training.
What is Retrieval-Augmented Generation (RAG)?
A large language model answers from memory. Everything it "knows" was baked in when it was trained, which creates two stubborn problems: it has no knowledge of anything that happened after training finished, and it knows nothing about your private information — your company's internal policies, your product manuals, last week's support tickets. Worse, when it doesn't know something, it often doesn't say so; it produces a confident, plausible-sounding answer that may simply be wrong. Retrieval-augmented generation is the most common fix. Before the model writes its answer, the system first retrieves the most relevant pieces of text from a source you control, hands them to the model along with your question, and asks it to answer using that material. The model still does the writing, but now it is working from supplied facts rather than memory alone.
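The retrieve-then-answer flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: `build_prompt`, `answer_with_rag`, and the `retrieve` and `generate` callables are hypothetical names standing in for whatever search index and language model a real system would use.

```python
from typing import Callable, List


def build_prompt(question: str, passages: List[str]) -> str:
    """Combine retrieved passages and the user's question into one model input."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_with_rag(
    question: str,
    retrieve: Callable[[str], List[str]],  # hypothetical: question -> relevant passages
    generate: Callable[[str], str],        # hypothetical: prompt -> model output
) -> str:
    """Retrieve first, then let the model write from the supplied material."""
    passages = retrieve(question)
    prompt = build_prompt(question, passages)
    return generate(prompt)
```

Passing `retrieve` and `generate` in as functions keeps the sketch independent of any particular vector database or model provider; the only fixed part is the order of the steps.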
Your documents are broken into chunks and stored in a way that lets the system find passages by meaning rather than exact keywords, an approach called semantic search. It rests on two pieces: embeddings, which turn each chunk into a list of numbers that captures its meaning, and a vector database, which stores those numbers and quickly finds the chunks whose numbers sit closest to the question's. Together they let a computer judge how closely two pieces of text are related in topic even when they share no words. This is what separates RAG from a traditional keyword search: it can match your question to the right passage even when the wording is completely different. When a question comes in, the system grabs the handful of chunks most relevant to it and slots them into the model's input. So a question about your refund policy pulls in the actual paragraphs from your policy document, and the model answers from those. Because the source text is right there, many RAG systems can also point back to exactly which document the answer came from, something a plain language model cannot do reliably.
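The chunk, embed, and search steps can be sketched as follows. One loud caveat: the `embed` function here is a toy word-count vector used only so the example runs without external libraries. It matches on shared words, so it does not demonstrate matching by meaning; a real system would swap in a learned embedding model and usually a vector database. All function names, file names, and document texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def chunk_text(text: str, max_words: int = 40) -> List[str]:
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Dict[str, float]:
    """TOY stand-in for an embedding model: a sparse word-count vector."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(docs: Dict[str, str], max_words: int = 40) -> List[Tuple[str, str, Dict[str, float]]]:
    """Return one (source name, chunk text, vector) entry per chunk."""
    return [
        (name, chunk, embed(chunk))
        for name, text in docs.items()
        for chunk in chunk_text(text, max_words)
    ]


def top_k(question: str, index, k: int = 3) -> List[Tuple[float, str, str]]:
    """Score every chunk against the question and keep the k best."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, vec), name, chunk) for name, chunk, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical documents, for illustration only.
docs = {
    "refund_policy.txt": "Customers may request a refund for any unused product.",
    "shipping.txt": "Orders ship from the warehouse by courier.",
}
index = build_index(docs)
best_score, best_source, best_chunk = top_k("How do I get a refund?", index, k=1)[0]
```

Keeping the source name next to each chunk is what later lets the system say which document an answer came from.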
RAG has become popular because it is a practical, relatively affordable way to make a general-purpose model useful on specific, current, or private information without the expense of retraining it. It noticeably reduces hallucination by grounding answers in real source material, and when the underlying information changes you just update the documents rather than rebuilding the model. It is not magic, though: if the retrieval step pulls the wrong passages, or the source documents themselves are wrong or out of date, the answer will be too. The quality of a RAG system depends as much on the library it draws from and how well it finds the right passage as on the language model doing the talking.
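The point about updating documents rather than rebuilding the model can be illustrated with a small in-memory store. This is a sketch under simplifying assumptions: `DocumentStore` is a hypothetical class, retrieval is plain word overlap rather than embeddings, and the refund periods in the sample text are made-up placeholder values.

```python
import re
from typing import Dict, Optional, Set, Tuple


def _words(text: str) -> Set[str]:
    """Lowercased set of words and numbers in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DocumentStore:
    """Hypothetical in-memory store; retrieval is word overlap, for brevity."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        """Add a document, or replace the previous version with the same id."""
        self._docs[doc_id] = text

    def retrieve(self, question: str) -> Optional[Tuple[str, str]]:
        """Return (doc_id, text) of the best-overlapping document, or None."""
        q = _words(question)
        best_id, best_overlap = None, 0
        for doc_id, text in self._docs.items():
            overlap = len(q & _words(text))
            if overlap > best_overlap:
                best_id, best_overlap = doc_id, overlap
        return None if best_id is None else (best_id, self._docs[best_id])


store = DocumentStore()
question = "Within how many days are refunds accepted?"

# Placeholder policy text; the numbers are invented for the example.
store.upsert("refund_policy", "Refunds are accepted within 30 days of purchase.")
before = store.retrieve(question)

# The policy changes: only the stored document is edited, the model is untouched.
store.upsert("refund_policy", "Refunds are accepted within 60 days of purchase.")
after = store.retrieve(question)
```

The same question now retrieves the new text, so the next answer reflects the updated policy without any retraining step.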
Real-world example
A new hire messages the company's internal AI assistant: "How many vacation days do I get, and can I carry unused ones into next year?" A plain language model would have to guess, since it was never trained on that particular company's rules — and it might guess wrong while sounding completely sure. A RAG-based assistant instead searches the company handbook, pulls the two paragraphs that actually cover leave entitlement and carryover, and answers from those — often adding "according to the employee handbook, section 4" so the person can check for themselves. Same friendly chat experience, but the answer is anchored to the company's real policy rather than the model's best guess.
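The "according to the employee handbook, section 4" pointer in this example is possible because each retrieved passage can carry a record of where it came from. A minimal sketch, assuming a hypothetical `Passage` record and helper functions, with placeholder text instead of real handbook content:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    source: str   # document the passage came from
    section: str  # section number or heading within that document
    text: str     # the passage itself


def format_context(passages: List[Passage]) -> str:
    """Label each passage with its origin so the model can cite it."""
    return "\n\n".join(
        f"[{p.source}, section {p.section}]\n{p.text}" for p in passages
    )


def citation_line(passages: List[Passage]) -> str:
    """Build a readable pointer to the sources used, without duplicates."""
    labels: List[str] = []
    for p in passages:
        label = f"{p.source}, section {p.section}"
        if label not in labels:
            labels.append(label)
    return "According to " + "; ".join(labels)


# Placeholder passages; a real system would fill these from the handbook.
passages = [
    Passage("the employee handbook", "4", "<paragraph on leave entitlement>"),
    Passage("the employee handbook", "4", "<paragraph on carryover>"),
]
context = format_context(passages)
citation = citation_line(passages)
```

Here the citation is assembled from retrieval metadata rather than written by the model, which is one way to keep the pointer accurate even if the generated wording drifts.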
Frequently asked questions
What is the difference between RAG and fine-tuning?
Both make a general model more useful for your needs, but in different ways. Fine-tuning adjusts the model itself by training it further on your data, changing its internal parameters (its weights). That is good for teaching a consistent style or skill, but slow and costly to redo every time your information changes. RAG leaves the model untouched and instead feeds it the right reference material at the moment you ask. The rule of thumb: fine-tuning changes how the model behaves; RAG changes what facts it has in front of it. Many real systems use both.
Does RAG stop AI hallucinations?
It reduces them, but does not eliminate them. By grounding answers in retrieved source text, RAG gives the model real facts to work from instead of leaving it to invent plausible-sounding ones, which cuts down on confident errors considerably. But the model can still misread a passage, blend sources clumsily, or fall back on its own memory — and if the retrieval step fetches the wrong material, the answer suffers. RAG makes hallucination less likely, not impossible.
Why use RAG instead of a bigger or newer model?
Because size and freshness don't solve the core problem. Even the largest, most recent model still has a training cutoff and still knows nothing about your private documents. RAG is how you give any model access to current and proprietary information without retraining it, and it lets you update what the system knows just by editing the underlying documents. It is also usually far cheaper than training a bespoke model, which is a big part of why so many real-world AI products rely on it.