16 essential AI design patterns — explained with clear diagrams, real Spring Boot microservice code, and real project scenarios. Understand how each pattern works, when to use it, and what to avoid.
Fetches relevant documents from a knowledge base before generating an answer — grounding the LLM in real data instead of potentially hallucinated training memory.
"I'm getting a NullPointerException when the LLM returns an empty response. How do I fix this and make my AI integration production-safe?"
RAG — Retrieval-Augmented Generation — solves the core limitation of LLMs: they only know what they were trained on. Training data has a cutoff date, doesn't include your private documents, and can be factually wrong on niche topics. RAG fixes this by making the LLM "look things up" before it answers.
Think of it like giving a doctor access to a live medical database before every patient consultation. Without RAG, the doctor answers from memory — potentially outdated or wrong. With RAG, the doctor reads the latest reference, then answers — grounded in real, current facts. That is exactly what happens in your Spring Boot app.
In technical terms: RAG converts the user question into a vector embedding, searches a pre-indexed vector database for semantically similar documents, injects those documents into the LLM prompt as context, and lets the LLM generate an answer grounded in retrieved data — not training memory.
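The four steps above can be sketched in plain Java. This is a minimal in-memory illustration, not the article's Spring Boot code: `NaiveRagSketch`, `embed`, `retrieve`, and `buildPrompt` are hypothetical names, and the "embedding" is a toy word-count vector standing in for a real embedding model and vector database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of retrieve-then-generate; names are illustrative only.
class NaiveRagSketch {

    // Toy "embedding": lowercase word counts. A real system would call an embedding model.
    static Map<String, Integer> embed(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) v.merge(w, 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity between two sparse word-count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Steps 1 and 2: embed the question, then return the k most similar documents.
    static List<String> retrieve(String question, List<String> docs, int k) {
        Map<String, Integer> q = embed(question);
        Map<String, Double> score = new HashMap<>();
        for (String d : docs) score.put(d, cosine(q, embed(d)));
        List<String> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((String d) -> -score.get(d)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Step 3: inject the retrieved documents into the prompt as context.
    // Step 4 (sending the prompt to an LLM) is left out of this sketch.
    static String buildPrompt(String question, List<String> context) {
        StringBuilder sb = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (String c : context) sb.append("- ").append(c).append('\n');
        return sb.append("\nQuestion: ").append(question).toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Office holiday schedule for the year",
            "To reset your password open account settings",
            "Shipping takes three business days");
        String question = "How do I reset my password?";
        System.out.println(buildPrompt(question, retrieve(question, docs, 1)));
    }
}
```

Swapping `embed` for a real embedding call and `docs` for a vector store query keeps the same shape: retrieve first, then build the prompt.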
Any system where the LLM needs access to private, recent, or domain-specific knowledge — customer support, internal docs Q&A, HR policy chatbots, legal research, medical reference, technical support.
The foundation. Embed the question → vector search → inject top 4 docs → LLM answers. Works for most questions. Limitation: pure vector search matches on meaning rather than exact tokens, so it can miss exact technical terms like "NullPointerException".
Combines vector search (semantic meaning) + BM25 keyword search (exact terms) + metadata filtering. Score-fused with Reciprocal Rank Fusion. Best for: technical docs with specific jargon — catches both semantic meaning AND exact class names.
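The score-fusion step can be illustrated on its own. The sketch below assumes the vector search and the BM25 search have each already produced a ranked list of document IDs, and applies Reciprocal Rank Fusion: each doc receives `1 / (k + rank)` from every list it appears in. `RrfSketch` and `fuse` are hypothetical names, `k = 60` is a commonly used constant rather than anything the source specifies, and metadata filtering is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Reciprocal Rank Fusion over already-ranked result lists.
class RrfSketch {

    // Each ranking lists doc IDs, best first. Ranks are 1-based inside the formula.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties keep first-seen order because the sort is stable.
        fused.sort(Comparator.comparingDouble((String id) -> -scores.get(id)));
        return fused;
    }

    public static void main(String[] args) {
        List<String> vectorHits = Arrays.asList("doc-semantic", "doc-both", "doc-other");
        List<String> keywordHits = Arrays.asList("doc-exact", "doc-both");
        System.out.println(fuse(Arrays.asList(vectorHits, keywordHits), 60));
    }
}
```

A document that appears in both lists, even at rank 2 in each, outranks a document that tops only one list. That is the behaviour hybrid search relies on.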
Generates 4 differently phrased versions of the question, searches with all of them in parallel, and deduplicates the results. Docs found by multiple phrasings score higher. Best for: vague questions where one phrasing misses relevant docs.
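A rough sketch of the search-and-merge half of this pattern follows. It assumes the rephrased questions already exist (in practice an LLM would generate them) and that `search` is whatever retrieval function the application uses. `MultiQuerySketch`, `retrieve`, and the hit-count scoring are illustrative assumptions, not the article's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: run several phrasings in parallel, dedupe, rank by hit count.
class MultiQuerySketch {

    static List<String> retrieve(List<String> variants, Function<String, List<String>> search) {
        // One asynchronous search per phrasing.
        List<CompletableFuture<List<String>>> futures = variants.stream()
            .map(v -> CompletableFuture.supplyAsync(() -> search.apply(v)))
            .collect(Collectors.toList());

        // Count how many phrasings returned each doc; the map keys deduplicate.
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (CompletableFuture<List<String>> f : futures) {
            for (String doc : new LinkedHashSet<>(f.join())) {
                hits.merge(doc, 1, Integer::sum);
            }
        }

        // Docs found by more phrasings come first.
        List<String> docs = new ArrayList<>(hits.keySet());
        docs.sort(Comparator.comparingInt((String d) -> -hits.get(d)));
        return docs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fakeIndex = new HashMap<>();
        fakeIndex.put("q1", Arrays.asList("a", "b"));
        fakeIndex.put("q2", Arrays.asList("b", "c"));
        fakeIndex.put("q3", Arrays.asList("b", "a"));
        System.out.println(retrieve(Arrays.asList("q1", "q2", "q3"), fakeIndex::get));
    }
}
```

Joining the futures in their original order keeps the result deterministic even though the searches themselves run concurrently.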
Asks the LLM to imagine a perfect answer first, then uses that richer hypothetical answer as the search vector. A 4-word question gives the embedding model little to work with; a 200-word imagined answer resembles the target documents and usually matches them more closely. Best for: short or vague queries.
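This approach is often called HyDE (Hypothetical Document Embeddings). A minimal sketch, assuming the LLM and the embedding model are passed in as plain functions: `HydeSketch`, `searchVector`, the prompt wording, and the fallback to the original question when the LLM returns nothing are all illustrative assumptions.

```java
import java.util.function.Function;

// Hypothetical sketch: embed an imagined answer instead of the raw question.
class HydeSketch {

    static double[] searchVector(String question,
                                 Function<String, String> llm,
                                 Function<String, double[]> embedder) {
        String hypothetical = llm.apply(
            "Write a short, plausible passage that answers this question:\n" + question);
        // Assumption: if the LLM returns null or blank text, fall back to the question itself.
        String textToEmbed = (hypothetical == null || hypothetical.trim().isEmpty())
            ? question
            : hypothetical;
        return embedder.apply(textToEmbed);
    }

    public static void main(String[] args) {
        // Stand-ins for a real LLM client and embedding model.
        Function<String, String> fakeLlm = p -> "Check the response for null before reading its content.";
        Function<String, double[]> fakeEmbedder = t -> new double[] { t.length() };
        System.out.println(searchVector("Fix NPE?", fakeLlm, fakeEmbedder)[0]);
    }
}
```

The returned vector is then used for the vector search in place of the question's own embedding; the hypothetical answer itself is discarded and never shown to the user.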
Indexes small chunks (precise search) but returns the full parent section (rich context). Like using a book index to find a page, then reading the full chapter. Best for: long docs where a 2-sentence snippet lacks the surrounding code examples.
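The small-chunk, full-parent idea can be sketched with two in-memory structures: a list of child chunks that is searched, and a map from parent ID to the full section that is returned. Everything here is an assumption for illustration: `ParentDocSketch`, fixed-size word chunks, and a word-overlap score standing in for vector similarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: search small child chunks, return the full parent section.
class ParentDocSketch {
    private final Map<String, String> parentById = new HashMap<>();
    private final List<String[]> chunks = new ArrayList<>(); // each entry: {chunkText, parentId}

    // Store the full section and index it as chunks of wordsPerChunk words.
    void index(String parentId, String parentText, int wordsPerChunk) {
        parentById.put(parentId, parentText);
        String[] words = parentText.trim().split("\\s+");
        for (int i = 0; i < words.length; i += wordsPerChunk) {
            int end = Math.min(i + wordsPerChunk, words.length);
            chunks.add(new String[] { String.join(" ", Arrays.copyOfRange(words, i, end)), parentId });
        }
    }

    // Toy relevance: number of distinct query words present in the chunk.
    private static int overlap(String query, String chunk) {
        Set<String> chunkWords = new HashSet<>(Arrays.asList(chunk.toLowerCase().split("\\W+")));
        int n = 0;
        for (String w : new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")))) {
            if (!w.isEmpty() && chunkWords.contains(w)) n++;
        }
        return n;
    }

    // Find the best-matching child chunk, then hand back its whole parent.
    Optional<String> retrieve(String query) {
        String bestParent = null;
        int best = 0;
        for (String[] c : chunks) {
            int score = overlap(query, c[0]);
            if (score > best) { best = score; bestParent = c[1]; }
        }
        return Optional.ofNullable(bestParent).map(parentById::get);
    }

    public static void main(String[] args) {
        ParentDocSketch store = new ParentDocSketch();
        store.index("retry", "Configure retries with a backoff policy. "
            + "Example: set maxAttempts to three and delay to two seconds.", 6);
        System.out.println(store.retrieve("maxAttempts").orElse("no match"));
    }
}
```

The query matches a six-word chunk, but the caller receives the entire section, including the surrounding sentences the chunk alone would have lost.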
After retrieving docs, asks the LLM to extract ONLY the sentences that directly answer the question. Removes noise and can reduce token cost by 70% or more, depending on how much of each retrieved doc is relevant. Best for: cost reduction and sharper, focused answers.
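To show the shape of the compression step without calling a model, the sketch below swaps the LLM extraction for a simple keyword heuristic: it keeps only sentences that share a word longer than three characters with the question. This is a deliberate stand-in, not the LLM-based extraction described above, and `CompressionSketch` and `compress` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: keyword heuristic standing in for LLM-based sentence extraction.
class CompressionSketch {

    static String compress(String question, String doc) {
        // Collect question words longer than three characters, ignoring short filler words.
        Set<String> questionWords = new HashSet<>();
        for (String w : question.toLowerCase().split("\\W+")) {
            if (w.length() > 3) questionWords.add(w);
        }
        StringBuilder kept = new StringBuilder();
        // Split after sentence-ending punctuation followed by whitespace.
        for (String sentence : doc.split("(?<=[.!?])\\s+")) {
            for (String w : sentence.toLowerCase().split("\\W+")) {
                if (questionWords.contains(w)) {
                    if (kept.length() > 0) kept.append(' ');
                    kept.append(sentence);
                    break; // keep each matching sentence once
                }
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String doc = "The service was deployed last spring. "
            + "Check for an empty response before calling getContent. "
            + "The team meets on Mondays.";
        System.out.println(compress("How do I handle an empty response?", doc));
    }
}
```

Only the compressed text is placed in the prompt, which is where the token saving comes from; a real implementation would replace the heuristic with an LLM call that returns the relevant sentences.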
Retrieves 20 candidates (wide net), then uses a cross-encoder model (such as Cohere Rerank) to re-score each doc specifically against the question. Vector similarity is only a proxy for usefulness; re-ranking scores the question and each doc together, which surfaces what is actually most relevant. Best for: high-stakes answers where quality matters most.
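The two-stage flow can be sketched with the re-ranker passed in as a scoring function. No real reranking API is called here: `RerankSketch`, `rerank`, and the `scorer` parameter are hypothetical, and the scorer in `main` is a trivial stand-in for a cross-encoder that would score each (question, doc) pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch: re-score a wide candidate set, keep only the best few.
class RerankSketch {

    static List<String> rerank(String question,
                               List<String> candidates,
                               ToDoubleBiFunction<String, String> scorer,
                               int topN) {
        // Score each candidate once; a real cross-encoder call is expensive.
        Map<String, Double> score = new HashMap<>();
        for (String doc : candidates) {
            score.put(doc, scorer.applyAsDouble(question, doc));
        }
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((String d) -> -score.get(d)));
        return new ArrayList<>(sorted.subList(0, Math.min(topN, sorted.size())));
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
            "General error handling guidelines",
            "NullPointerException when the response is empty",
            "Logging configuration");
        // Stand-in scorer: a real one would send each pair to a re-ranking model.
        ToDoubleBiFunction<String, String> fakeScorer =
            (q, d) -> d.contains("NullPointerException") ? 1.0 : 0.0;
        System.out.println(rerank("Why do I get a NullPointerException?", candidates, fakeScorer, 1));
    }
}
```

In the pattern described above, `candidates` would hold the 20 vector-search hits and `topN` the handful of docs that actually go into the prompt.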
The LLM decides what to search, when to search again, and when it has enough to answer. Uses multiple tools: searchDocs, searchCodeExamples, searchGitHubIssues. Loops until confident. Best for: complex multi-part questions requiring several knowledge sources.
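The control loop behind this pattern can be sketched as follows. The planner stands in for the LLM's decision making, and the tools are plain functions keyed by the names mentioned above. `AgenticRagSketch`, `Step`, `gather`, and the `maxSteps` safety cap are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch: loop over tool calls until the planner decides it has enough.
class AgenticRagSketch {

    // One planner decision: call `tool` with `query`, or stop when tool is null.
    static class Step {
        final String tool;
        final String query;
        Step(String tool, String query) { this.tool = tool; this.query = query; }
    }

    static List<String> gather(String question,
                               Map<String, Function<String, List<String>>> tools,
                               BiFunction<String, List<String>, Step> planner,
                               int maxSteps) {
        List<String> context = new ArrayList<>();
        for (int i = 0; i < maxSteps; i++) { // cap the loop so it always terminates
            Step step = planner.apply(question, context);
            if (step == null || step.tool == null) break; // planner has enough context
            Function<String, List<String>> tool = tools.get(step.tool);
            if (tool == null) break; // unknown tool name: stop rather than guess
            context.addAll(tool.apply(step.query));
        }
        return context;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> tools = new HashMap<>();
        tools.put("searchDocs", q -> Arrays.asList("doc: " + q));
        tools.put("searchGitHubIssues", q -> Arrays.asList("issue: " + q));
        // Stand-in planner: search docs first, then issues, then stop.
        BiFunction<String, List<String>, Step> fakePlanner = (q, ctx) ->
            ctx.isEmpty() ? new Step("searchDocs", q)
            : ctx.size() == 1 ? new Step("searchGitHubIssues", q)
            : new Step(null, null);
        System.out.println(gather("empty response NPE", tools, fakePlanner, 5));
    }
}
```

The gathered context is then injected into the final prompt exactly as in basic RAG; the difference is that the model, not fixed code, chose which sources to query and when to stop.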