I spent a month building a RAG (Retrieval-Augmented Generation) system with LangChain. Then I threw it away and rebuilt it from scratch with just the OpenAI API and pgvector. The result is half the code, twice as fast, and actually debuggable. Here's why and how.

Why I ditched LangChain

LangChain is a popular framework for building LLM applications. It has abstractions for everything: document loaders, text splitters, vector stores, retrievers, chains. The problem is that these abstractions add complexity without adding proportional value for most use cases.

When my RAG system returned a bad answer, I had to trace through five layers of abstraction to figure out whether the problem was in chunking, embedding, retrieval, or generation. The debugging experience was miserable. Stack traces pointed to internal LangChain code. Configuration was spread across multiple chain objects. Simple changes required understanding the framework's class hierarchy.

The final straw was a bug where the retriever was returning irrelevant chunks. The fix was a one-line change to a similarity threshold, but finding where to make that change in LangChain's abstraction layers took me two hours. In my from-scratch version, it's a parameter in a SQL query that I can see and change directly.

The architecture

My RAG system has four components. That's it. A document processor that chunks text. An embedding function that calls OpenAI's embedding API. A PostgreSQL database with pgvector for storage and retrieval. A query function that retrieves relevant chunks and passes them to GPT-4 with a prompt.

No framework. No abstraction layers. Each component is a single file with a clear, testable interface. Total code: about 300 lines of Python.

Document processing

Chunking strategy matters more than most tutorials suggest. I split documents into chunks of 512 tokens with 50 tokens of overlap. The overlap ensures that concepts spanning chunk boundaries aren't lost. I experimented with 256, 512, and 1024 token chunks. 512 was the sweet spot for my use case (technical documentation). Smaller chunks retrieved more precisely but lost context. Larger chunks provided more context but diluted relevance.

Each chunk stores metadata: source document, section heading, position in document. This metadata is crucial for providing citations in the final response and for debugging retrieval quality.
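The chunking step can be sketched in a few lines. This is a minimal version, not the author's actual code: it approximates tokens with whitespace-delimited words (a real implementation would count model tokens with a tokenizer such as tiktoken), and the Chunk dataclass and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str    # source document
    heading: str   # section heading, if known
    position: int  # chunk index within the document

def chunk_document(text, source, heading="", chunk_size=512, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Tokens are approximated by whitespace-separated words here; swap in
    a real tokenizer (e.g. tiktoken) for accurate counts.
    """
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for pos, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append(Chunk(" ".join(window), source, heading, pos))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

With chunk_size=512 and overlap=50, each chunk shares its last 50 words with the start of the next one, so a concept straddling a boundary appears whole in at least one chunk.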

Embeddings and storage

I use OpenAI's text-embedding-3-small model. It's cheap ($0.02 per million tokens), fast, and produces 1536-dimensional vectors that work well with pgvector. I embed each chunk and store the vector alongside the text and metadata in PostgreSQL.
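The embedding step is one API call per batch. A hedged sketch, not the author's code: the batch size of 100 is an arbitrary illustrative choice, and the OpenAI import lives inside the function so the batching logic can be exercised without an API key.

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_texts(texts, model="text-embedding-3-small", batch_size=100):
    """Embed a list of strings with OpenAI's embeddings endpoint.

    Requests are batched so large document sets don't hit per-request
    input limits. Returns one 1536-dimensional vector per input string.
    """
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    vectors = []
    for batch in batched(texts, batch_size):
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```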

pgvector handles the similarity search natively. A query like SELECT * FROM chunks ORDER BY embedding <=> $1 LIMIT 5 returns the five most similar chunks using cosine distance. No separate vector database needed. No additional infrastructure to manage. If you're already running PostgreSQL, adding pgvector is a single CREATE EXTENSION statement.
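The schema and retrieval query might look like the following. This is a sketch under assumptions: the table layout and helper names are illustrative, and it assumes a psycopg2-style connection with %s placeholders. pgvector accepts vectors as bracketed string literals, which is what the small helper produces.

```python
def to_vector_literal(vec):
    """Format a Python list as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"

SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        serial PRIMARY KEY,
    text      text NOT NULL,
    source    text NOT NULL,
    heading   text,
    position  integer,
    embedding vector(1536)  -- text-embedding-3-small dimensionality
);
"""

def top_k_chunks(conn, query_embedding, k=5):
    """Return the k chunks nearest to the query embedding.

    The <=> operator is pgvector's cosine distance, so ORDER BY ascending
    distance puts the most similar chunks first.
    """
    with conn.cursor() as cur:
        cur.execute(
            "SELECT text, source, heading FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (to_vector_literal(query_embedding), k),
        )
        return cur.fetchall()
```

The similarity threshold mentioned earlier lives right here in plain sight: add a WHERE clause on the distance expression and it's a one-line change.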

Retrieval and generation

When a user asks a question, I embed the question using the same embedding model, retrieve the top 5 most similar chunks from pgvector, and construct a prompt that includes the retrieved context plus the user's question. The prompt template is straightforward:

Answer the question based on the provided context.
If the context doesn't contain enough information,
say so. Don't make up information.

Context:
{retrieved_chunks}

Question: {user_question}

The instruction to not make up information is critical. Without it, the model will happily generate plausible-sounding answers that aren't supported by your documents, which defeats the entire purpose of RAG.
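Assembling the prompt and calling the chat endpoint is the last step. A minimal sketch, assuming retrieved chunks arrive as (text, source, heading) rows and using a bracketed [source - heading] citation format that is my illustrative choice, not necessarily the author's:

```python
PROMPT_TEMPLATE = """Answer the question based on the provided context.
If the context doesn't contain enough information,
say so. Don't make up information.

Context:
{retrieved_chunks}

Question: {user_question}"""

def build_prompt(chunks, question):
    """Fill the template; chunks are (text, source, heading) rows."""
    context = "\n\n".join(
        f"[{source} - {heading}]\n{text}" for text, source, heading in chunks
    )
    return PROMPT_TEMPLATE.format(retrieved_chunks=context,
                                  user_question=question)

def answer(question, chunks, model="gpt-4"):
    """Send the assembled prompt to the chat completions endpoint."""
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(chunks, question)}],
    )
    return resp.choices[0].message.content
```

Labeling each chunk with its source and heading in the context is what makes citations in the answer possible: the model can refer back to the bracketed labels.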

What I learned

Retrieval quality is 80% of RAG quality. If you're retrieving the wrong chunks, no amount of prompt engineering will save you. Invest your time in chunking strategy, embedding quality, and retrieval tuning before you start tweaking the generation prompt.

Hybrid search (combining vector similarity with keyword matching) significantly improves results. I added a simple full-text search filter that requires at least one keyword from the query to appear in the retrieved chunk. This eliminated most irrelevant retrievals.
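The article doesn't show its exact filter, but one way to sketch the idea is Postgres full-text search with an OR'd tsquery, so any query keyword may match. The stopword list below is a tiny illustrative one; Postgres's own text-search dictionaries do this more thoroughly.

```python
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "does",
             "i", "of", "in", "to", "for", "and", "or"}  # illustrative only

def to_tsquery_string(question):
    """Turn a question into an OR'd tsquery: any keyword may match."""
    words = [w.strip("?,.!").lower() for w in question.split()]
    keywords = [w for w in words if w and w not in STOPWORDS]
    return " | ".join(keywords)

# Vector similarity ranks the results; the full-text WHERE clause drops
# chunks that share no keyword with the query.
HYBRID_SQL = """
SELECT text, source, heading
FROM chunks
WHERE to_tsvector('english', text) @@ to_tsquery('english', %s)
ORDER BY embedding <=> %s::vector
LIMIT %s
"""
```

Executed with (to_tsquery_string(question), to_vector_literal(question_embedding), 5), this keeps the vector ranking but refuses chunks with zero lexical overlap, which is exactly the "at least one keyword" behavior described above.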

You don't need a framework for this. The total API surface you need is: one embedding endpoint, one chat completion endpoint, and basic SQL. If you understand these three things, you can build a production-quality RAG system in a weekend.

When to use LangChain anyway

Use it if you're prototyping quickly and don't care about debugging, if you need complex chains with multiple LLM calls, tool use, and agent-like behavior, or if your team already knows LangChain well. The framework has its place. But for a focused RAG system, it's unnecessary overhead. Build it yourself, understand every line, and you'll have something you can actually maintain and improve.