I built a thing. A code review bot that reads GitHub pull requests and leaves comments with suggestions. It took a weekend. And it taught me more about the OpenAI API than weeks of reading documentation would have.
The Idea
I wanted something practical. Not a chatbot wrapper, not a summarizer. Something that solves a real problem I have. As a solo developer, I don't have anyone reviewing my code. I wanted an AI that could look at my PRs and catch obvious issues, suggest improvements, and flag potential bugs.
Architecture
The setup is simple. A GitHub webhook fires when a PR is opened or updated. It hits a small Express server I'm running. The server fetches the diff, splits it into chunks (more on why in a second), sends each chunk to the OpenAI API with a system prompt that says "you are a senior code reviewer," and then posts the responses as PR comments using the GitHub API.
Total stack: Node.js, Express, the OpenAI Node SDK, and the Octokit GitHub library. Maybe 300 lines of actual code.
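The whole flow fits in one function. A sketch of what the webhook route does, with hypothetical helper names (fetchDiff, splitDiffIntoChunks, reviewChunk, postComment) standing in for the real Octokit and OpenAI calls, which aren't shown here:

```javascript
// Hypothetical stand-ins for the real Octokit/OpenAI helpers.
async function fetchDiff(pr) { return `diff --git a/${pr.file} ...`; }
function splitDiffIntoChunks(diff) { return [diff]; }
async function reviewChunk(chunk) { return `Reviewed: ${chunk.slice(0, 20)}`; }
async function postComment(pr, body) { /* octokit.issues.createComment(...) */ }

// What the Express route runs when GitHub delivers a pull_request event.
async function handlePullRequestEvent(event) {
  // Only review newly opened PRs and new pushes to existing ones.
  if (!["opened", "synchronize"].includes(event.action)) return 0;

  const diff = await fetchDiff(event.pull_request);
  let posted = 0;
  for (const chunk of splitDiffIntoChunks(diff)) {
    const review = await reviewChunk(chunk); // one OpenAI call per chunk
    if (review) {
      await postComment(event.pull_request, review);
      posted++;
    }
  }
  return posted;
}
```

In the real server this sits inside an `app.post("/webhook", ...)` Express route, with the HTTP response acknowledged early so GitHub doesn't time out the delivery while reviews are still running.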
The Token Limit Problem
This was the first real surprise. GPT-3.5-turbo has a 4K-token context window, and that budget covers the prompt and the response combined. A large PR diff can easily run 10K-20K tokens, so you can't just dump the whole thing into a single API call.
My solution: split the diff by file, then split large files into chunks of roughly 3K tokens (leaving room for the system prompt and response). Each chunk gets its own API call. This means the bot reviews each file independently, which is actually fine for most code review purposes.
Later I learned about the 16K context version of GPT-3.5-turbo. That would simplify things a lot. But chunking is still a useful pattern to understand because even 16K isn't enough for very large PRs.
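The chunking step looks roughly like this, assuming unified-diff input. The chars-divided-by-4 token estimate is a crude stand-in for a real tokenizer like tiktoken:

```javascript
const MAX_CHUNK_TOKENS = 3000; // leaves headroom for prompt + response in 4K

function estimateTokens(text) {
  return Math.ceil(text.length / 4); // rough heuristic: ~4 chars per token
}

function splitDiffIntoChunks(diff) {
  // First split by file: each file's section starts with "diff --git".
  const files = diff.split(/^(?=diff --git )/m).filter(Boolean);
  const chunks = [];
  for (const file of files) {
    if (estimateTokens(file) <= MAX_CHUNK_TOKENS) {
      chunks.push(file);
      continue;
    }
    // Oversized file: split on hunk headers ("@@ ... @@") and pack greedily.
    const hunks = file.split(/^(?=@@ )/m);
    let current = "";
    for (const hunk of hunks) {
      if (current && estimateTokens(current + hunk) > MAX_CHUNK_TOKENS) {
        chunks.push(current);
        current = "";
      }
      current += hunk;
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```

Splitting on file and hunk boundaries (rather than arbitrary character offsets) keeps each chunk syntactically coherent, which matters because the model only sees one chunk at a time.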
The System Prompt
This is where the magic happens. My first attempt was vague: "Review this code and suggest improvements." The results were generic, mentioning things like "consider adding error handling" on every single chunk.
My final system prompt is much more specific. It tells the model to focus on bugs, security issues, and performance problems. It tells it to ignore style preferences and formatting. It tells it to only comment when there's a genuine issue, not just to fill space. And it tells it to reference specific line numbers from the diff.
The difference in output quality between a lazy system prompt and a carefully crafted one is enormous. Prompt engineering isn't hype. It's a real skill and it matters.
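The post doesn't reproduce the exact prompt, but based on the constraints described above, it presumably looks something like this illustrative reconstruction:

```javascript
// Illustrative reconstruction of the prompt's constraints, not the original.
const SYSTEM_PROMPT = [
  "You are a senior code reviewer examining a pull request diff.",
  "Focus only on bugs, security issues, and performance problems.",
  "Ignore style preferences and formatting.",
  "If a chunk has no genuine issues, reply with exactly: NO ISSUES.",
  "When you do comment, reference the specific line numbers from the diff.",
].join("\n");

// Shape of the per-chunk chat completion request.
function buildRequest(chunk) {
  return {
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: chunk },
    ],
  };
}
```

Each chunk then goes out as `openai.createChatCompletion(buildRequest(chunk))` with the v3 Node SDK (or `openai.chat.completions.create(...)` in v4). The "NO ISSUES" sentinel gives the server something unambiguous to filter on before posting a comment.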
Costs
This was the pleasant surprise. GPT-3.5-turbo is cheap. Like, really cheap. My average PR generates maybe 5-8 API calls. Each call costs around $0.002-0.005 depending on the chunk size. A full PR review costs about 2-3 cents.
I review maybe 3-4 PRs a day. Monthly cost: under $3. I was prepared to spend $50-100/month and budgeted accordingly. The actual cost is basically a rounding error.
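Those numbers check out with quick midpoint arithmetic:

```javascript
// Back-of-the-envelope cost check using midpoints of the ranges above.
const callsPerPR = 6;       // midpoint of 5-8 calls
const costPerCall = 0.0035; // midpoint of $0.002-0.005
const prsPerDay = 3.5;      // midpoint of 3-4 PRs

const costPerPR = callsPerPR * costPerCall;     // ~ $0.021 (about 2 cents)
const monthlyCost = costPerPR * prsPerDay * 30; // ~ $2.21 per month
```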
I tested with GPT-4 too. The reviews were noticeably better, catching subtler issues and providing more contextual suggestions. But the cost jumped to about $0.50-1.00 per PR review. For a solo developer, that's still affordable. For a team running this on every PR, it adds up fast.
What It Actually Catches
In two weeks of use, it's caught: a missing null check that would have caused a crash in production, an SQL query that was vulnerable to injection (I was building the query string manually like an idiot), unused imports, a promise that wasn't being awaited, and several variable naming suggestions that were genuinely better than what I had.
What it doesn't catch: business logic errors, architectural issues, or anything that requires understanding the broader system context. It reviews code in isolation, which means it can't tell you "this approach conflicts with how the rest of the system works."
Lessons Learned
The OpenAI API is surprisingly easy to work with. If you've used any REST API before, you can build something useful in a day. The hard part isn't the integration. It's figuring out the right prompts, handling token limits gracefully, and building good error handling for when the API is slow or returns garbage.
Also, rate limits are real. The free tier is very restrictive. Even the paid tier has limits that you'll hit if you're making many parallel calls. Build in retry logic with exponential backoff from day one.
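A minimal retry wrapper with exponential backoff and jitter, assuming the API client throws on 429s and timeouts (a sketch, not the bot's actual code):

```javascript
// Retry an async call with exponential backoff plus jitter.
async function withRetry(fn, { retries = 5, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts: surface the error
      // Delays grow 500ms, 1s, 2s, 4s, ...; jitter avoids thundering herds.
      const delay = baseMs * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

In the bot, every OpenAI call would be wrapped, e.g. `withRetry(() => callOpenAI(chunk))`, with `callOpenAI` standing in for the real SDK call.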
GPT-3.5-turbo is the sweet spot for most applications. It's fast, cheap, and good enough. Start there and only upgrade to GPT-4 if you have a specific quality gap that justifies the 20x cost increase.
I'm hooked. This was my first AI-powered project and it definitely won't be my last.