Back in March, I built a quick code review bot using the OpenAI API. It was a proof of concept. Over the summer, I turned it into a proper GitHub Action that runs on every PR. It's been running for two months now and I have real data on what works and what doesn't.
The Architecture
The GitHub Action triggers on pull_request events (opened, synchronize). It fetches the diff using the GitHub API, splits it by file, and sends each file's changes to GPT-4 with a carefully crafted system prompt. The responses come back as structured JSON with file paths, line numbers, severity levels, and comments. The action then posts these as inline PR review comments.
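The structured JSON step benefits from validation before anything gets posted. Here's a minimal sketch of what that might look like, with field names matching the description above; the real action's schema may differ:

```typescript
// Shape of the structured JSON the model is asked to return.
// Field names here (path, line, severity, comment) are illustrative.
interface ReviewComment {
  path: string;       // file path from the diff
  line: number;       // line number in the new version of the file
  severity: "low" | "medium" | "high";
  comment: string;
}

// Parse the model's raw response and keep only well-formed comments.
// Models occasionally return malformed entries, so validate each one.
function parseReviewComments(raw: string): ReviewComment[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(data)) return [];
  return data.filter(
    (c): c is ReviewComment =>
      typeof c === "object" && c !== null &&
      typeof (c as ReviewComment).path === "string" &&
      Number.isInteger((c as ReviewComment).line) &&
      ["low", "medium", "high"].includes((c as ReviewComment).severity) &&
      typeof (c as ReviewComment).comment === "string"
  );
}
```

Validating defensively matters here because a single malformed entry would otherwise crash the whole review run.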
The whole thing is about 400 lines of TypeScript. The action runs in under 2 minutes for most PRs and costs about $0.50-1.50 per review depending on the PR size.
The System Prompt
This is the most important part, and it went through about 15 iterations. The current version tells GPT-4 to focus exclusively on: bugs (logic errors, off-by-one errors, null references), security issues (injection, auth bypass, data exposure), performance problems (N+1 queries, unnecessary computation, memory leaks), and error handling gaps (unhandled exceptions, missing validation).
Critically, I tell it to ignore: code style, formatting, naming conventions, and minor improvements. Without this constraint, the bot comments on everything and the signal-to-noise ratio is terrible. You want your AI reviewer to be the developer who only speaks up when something actually matters.
I also include a line that says: "If you're not confident an issue is real, don't comment. False positives are worse than missed issues." This reduced the noise significantly.
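Putting those pieces together, a condensed version of this kind of prompt might look like the following. This is a paraphrase of the ideas described above, not the exact prompt the action uses:

```typescript
// Condensed sketch of a review-focused system prompt.
// The wording is illustrative; the real prompt went through ~15 iterations.
const SYSTEM_PROMPT = `You are a code reviewer. Review the diff and report only:
- Bugs: logic errors, off-by-one errors, null references
- Security issues: injection, auth bypass, data exposure
- Performance problems: N+1 queries, unnecessary computation, memory leaks
- Error handling gaps: unhandled exceptions, missing validation

Ignore code style, formatting, naming conventions, and minor improvements.
If you're not confident an issue is real, don't comment. False positives
are worse than missed issues.

Respond with a JSON array of objects:
{path, line, severity, comment, confidence}.`;
```

Note how the prompt spends as many words on what to ignore as on what to find; the exclusion list is what keeps the signal-to-noise ratio acceptable.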
What It Catches
Over two months and roughly 60 PRs, the bot has made about 180 comments. I categorized each one:
- Genuine bugs: 23%. Missing null checks, incorrect comparison operators, promises not being awaited, array index out of bounds. These are real issues that would have caused problems.
- Security concerns: 8%. A SQL query built with string concatenation, a JWT not being verified, user input not being sanitized. Lower volume but high value.
- Valid improvements: 35%. Better error handling, performance suggestions, edge case considerations. Not bugs, but genuinely useful suggestions.
- False positives: 20%. Flagging something that's actually fine, or suggesting a change that wouldn't improve anything. These are noise.
- Obvious/redundant: 14%. Pointing out things I was already aware of or that were handled elsewhere in the code.
So about 66% of comments are useful. That's pretty good for an automated tool. The 20% false positive rate is my main area for improvement.
The GitHub Action Setup
The action is self-contained. You need an OpenAI API key stored as a repository secret. The workflow YAML is straightforward: trigger on PR events, run the action, and it handles the rest. I open-sourced the action so you can use it in your own repos.
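For reference, a workflow along these lines would look roughly like this. The action name and input names below are placeholders, not the real published action:

```yaml
# Hypothetical workflow sketch; action and input names are illustrative.
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write   # required to post inline review comments

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: your-org/ai-review-action@v1   # placeholder name
        with:
          openai_api_key: ${{ secrets.OPENAI_API_KEY }}
```

The `pull-requests: write` permission is the detail people most often miss; without it, the action can read the diff but can't post review comments.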
One important detail: I set it to only review changed files, not the entire codebase. This keeps costs down and reviews focused. The downside is it can't catch issues that arise from how changed code interacts with unchanged code. That's a fundamental limitation of diff-based review.
Reducing False Positives
The biggest quality improvement came from adding a confidence threshold. In the system prompt, I ask GPT-4 to rate its confidence (1-5) for each comment. The action only posts comments rated 4 or 5. This filters out the "maybe this could be an issue" comments that are usually wrong.
I also added a file-type filter. The bot skips configuration files, lock files, migrations, and auto-generated code. These were a major source of useless comments.
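The two filters described above, confidence threshold and file-type skip list, are simple post-processing steps. A sketch, with illustrative thresholds and patterns:

```typescript
// Sketch of the two noise filters: drop low-confidence comments and
// skip file types that generate useless reviews. Patterns are illustrative.
interface RatedComment {
  path: string;
  line: number;
  comment: string;
  confidence: number; // 1-5, self-reported by the model
}

const SKIP_PATTERNS = [
  /\.lock$/, /package-lock\.json$/,              // lock files
  /migrations\//, /\.generated\./, /\.min\.js$/, // migrations, generated code
];

function shouldReviewFile(path: string): boolean {
  return !SKIP_PATTERNS.some((p) => p.test(path));
}

function filterComments(comments: RatedComment[]): RatedComment[] {
  // Only post comments the model rated 4 or 5.
  return comments.filter(
    (c) => c.confidence >= 4 && shouldReviewFile(c.path)
  );
}
```

The file filter is best applied before the API call (to save tokens), but running it again over the output is cheap insurance.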
Cost Analysis
Average PR review cost: about $0.80. My team opens roughly 15 PRs per week. Monthly cost: about $48. That's less than one hour of developer time, and the bot catches issues that would otherwise make it to production. The ROI is clear.
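The arithmetic above, spelled out with the post's rounded averages:

```typescript
// Monthly cost estimate from the figures in the post.
const costPerReview = 0.80; // dollars, average
const prsPerWeek = 15;
const weeksPerMonth = 4;

const monthlyCost = costPerReview * prsPerWeek * weeksPerMonth;
// roughly $48 per month
```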
I compared GPT-4 to GPT-3.5-turbo for this use case. GPT-3.5 was about 10x cheaper per review but the quality dropped significantly. More false positives, fewer real bugs caught, and the comments were less actionable. For code review specifically, GPT-4 is worth the premium.
What It Can't Replace
This doesn't replace human code review. It's a first pass. It catches the mechanical stuff: bugs, security holes, obvious performance issues. It can't evaluate architecture decisions, assess whether a feature meets requirements, or judge whether the code is maintainable in the context of the broader system.
Think of it as a really diligent linter that understands logic, not just syntax. It makes human reviews faster because the reviewer can focus on the higher-level concerns while the bot handles the detail work.
If you're a solo developer like me, this is especially valuable. It's the closest thing to having a second pair of eyes on every PR.