OpenAI opened up fine-tuning for GPT-3.5-turbo a few months ago. The promise: train the model on your own data so it understands your codebase, your style, and your conventions. I spent a week testing this on a real project. The results were not what I expected.

The Setup

I have a Node.js API with about 50,000 lines of code. It follows specific patterns: a particular error handling style, a custom middleware pattern, specific database access patterns, and domain-specific naming conventions. When I use vanilla ChatGPT, I have to explain these patterns every time. I wanted a model that just knew them.

I created a training dataset of 200 examples. Each example was a prompt/completion pair: "Write a route handler for [description]" paired with the actual handler from my codebase. "Write a database query for [description]" paired with the actual query. I also included examples of our error handling pattern, middleware usage, and testing conventions.

Creating the training data was the most time-consuming part. It took about six hours to curate the examples, format them into the required JSONL format, and make sure they were representative of the codebase patterns.
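For reference, the format OpenAI expects for gpt-3.5-turbo fine-tuning is JSONL where each line is a complete chat: a system message, the user prompt, and the real completion pulled from the codebase. A minimal sketch of building that file; the project name, prompt, and handler content here are illustrative, not my actual dataset:

```python
import json

# One training example per line; each is a full conversation.
# All names/content below are illustrative placeholders.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You write code for the Acme API, following its conventions."},
            {"role": "user", "content": "Write a route handler for fetching a user's invoices."},
            {"role": "assistant", "content": "// the actual handler code from the codebase goes here"},
        ]
    },
]

with open("training.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The assistant message is the key part: it should be real code from your project, not an idealized answer, because that is exactly what the model will learn to imitate.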

The Training

Fine-tuning through the OpenAI API is surprisingly simple. Upload the file, start the training job, wait. My 200-example dataset trained in about 20 minutes. The cost was around $5 for the training run. Not bad at all.
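The whole pipeline is two API calls: upload the file, then create the job. A sketch using the openai Python client (v1 style); the polling comment and file name are mine:

```python
def launch_fine_tune(client, path="training.jsonl", base_model="gpt-3.5-turbo"):
    """Upload the JSONL dataset, then start a fine-tuning job on it."""
    with open(path, "rb") as f:
        upload = client.files.create(file=f, purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=upload.id, model=base_model)
    # Poll client.fine_tuning.jobs.retrieve(job.id) until status == "succeeded";
    # the finished job carries your custom model ID in job.fine_tuned_model.
    return job.id

# To run for real (requires the openai package and OPENAI_API_KEY set):
#   from openai import OpenAI
#   print(launch_fine_tune(OpenAI()))
```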

You get a custom model endpoint that you call the same way you'd call gpt-3.5-turbo, just with your model ID instead. Inference costs are slightly higher than the base model (about 1.5x) but still very affordable.
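Calling the custom model is the same chat completions call with the fine-tuned model ID swapped in. A sketch; the ID format shown is a made-up placeholder:

```python
def build_request(model_id, user_prompt):
    """Request body is identical in shape to a base gpt-3.5-turbo call."""
    return {
        "model": model_id,  # e.g. "ft:gpt-3.5-turbo:my-org::abc123" (placeholder)
        "messages": [{"role": "user", "content": user_prompt}],
    }

# Live call would be:
#   resp = client.chat.completions.create(**build_request(model_id, "Write a route handler for..."))
```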

What Improved

The fine-tuned model genuinely learned my codebase patterns. When I asked it to write a route handler, it used my project's error handling pattern without me having to explain it. It named variables the way I name them. It used my database access layer correctly. The code it generated looked like it belonged in my codebase.

For boilerplate generation, the fine-tuned model was significantly better than the base model. Instead of generating generic Express handlers and then having me customize them, it generated handlers that matched my project's style from the start. That saved real editing time.

The model also learned my project's domain terminology. The base model would ask for clarification on domain-specific terms. The fine-tuned model understood them in context.

What Didn't Improve

Reasoning ability was unchanged. The fine-tuned model was no better at debugging complex issues, designing new features, or solving algorithmic problems. Fine-tuning teaches the model your patterns and style. It doesn't make it smarter.

When I asked it to write something that deviated from the training examples, it fell back to generic patterns. Fine-tuning made it excellent at variations of what it had seen but didn't give it a deep understanding of my architecture. It learned the surface patterns, not the underlying design principles.

The model also occasionally merged patterns from different training examples in weird ways. A route handler would use the error handling pattern from one example but the response format from another, creating Frankenstein code that was technically valid but stylistically inconsistent.

Fine-Tuning vs Better Prompting

Here's the thing that surprised me. After the fine-tuning experiment, I went back to vanilla GPT-3.5-turbo and tried a different approach: a detailed system prompt that described my codebase patterns, plus a few code examples pasted in as context. No fine-tuning at all.
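Concretely, the prompt-based setup is just a detailed system message plus a few pasted examples sent ahead of each task. A sketch; every convention and name below is illustrative, not my actual prompt:

```python
# Illustrative conventions -- substitute your project's real patterns.
SYSTEM_PROMPT = """You write code for a Node.js API. Follow these conventions:
- Wrap route handlers in asyncHandler and throw ApiError on failure
- Access the database only through the repository layer
- Use camelCase names and the project's domain terms (e.g. 'ledgerEntry')
"""

# A couple of real code samples pasted in as few-shot context.
FEW_SHOT = [
    {"role": "user", "content": "Write a route handler for listing invoices."},
    {"role": "assistant", "content": "// a real handler pasted from the codebase"},
]

def build_messages(task):
    """Assemble the full message list for one generation request."""
    return [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
            {"role": "user", "content": task}]
```

Because the conventions live in plain text, changing a pattern is a one-line edit rather than a retraining run.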

The results were about 80% as good as the fine-tuned model. Not identical, but close enough that for most tasks, the difference was marginal. And the prompt-based approach has huge advantages: I can update it instantly when my patterns change, I don't need to retrain when the codebase evolves, and I can adjust the context for different types of tasks.

Fine-tuning locks in patterns at training time. If your codebase changes (and it will), you need to retrain. With prompting, you just update the prompt.

When Fine-Tuning Makes Sense

High-volume, consistent tasks. If you're making hundreds of API calls a day for the same type of generation and you need consistent style every time, fine-tuning reduces your prompt size (and thus cost) while maintaining quality. The per-request savings add up.
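The break-even is easy to sketch: fine-tuning lets you drop the long conventions prompt from every call, and the saved prompt tokens can outweigh the per-token premium. The rates below are placeholders (a single blended $/1K figure at roughly the 1.5x multiplier mentioned earlier), not current pricing:

```python
def cost_per_call(prompt_tokens, completion_tokens, rate_per_1k):
    """Cost of one request at a single blended $/1K-token rate (simplification)."""
    return (prompt_tokens + completion_tokens) / 1000 * rate_per_1k

BASE_RATE, FT_RATE = 0.002, 0.003  # placeholder $/1K tokens, ~1.5x premium
PROMPT_OVERHEAD = 1500             # tokens of conventions + pasted examples

base = cost_per_call(PROMPT_OVERHEAD + 100, 300, BASE_RATE)  # base model, long prompt
ft = cost_per_call(100, 300, FT_RATE)                        # fine-tuned, short prompt
# With these assumed numbers, the fine-tuned call is cheaper per request,
# so at high volume the savings compound.
```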

Specific output formats. If you need the model to always output a specific JSON structure or follow a particular template, fine-tuning is more reliable than prompting. The model learns the format as a default rather than following it as an instruction.

Domain-specific terminology. If your domain has lots of jargon or unconventional usage of common terms, fine-tuning helps the model understand without lengthy explanations.

When to Just Prompt Better

Most cases, honestly. If you're an individual developer using AI tools for your daily work, invest time in crafting good system prompts rather than fine-tuning. It's faster to iterate, free to experiment, and the results are close enough for most purposes.

Fine-tuning is a powerful tool, but it's not magic. It teaches style, not understanding. For most developers, the effort-to-benefit ratio of better prompting beats fine-tuning. Save fine-tuning for when prompting hits a clear ceiling and you need that last 20% of quality, or when your volume justifies the setup investment.