Every week someone DMs me asking how to "build their own AI model." Most of them shouldn't. I don't mean that in a gatekeeping way. I mean that what they actually want - a chatbot that knows their data, an app that generates specific outputs, a tool that writes in their style - doesn't require building a model at all. They need an API call. Maybe some fine-tuning. Almost never training from scratch.
I've shipped all four approaches in production over the past two years. API wrappers, RAG pipelines, fine-tuned models, and yes, one painful attempt at pre-training a small model that I abandoned after burning $4,200 in cloud GPU credits. Here's what I actually recommend as of April 2026, with real costs, real tools, and zero hand-waving.
Path 1: Don't Build (Just Use APIs)
This is the right answer for about 90% of people reading this article. It's not the exciting answer. I know. You want to train something. But hear me out.
As of this month, the best frontier models are available through simple API calls. Claude Opus 4.6 at $15/$75 per million tokens. GPT-5.4 at similar pricing. Claude Sonnet 4.6 at $3/$15 if you want something cheaper and faster. Gemini 3.1 Pro at $2/$12 with aggressive caching discounts. These models are better than anything you could train yourself unless you're spending eight figures on compute.
Here's a complete Claude API call in Python. This is your "AI model":
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment
error_msg = "TypeError: 'NoneType' object is not subscriptable"  # or whatever you're debugging

message = client.messages.create(
    model="claude-sonnet-4-6-20260401",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain this error: " + error_msg}]
)
print(message.content[0].text)
A handful of lines. That's it. You now have access to a model that scores higher than any fine-tune you could make on most benchmarks. Wrap this in a Flask endpoint and you've got an AI-powered app.
When API calls are the right choice: prototyping, MVPs, apps with under 100K daily users, any situation where you don't have proprietary training data that would actually improve the model. I wrote more about the tools I pay for in my AI tools breakdown.
When API calls break down: data privacy (you can't send medical records to OpenAI), strict latency requirements (API calls add 200-500ms network overhead), or cost at scale. That last one is real. I had a client last month whose OpenAI API bill hit $8,400/month. At that point, you should be looking at open-source models running on your own hardware.
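That $8,400 bill is easy to reproduce on a napkin. Here's a quick cost function using the per-token prices from this article; the traffic numbers below are made up for illustration:

```python
def monthly_api_cost(requests_per_day, in_tokens, out_tokens,
                     in_price_per_m, out_price_per_m, days=30):
    """Rough monthly bill for a single-model API workload.
    Prices are dollars per million tokens, split input/output."""
    per_call = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return requests_per_day * per_call * days

# Claude Sonnet 4.6 at $3 in / $15 out, 2K-token prompts, 500-token replies,
# 10K requests a day -- a plausible mid-size product:
cost = monthly_api_cost(10_000, 2_000, 500, 3.0, 15.0)  # about $4,050/month
```

Double the traffic or the context length and you're in that client's territory fast.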
But for most people? Just use the API. The model is someone else's problem. You focus on the product.
Path 2: Augment with RAG (Your Data, Their Model)
RAG - Retrieval Augmented Generation - is the answer when people say "I want the AI to know my data." You don't train anything. You take your documents, chop them into chunks, embed them as vectors, store them in a vector database, and then when a user asks a question, you fetch the relevant chunks and pass them to a frontier model as context. The model reads your data at query time.
I built a detailed walkthrough of this in my RAG from scratch guide, so I won't repeat the full code here. But the high-level pipeline looks like this:
# 1. Chunk your documents
chunks = split_text(document, chunk_size=512, overlap=50)

# 2. Embed each chunk and store it in the vector DB
#    (db, embed, and llm are stand-ins for your actual clients)
for chunk in chunks:
    embedding = openai.embeddings.create(
        model="text-embedding-3-small", input=chunk
    ).data[0].embedding
    db.insert(chunk, embedding)

# 3. At query time: embed the question, retrieve similar chunks
query_embedding = embed(user_question)
relevant_chunks = db.similarity_search(query_embedding, top_k=5)

# 4. Pass chunks + question to the LLM
response = llm.generate(context=relevant_chunks, question=user_question)
When RAG is right: internal knowledge bases, customer support bots, anything where the model needs to reference specific documents. Legal, medical, financial - any domain where the answers live in your own corpus. A friend of mine runs a RAG system over 40,000 internal wiki pages for a 200-person company. It replaced their terrible Confluence search overnight.
The tools, as of April 2026:
- Embedding models: OpenAI's text-embedding-3-small ($0.02/M tokens), Cohere's embed-v4, or open-source models like BGE-M3 (free, run locally)
- Vector databases: pgvector (free, just a Postgres extension), ChromaDB (free, local), Pinecone ($0-70/month depending on scale), Weaviate
- Frameworks: LlamaIndex, LangChain (I prefer LlamaIndex - less abstraction overhead)
Total cost for a basic RAG system: embedding model ($0.02-0.10/M tokens) + vector DB ($0-50/month) + inference model ($3-15/M tokens). For a small internal tool, you're looking at $20-80/month total. Not bad.
Here's the trap though. RAG is extremely easy to demo and extremely hard to make reliable. Your chunking strategy matters more than your model choice. Chunk too small and you lose context. Chunk too big and you dilute relevance with noise. I've spent more hours debugging chunking logic than I'd like to admit. And the retrieval quality problem - where the system pulls back irrelevant chunks - is where most RAG projects silently fail. The demo works great on 10 test questions. Then real users ask things you didn't anticipate and the whole thing falls apart.
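For intuition on what the retrieval step is actually doing: similarity_search is just cosine similarity over stored vectors. A brute-force sketch, fine for small corpora (real vector DBs use approximate-nearest-neighbor indexes for this):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_search(query_emb, stored, top_k=5):
    """stored: list of (chunk_text, embedding) pairs.
    Returns the top_k chunks most similar to the query embedding."""
    ranked = sorted(stored, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

When retrieval "silently fails," it's usually not this math - it's that the query embedding and the relevant chunk's embedding just aren't close, often because of bad chunking.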
Still, for the cost and complexity, RAG is the best bang-for-buck approach when you have proprietary data. Don't skip it and jump straight to fine-tuning.
Path 3: Fine-Tune an Existing Model
Fine-tuning means taking a pre-trained open-source model and training it further on your data. You're not building a model from zero. You're taking something like Qwen 3.6 or Llama 4 and teaching it your specific domain, style, or output format. Think of it as the difference between hiring a smart generalist and training them on your company's processes.
I wrote about whether fine-tuning is worth it back in 2023 when GPT-3.5 fine-tuning launched. The answer has changed a lot since then. Back then the base models were weaker, so fine-tuning gave you a bigger lift. Now the base models are so good that fine-tuning is more about specialization than improvement.
When fine-tuning makes sense in 2026:
- You need a specific output voice, format, or style that prompting can't reliably produce
- You have domain expertise that isn't well-represented in the base model's training data
- You're spending over $5K/month on API calls and a self-hosted fine-tuned model would be cheaper
- Data privacy - you can't send data to third-party APIs
Base models worth considering right now (April 2026):
- Qwen 3.6-35B-A3B - just shipped this week. 35B total parameters but only 3B active (mixture of experts), so it runs fast on modest hardware. Apache 2.0 license. This is my current pick for fine-tuning projects.
- Llama 4 Scout - Meta's latest. 10M token context window. Open weights. Good all-around base, though the licensing is more restrictive than Qwen.
- Mistral models - Mistral Small and Medium are solid mid-range options. Good if you need European data compliance.
- Gemma 4 - Google's open models. Smaller (2B-27B), good for edge deployment or when you need something lean.
I keep an updated comparison of these in my April 2026 model rankings and you can check BenchGecko if you want to compare inference costs across providers before committing.
The fine-tuning tools:
- Unsloth - the fastest LoRA fine-tuning library right now. Free tier works on Google Colab. I've used this for 3 fine-tuning projects and it's absurdly fast. 2x faster than the alternatives on the same hardware.
- HuggingFace TRL - the standard library. More flexible, more verbose. Good when you need fine-grained control.
- Axolotl - config-driven fine-tuning. You write a YAML file instead of code. Nice for teams.
- together.ai fine-tuning API - upload your data, click a button. ~$2-5 per million training tokens. Zero GPU management.
Cost reality: An A100 GPU on RunPod or Lambda costs $1.50-2.00/hour. A decent LoRA fine-tune on a 7B-35B model takes 2-8 hours depending on dataset size. So you're looking at $3-16 per experiment. That's it. Not thousands. Not millions. Three to sixteen dollars. You can also do it for free on Google Colab with Unsloth if your dataset is small enough (under 1000 examples, and you're patient with the slower T4 GPU).
Here's a minimal fine-tuning script with Unsloth:
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.6-35B-A3B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters to the attention projections (a common minimal choice)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Train on your dataset; epoch count and sequence length go in SFTConfig
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=load_dataset("json", data_files="my_data.jsonl")["train"],
    args=SFTConfig(max_seq_length=2048, num_train_epochs=3, output_dir="outputs"),
)
trainer.train()
model.save_pretrained("my-fine-tuned-model")
That's a complete fine-tuning script. The hard part isn't the code - it's the dataset.
LoRA vs full fine-tuning: LoRA (Low-Rank Adaptation) freezes most of the model's weights and only trains small adapter layers. It gives you 90-95% of the quality of full fine-tuning at roughly 5% of the compute cost. Always start with LoRA. I haven't done a full fine-tune in over a year because LoRA just works well enough.
Dataset size: 500-5000 high-quality examples is the sweet spot. I ran experiments last month with 200, 500, 1000, and 5000 examples on the same task. 200 was too few - the model overfit to the examples and got brittle. 500 was the inflection point where quality jumped. Going from 1000 to 5000 gave diminishing returns. Quality of examples beats quantity every time. 500 carefully curated examples will outperform 10,000 sloppy ones.
The trap with fine-tuning: garbage in, garbage out. If your training data has inconsistent formatting, factual errors, or sloppy examples, your fine-tuned model will faithfully reproduce all of that garbage. It amplifies whatever you feed it. I've seen people fine-tune on GPT-generated synthetic data and end up with a model that's strictly worse than just calling GPT directly. Don't do that. Use real data. Curate it manually. It's tedious but it's the whole game.
Path 4: Train From Scratch (You Probably Shouldn't)
I'm including this section because people keep asking. But I'll be blunt: if you're reading a blog post about how to build an AI model, training from scratch is not for you. It's not for me either. I tried it once and regret it.
Training a language model from scratch requires: a massive dataset (trillions of tokens), a GPU cluster (thousands of A100s or H100s running for weeks), a team of ML engineers who understand distributed training, and a budget measured in millions. GPT-4's training run reportedly cost around $100M. Claude's training runs are in a similar range. Even a "small" 1B parameter model trained from scratch will cost $10K-50K in compute and take days on a multi-GPU setup.
The only legitimate reasons to train from scratch:
- You're a research lab exploring new architectures
- You have a dataset in a language or domain so niche that no existing model covers it (think: a specific programming language that didn't exist when current models were trained)
- You need an architecture that's structurally different from transformers
If you just want to learn how training works - which is a valid goal - use Andrej Karpathy's nanoGPT. It trains a tiny GPT-2 scale model on Shakespeare or whatever corpus you give it. Great for understanding the mechanics. Terrible for anything production.
# Train a small GPT on Shakespeare with nanoGPT
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
Two commands. You'll have a character-level model that generates vaguely Shakespeare-sounding text in about 15 minutes on a single GPU. It's a toy. But it's a toy that teaches you everything about tokenization, attention, training loops, and loss curves. I learned more from spending a weekend with nanoGPT than from any ML course I took.
If you insist on going further, the modern starting point is the GPT-NeoX or Megatron-LM frameworks for distributed training. But you need at least 8 A100s to do anything meaningful, and even then you're training models that are worse than open-source alternatives you could just download for free.
Save your money. Fine-tune.
Running Your Model in Production
Okay, so you've fine-tuned a model or you're running an open-source base model. Now you need to serve it. This is where a lot of tutorials stop and a lot of engineers get stuck.
Inference engines (pick one):
- vLLM - the fastest open-source inference engine. PagedAttention, continuous batching, tensor parallelism. If you're serving more than a few hundred requests per day, use vLLM. I wrote about local model serving in my Ollama guide, but vLLM is what you want for production.
- Ollama - the easiest to set up. `ollama run qwen3.6` and you're serving a model locally. Great for development, not ideal for high-traffic production.
- llama.cpp - CPU inference using GGUF quantized models. Surprisingly fast for smaller models. Good when you don't have GPU access.
- HuggingFace TGI - Text Generation Inference. Solid middle ground between vLLM and Ollama.
Where to host:
- RunPod - $0.39/hr for an A40 GPU. Best value for inference right now. I've been using them since January.
- Lambda - $1.10/hr for an A100. More expensive but more reliable. Better for production workloads.
- together.ai inference API - managed hosting, you pay per token. No server management. Good if you don't want to deal with infrastructure.
- Your own machine - if you have a 24GB+ GPU (RTX 4090, for example), you can serve a 7B-13B model locally for zero ongoing cost
My rule of thumb: if you're serving under 1,000 requests per day, just use a managed API (together.ai, Fireworks, or even the original model provider). The infrastructure overhead of self-hosting isn't worth it at that scale. If you're over 10,000 requests per day, self-hosting on RunPod or Lambda will save you significant money versus paying per-token API prices.
The middle ground (1K-10K req/day) depends on your model size and latency requirements. In my experience, output quality between API-hosted and self-hosted deployments is identical when you use the same model weights - the difference is pure infrastructure.
Mohit's Decision Tree
After two years of doing this, here's my actual decision process when starting a new AI project:
Step 1: Do you need custom behavior beyond what prompting can do?
No - Use API calls. Path 1. Stop here. Seriously. Write a good system prompt, use Claude or GPT, and build your product. Don't overthink it.
Step 2: Is your proprietary data the differentiator?
Yes, and the base model's reasoning is fine - Use RAG. Path 2. Your data goes into a vector database, the model reads it at query time. No training needed.
Step 3: Do you need a specific voice, format, or domain expertise baked into the model?
Yes - Fine-tune. Path 3. Pick Qwen 3.6 or Llama 4 as your base, collect 500-2000 examples, run LoRA fine-tuning with Unsloth, deploy on RunPod or together.ai.
Step 4: None of the above work?
Are you sure? Go back and try harder with Paths 1-3. In my experience, one of those three solves every problem I've encountered. If you still think you need to train from scratch, you're probably either doing cutting-edge research (in which case you don't need this blog post) or you're underestimating what fine-tuning can do.
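If you like your decision trees executable, the four steps collapse to a toy function - the booleans are an obvious simplification, but the ordering is the whole point:

```python
def choose_path(needs_custom_behavior, data_is_differentiator, needs_baked_in_style):
    """The decision tree above, in order. Earlier answers win."""
    if not needs_custom_behavior:
        return "Path 1: API calls"
    if data_is_differentiator:
        return "Path 2: RAG"
    if needs_baked_in_style:
        return "Path 3: fine-tune"
    return "Go back and try Paths 1-3 harder"
```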
My current split across all projects: roughly 70% API calls, 20% RAG, 10% fine-tuned models, 0% training from scratch. The 70% API figure would surprise most people, but these models are so good now that custom training is the exception, not the rule. I covered this same philosophy in my developer tools guide - reach for the simplest tool that solves your problem. And check how to read benchmarks before you pick a base model, because half the numbers you see on Twitter are nonsense.