I've been running LLMs locally for about three months now. Not as a replacement for Claude or GPT-4, but as a complement for specific use cases where local inference makes more sense than API calls. Ollama makes this ridiculously easy. Here's everything I've learned.
Why run models locally
Three reasons. First, privacy. Some code and data shouldn't leave your machine. Client projects with NDAs, proprietary algorithms, sensitive business logic. Running a local model means zero data transmission to third parties. Second, cost. If you're making hundreds of API calls per day for simple tasks (commit messages, docstrings, variable naming), local inference is essentially free after the initial setup. Third, latency. For small models on decent hardware, local inference is faster than waiting for an API round trip.
Setup with Ollama
Ollama is a single binary that manages model downloads, quantization, and serving. Install it, run ollama pull llama3, and you have a running LLM in under five minutes. No Python environments, no CUDA driver nightmares, no dependency hell. It just works.
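The whole flow fits in a few commands. A minimal sketch, wrapped in a function for reuse and assuming the ollama binary is already installed (llama3 is the registry name used throughout this post):

```shell
# One-time setup: download a model, talk to it, see what's on disk.
setup_demo() {
  command -v ollama >/dev/null || { echo "install ollama first" >&2; return 1; }
  ollama pull llama3                            # downloads the weights (several GB)
  ollama run llama3 "Say hello in five words."  # one-shot prompt
  ollama list                                   # models currently on disk
}
```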
The models I run regularly:
- Llama 3 8B for general-purpose tasks. Fast, good quality for its size, handles most simple coding questions well.
- Llama 3 70B for more complex reasoning. Needs 40GB+ RAM. Noticeably slower but the quality jump over 8B is significant.
- CodeLlama 34B for code-specific tasks. Better at code generation than general Llama models of similar size.
- Mistral 7B as a fast alternative to Llama 3 8B. Slightly different strengths, worth having both available.
Performance on my machine
I'm running a MacBook Pro M3 Max with 64GB RAM. Here are my real-world numbers:
- Llama 3 8B: 40-50 tokens per second. Fast enough to feel instant for short responses. A typical function generation takes 2-3 seconds.
- Llama 3 70B: 8-12 tokens per second. Usable, but you feel the wait on longer responses. A complex explanation takes 15-20 seconds.
- CodeLlama 34B: 15-20 tokens per second. A good sweet spot of speed and quality for coding tasks.
If you're on an older Mac or a machine with 16GB RAM, stick with the 7B/8B models. The larger models will either not fit in memory or swap constantly and become painfully slow.
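To get comparable numbers on your own machine, ollama run has a --verbose flag that prints eval counts and a tokens-per-second "eval rate" after each response (flag behavior per the current Ollama CLI; treat it as an assumption on older builds). A small helper:

```shell
# Rough throughput check: fixed prompt, then read the "eval rate" line
# that ollama prints in --verbose mode.
llm_bench() {
  local model="${1:-llama3}"   # pass any model name you've pulled
  ollama run "$model" --verbose \
    "Explain what a hash map is in exactly three sentences."
}
```

Run it against each model you have pulled (llm_bench llama3, llm_bench codellama:34b) to see where your hardware lands.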
When local beats the API
Commit message generation. I have a git hook that runs the diff through Llama 3 8B and suggests a commit message. This happens in under a second, no API call needed, no cost, no latency. I accept the suggestion about 70% of the time and edit the rest.
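One way to wire up the core of that hook (the prompt wording and function name are illustrative, not my exact script; it assumes ollama is on PATH with llama3 pulled):

```shell
# Illustrative commit-message helper: feed the staged diff to a local
# model and print a one-line suggestion.
suggest_commit_msg() {
  local diff
  diff="$(git diff --cached)"
  [ -n "$diff" ] || { echo "nothing staged" >&2; return 1; }
  ollama run llama3 "Write a one-line git commit message for this diff, no quotes:

$diff"
}
# e.g. call from a prepare-commit-msg hook and write the output into "$1"
```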
Docstring generation. Select a function, pipe it to Ollama, get a docstring back. Again, faster than an API round trip and good enough for this simple task. You don't need GPT-4 intelligence to write "returns the sum of two integers."
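The docstring pipeline is one line once you have a helper. A sketch, where stdin carries the function body and gets folded into the prompt (names and prompt wording are illustrative):

```shell
# Illustrative: read a function from stdin, ask the local model for a
# docstring. The prompt prefix keeps the output to just the docstring.
gen_docstring() {
  ollama run llama3 "Write a concise docstring for this function. Output only the docstring:

$(cat)"
}
# usage: pbpaste | gen_docstring    (or: gen_docstring < my_func.py)
```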
Code explanation for personal notes. When I'm reading unfamiliar code and want a quick summary, a local model is perfect. The quality doesn't need to be publication-ready, it just needs to help me understand the code faster.
Sensitive code review. When working on client projects with strict data handling requirements, I can still get AI assistance without sending their code to external servers.
When the API still wins
Anything requiring strong reasoning. Architecture decisions, complex debugging, nuanced code review. The quality gap between local 8B models and Claude/GPT-4 is enormous for these tasks. Don't try to save $0.02 on an API call when the answer quality matters.
Long context tasks. Local models typically support 4K-8K context windows. Claude handles 200K. If you need to analyze multiple files together, local models can't compete.
Multi-step conversations. Local models lose coherence quickly in back-and-forth debugging sessions. The larger API models maintain context and reasoning quality across many turns.
My integration setup
I use Ollama's API endpoint (localhost:11434) with a few custom scripts. A shell function ai that pipes stdin to Ollama and prints the response. A Raycast extension for quick queries. A VS Code extension that uses the local model for inline completions when I'm working on sensitive projects.
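A minimal version of that ai function, hitting Ollama's documented /api/generate endpoint directly. It assumes jq is installed for JSON handling, and the default model name is just a placeholder you can override via an environment variable:

```shell
# Pipe stdin to a local model via Ollama's HTTP API and print the reply.
ai() {
  local model="${AI_MODEL:-llama3}"
  jq -n --arg m "$model" --arg p "$(cat)" \
    '{model: $m, prompt: $p, stream: false}' |
  curl -s http://localhost:11434/api/generate -d @- |
  jq -r '.response'
}
# usage: echo "explain mutexes in one paragraph" | ai
```

Setting stream to false returns one JSON object with the full response, which keeps the jq extraction trivial; streaming would require reading newline-delimited JSON chunks instead.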
The key is having both local and API models accessible with minimal friction. I shouldn't have to think about which model to use for each task. The routing should be automatic based on the task type.
Bottom line
Ollama makes local LLMs practical for everyday development. It won't replace Claude or GPT-4 for hard problems, but it handles the high-frequency, low-complexity tasks that make up a surprising portion of daily AI usage. If you have a Mac with Apple Silicon and at least 16GB RAM, set up Ollama this weekend. It takes 10 minutes and you'll find uses for it immediately.