OpenAI's o1 model is different from everything before it. Instead of generating answers token by token, it "thinks" first, spending time on internal reasoning before producing a response. You can see the thinking process in the interface as it works through the problem. I've been testing it for two weeks, and the results are genuinely interesting.
How it feels to use
The first thing you notice is the wait. Ask GPT-4 a question and you get tokens streaming almost immediately. Ask o1 the same question and you watch a "thinking" indicator for 10 to 60 seconds before any output appears. This feels slow, and for simple questions, it is slow. You're paying a latency tax for reasoning you don't need.
But for hard problems, that thinking time produces noticeably better answers. The model is essentially doing what a careful human does: understanding the problem fully before attempting a solution, considering multiple approaches, and checking its work. The output reflects this. Answers are more structured, more thorough, and more often correct on the first attempt.
Algorithmic problems
I tested o1 on 10 LeetCode-style problems ranging from easy to hard. It solved 9 out of 10 correctly on the first attempt, including two hard problems that GPT-4 consistently gets wrong. The solutions were not just correct but well-reasoned. The model explained its approach, analyzed the time complexity, and considered alternative solutions before presenting the best one.
The one it got wrong was a dynamic programming problem with a tricky state transition. Even there, the approach was reasonable; it just made an error in defining the recurrence relation. GPT-4 scores about 6 out of 10 on the same set, with most failures on medium and hard problems.
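To illustrate the kind of state transition that causes these slips, here's a classic problem of that shape (not the actual problem from my test set): stock trading with a one-day cooldown after each sale, where the recurrence needs three interlocking states. A minimal sketch:

```python
def max_profit_with_cooldown(prices):
    """Max profit from buy/sell trades with a one-day cooldown after each sale.
    Three states per day: hold (own stock), sold (sold today), rest (idle/cooldown)."""
    if not prices:
        return 0
    hold, sold, rest = -prices[0], 0, 0
    for price in prices[1:]:
        prev_hold, prev_sold, prev_rest = hold, sold, rest
        hold = max(prev_hold, prev_rest - price)  # buy only from the rest state
        sold = prev_hold + price                  # sell what we were holding
        rest = max(prev_rest, prev_sold)          # stay idle, or finish cooldown
    return max(sold, rest)
```

The subtle part is the buy transition: it must come from `rest`, not `sold`, or the cooldown constraint silently disappears. That's exactly the kind of off-by-one-state error that's easy to make and hard to spot.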
For algorithmic reasoning, o1 is the best model I've used. Period.
System design
I gave o1 three system design prompts: a URL shortener at scale, a real-time chat system, and a distributed task queue. The answers were comprehensive and showed genuine understanding of the tradeoffs involved.
What impressed me most was how o1 handled follow-up questions. When I asked "what if we need to support 10x the traffic you assumed?" it didn't just say "add more servers." It identified specific bottlenecks in its original design, explained which components would fail first, and proposed targeted solutions for each. This kind of reasoning about failure modes and scaling limits felt more like talking to an experienced engineer than a language model.
GPT-4 gives good system design answers too, but they tend to be more textbook. o1's answers felt like they came from someone who had actually built and operated these systems.
Debugging
I gave o1 five bugs of varying difficulty, from a simple off-by-one to a race condition in async code. It found all five, including the race condition that I've used as a litmus test for AI models for months. No other model has found that one without a hint.
The debugging process was visible in the thinking phase. I could see the model tracing execution paths, considering thread interleaving, and identifying the exact sequence of events that triggers the bug. This transparency is valuable because it lets you verify the reasoning, not just the conclusion.
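For context on what an async race like this looks like, here's a minimal hypothetical example in the same spirit as my litmus-test bug (not the actual code I used): a check-then-act sequence split across an await point.

```python
import asyncio

async def buggy_withdraw(account, amount):
    # Race: the balance check and the deduction are separated by an await,
    # so two coroutines can both pass the check before either one deducts.
    if account["balance"] >= amount:
        await asyncio.sleep(0)  # yields control; another withdraw can run here
        account["balance"] -= amount
        return True
    return False

async def main():
    account = {"balance": 100}
    results = await asyncio.gather(
        buggy_withdraw(account, 100),
        buggy_withdraw(account, 100),
    )
    return account["balance"], results

balance, results = asyncio.run(main())
```

Both coroutines pass the balance check before either deducts, so both withdrawals succeed and the balance goes to -100. The fix is to hold an `asyncio.Lock` across both the check and the update. Finding this class of bug requires reasoning about interleavings, which is presumably what the visible thinking phase is doing.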
Where o1 doesn't make sense
Quick questions. If I need to know a function signature, a terminal command, or a quick code snippet, waiting 30 seconds for o1 to "think" is absurd when GPT-4 can answer in 2 seconds. o1 is not a replacement for your everyday coding assistant.
Simple code generation. Writing a CRUD endpoint, a utility function, or a React component doesn't benefit from deep reasoning. The code is straightforward, and the extra thinking time adds latency without improving quality.
Anything time-sensitive. Interactive development, quick iterations, rapid prototyping. The latency makes o1 impractical for fast feedback loops.
Cost
o1 is expensive. The thinking tokens count toward your usage, and a complex problem might use 10-20x more tokens than GPT-4 for the same question. For my testing, a problem that costs $0.05 with GPT-4 might cost $0.50 with o1. This is fine for occasional use on hard problems. It's not viable for high-volume automated pipelines.
My take
o1 is a specialist, not a generalist. It's the model you reach for when the problem is genuinely hard and accuracy matters more than speed. Algorithm design, complex debugging, system architecture, and mathematical reasoning are its sweet spot. For everything else, faster and cheaper models are better choices.
The reasoning approach is the future though. Watching o1 think through a problem and arrive at a correct solution that other models can't touch is convincing evidence that "think more, not just bigger" is a viable path to better AI. I expect every major model to adopt some form of reasoning chains within the next year. o1 is the proof of concept.