
Choosing the Right LLM for Real Coding Work: Haiku, GPT-5.2, and When to Use Opus
The Problem: “Which Model Is Actually Best for Coding?”
If you follow AI news, you’d be forgiven for thinking the answer changes every week.
New models drop, benchmark charts circulate, and suddenly everything is “state of the art.” But when you’re actually shipping software—games, apps, backends—the question isn’t which model scores highest, it’s:
Which model should I use for this task, right now, without wasting time or money?
After you spend real hours inside VS Code, GitHub Copilot, and agent-style workflows, a few patterns become very clear.
Benchmarks Matter — But Only in Context
The most useful public benchmark for real-world coding today is SWE-Bench Verified. Unlike benchmarks built on toy problems, it measures whether a model can actually fix bugs in real repositories.
Approximate late-2025 standings:
- Claude Opus 4.5: ~80–81%
- GPT-5.2: ~78–80%
- Claude Sonnet 4.5: ~76–78%
- GPT-5.1 Codex-Max: ~75–78%
- Claude Haiku 4.5: ~50–55%
A few key takeaways:
- The gap at the top is small.
- Improvements are now incremental, not revolutionary.
- Small models have gotten shockingly good.
Why Claude Haiku “Feels Better Than It Should”
Claude Haiku 4.5 is a standout because it punches far above its weight.
Despite being a “small” model, it:
- Edits existing code cleanly
- Follows local repo conventions
- Responds fast enough to feel like a true copilot
In practice, Haiku often beats larger models for:
- Tight iteration loops
- Small refactors
- Navigating unfamiliar codebases
That’s why it feels so good day-to-day—even though it’s nowhere near the top of the leaderboard.
GPT-5.2 vs GPT-5.1 Codex-Max
This is a common comparison, and the answer is nuanced.
GPT-5.2 is generally better overall, especially when:
- Reasoning matters as much as code
- You need to decide what to change, not just how
- Bugs span multiple files or concepts
GPT-5.1 Codex-Max still shines when:
- Tasks are well-scoped
- You want aggressive, high-volume code generation
- Determinism matters more than judgment
If you can only pick one today:
GPT-5.2 is the better default.
Why Opus 4.5 Costs 3× (and When That’s OK)
Claude Opus 4.5 didn’t get expensive by accident.
You’re paying for:
- Longer internal reasoning
- Better tradeoff evaluation
- Fewer “confident but wrong” answers
That makes Opus ideal for high-leverage moments, not everyday work.
Good times to use Opus:
- Architecture decisions
- Deep refactors
- Repo-wide bugs that “don’t make sense”
- Evaluating AI-generated code for correctness
Bad times to use Opus:
- Boilerplate
- Formatting
- Iterative trial-and-error
- Anything you expect to redo 5–10 times
A simple rule works surprisingly well:
If you wouldn’t pay a human $3–$5 for this answer, don’t use Opus.
A Cost-Aware Model Stack That Actually Works
A practical workflow looks like this:
- Claude Haiku 4.5: fast edits, daily coding, Copilot-style assistance
- GPT-5.2: hard bugs, reasoning-heavy tasks, "what's really going on here?"
- Claude Opus 4.5: rare escalation for confidence, correctness, and big decisions
Think of Opus as a principal engineer, not a typing assistant.
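If you drive these models from your own scripts or an agent harness, the stack is easy to encode as a small escalation map. The sketch below is purely illustrative: the task categories, model identifiers, and the pick_model helper are hypothetical examples rather than part of any real SDK; it just captures the "start cheap, escalate only when stuck" idea.

```python
# Illustrative escalation map: route each task to the cheapest tier that can
# plausibly handle it, and move up the stack only after repeated failures.
# Model identifiers and task categories here are hypothetical examples.

MODEL_STACK = {
    "edit": "claude-haiku-4.5",          # fast edits, daily coding, Copilot-style loops
    "debug": "gpt-5.2",                  # hard bugs, reasoning-heavy work
    "architecture": "claude-opus-4.5",   # rare escalation for big decisions
}

def pick_model(task_kind: str, retries: int = 0) -> str:
    """Return a model id for a task, escalating one tier per failed attempt."""
    tiers = ["edit", "debug", "architecture"]
    if task_kind not in tiers:
        raise ValueError(f"unknown task kind: {task_kind}")
    # Never escalate past the top of the stack.
    index = min(tiers.index(task_kind) + retries, len(tiers) - 1)
    return MODEL_STACK[tiers[index]]

if __name__ == "__main__":
    print(pick_model("edit"))             # claude-haiku-4.5
    print(pick_model("edit", retries=2))  # claude-opus-4.5, after two failed attempts
```

The point of the sketch is the shape, not the names: the cheap tier is the default, and Opus only enters the picture when the cheaper tiers have already failed or the decision is expensive to get wrong.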
The Bigger Pattern
What’s most interesting isn’t which model is “best.”
It’s that:
- Small models are now good enough for most work
- Top-tier models are converging in capability
- The real skill is model selection, not blind upgrading
If you treat LLMs as tools with roles—not trophies—you get better software and keep your costs under control.
That’s the difference between experimenting with AI…
and actually shipping with it.