Anthropic launched Claude Opus 4.5 on November 25, 2025.

The new pricing of $5 per million input tokens and $25 per million output tokens is a very welcome surprise. I did not see that coming at all.

Early impression#

The Pagoda Garden Voxel Test#

Rendered pagoda garden voxel scene generated by Opus 4.5

The model’s attention to detail is amazing.

Rendered pagoda garden voxel scene generated by Opus 4.1
Rendered pagoda garden voxel scene generated by Opus 4.0
Claude Code v2.0.54: context usage, cost, token usage, and duration

Opus improvement in Claude Code: 4.0 → 4.1 → 4.5. Opus 4.5 shows a meaningful difference in the pagoda garden scene.

Cost going down: $0.94 → $0.70 → $0.41

Efficiency (tokens, duration):

  • 4.5 – 72k total: 885 in, 10.6k out, 30.4k cache read, 15.2k cache write, 2m 58s
  • 4.0 – 93k total: 24 in, 5.9k out, 66.2k cache read, 21.3k cache write, 2m 7s

Lines of Code (LoC): 397 → 531 → 722

Opus 4.5 quick analysis

Field Report#

Claude Opus 4.5 is a steady and worthwhile upgrade. It improves speed, reliability, and code assistance without trying to be revolutionary. The result is a smoother, more dependable model that helps more often than it gets in the way, even though it still has rough edges.

Reasoning and Problem Solving#

Opus 4.5 makes fewer logic mistakes and follows longer lines of reasoning with better consistency. It handles everyday development tasks with greater stability and avoids the contradictions that earlier versions sometimes produced. For example, older models could run tests, see failures, and still claim everything passed. Those mistakes are now far less common.

The improvement is noticeable, but not universal. Opus 4.5 handles well-defined problems reliably, but it still struggles when tasks are highly ambiguous or involve complex layers of context.

Code Quality and Refactoring#

Code quality is stronger overall. Opus 4.5 produces cleaner abstractions more often and can untangle sloppy code into a reasonable structure. Earlier versions tended to create awkward patterns, like unnecessary wrappers or overly complicated classes, that caused long-term problems. Opus 4.5 is better at trimming away junk code, simplifying the design, and reorganizing everything into a cleaner version. Merging duplicated helper functions into a single, clear utility is a good example of what it does well, as sketched below.
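To illustrate that kind of cleanup, here is a minimal, hypothetical Python sketch. The function names are invented for this example, not taken from anything Opus 4.5 actually refactored for me:

```python
import requests

# Before: two near-identical helpers duplicated across modules
# (shown as comments; names are hypothetical).
#
# def fetch_user_json(url):
#     resp = requests.get(url, timeout=10)
#     resp.raise_for_status()
#     return resp.json()
#
# def fetch_order_json(url):
#     resp = requests.get(url, timeout=10)
#     resp.raise_for_status()
#     return resp.json()

# After: one shared utility replaces both helpers.
def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and decode its JSON body, raising on HTTP errors."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.json()
```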

It is not perfect. Some generated code is still clumsy, but the weaker cases are usually easier to fix than before.

Instruction Following and Reliability#

Opus 4.5 follows instructions more faithfully than earlier models. Where systems like Codex sometimes took shortcuts or produced incomplete answers for difficult tasks, Opus 4.5 typically sticks closer to the request and puts in more effort to solve the problem properly.

Autonomy and Debugging#

The model is more capable of working through issues on its own. It can produce small reproducible cases, search for the source of a bug, and suggest fixes. During long sessions, however, its effectiveness declines because it loses details over time.

Context Handling#

Context handling remains a limitation. The model compresses older parts of a conversation, and this causes it to forget important information during long debugging sessions or multi day tasks. Losing earlier steps can make progress harder.

Speed and Responsiveness#

Speed is one of the most noticeable improvements. Opus 4.5 responds much faster than its predecessor. This increase alone makes the model feel more helpful and more comfortable to use in daily development.

Out-of-Distribution Test Case#

Novel DSL-Based Game AI Generation

Much of this review is based on a multi-day task in which Opus 4.5 helped generate game-playing AI agents (NPCs) by producing behavior trees (BTs) expressed in a custom DSL. The DSL was designed to be both human-readable and machine-executable, serving as the bridge between LLM-generated strategies and the system that executes them. For example, a simple FPS policy might include conditions for detecting enemies, selecting a random visible target, or moving toward known enemy locations, all described in clear DSL structures.

The project also introduced a broader innovation: reframing decision making as a language modeling problem. Instead of training agents directly in an environment, I used Opus 4.5 to generate DSL representations of BTs. Despite the complexity of this setup (including the evolving design of the DSL and the need for precise policy structures), Opus 4.5 handled the task reasonably well and often interpreted goals accurately even when they were not fully defined.
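To make the setup concrete, here is a minimal Python sketch of the kind of behavior tree such a DSL might describe. The combinators (selector, sequence, condition, action) and the toy FPS policy below are hypothetical illustrations based on the behaviors mentioned above, not the project's actual DSL or code:

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical world state; field names are illustrative only.
@dataclass
class Blackboard:
    visible_enemies: List[str] = field(default_factory=list)
    last_known_enemy_pos: Optional[Tuple[float, float]] = None
    target: Optional[str] = None

# A node takes the blackboard and reports success or failure.
Node = Callable[[Blackboard], bool]

def sequence(*children: Node) -> Node:
    # Succeeds only if every child succeeds, in order.
    return lambda bb: all(child(bb) for child in children)

def selector(*children: Node) -> Node:
    # Tries children in order; succeeds on the first that succeeds.
    return lambda bb: any(child(bb) for child in children)

def condition(pred: Callable[[Blackboard], bool]) -> Node:
    return lambda bb: pred(bb)

def action(fn: Callable[[Blackboard], None]) -> Node:
    def run(bb: Blackboard) -> bool:
        fn(bb)
        return True
    return run

# Toy FPS policy mirroring the behaviors described above:
# engage a random visible enemy, otherwise move toward the
# last known enemy position.
policy = selector(
    sequence(
        condition(lambda bb: bool(bb.visible_enemies)),
        action(lambda bb: setattr(bb, "target", random.choice(bb.visible_enemies))),
    ),
    sequence(
        condition(lambda bb: bb.last_known_enemy_pos is not None),
        action(lambda bb: print(f"moving toward {bb.last_known_enemy_pos}")),
    ),
)

bb = Blackboard(visible_enemies=["grunt_1", "grunt_2"])
policy(bb)
print(bb.target)  # picks one of the visible enemies
```

In the real setup, Opus 4.5 emitted DSL text rather than Python; the game side presumably parsed that text into executable structures along these lines before running them.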

Overall Capability#

Overall, Opus 4.5 behaves like a capable but inconsistent junior engineer. Sometimes it achieves results similar to someone with more experience, and other times it misses on simple tasks. It is clearly better than Opus 4, just not dramatically so.

Final Assessment#

Opus 4.5 is a good model. The improvements are meaningful but not groundbreaking. The increased speed, slightly sharper reasoning, and smoother performance make it more pleasant to use, but it has not changed my workflow in any major way. Mostly, it has led me to use Claude a little more again after switching to GPT-5 Codex, which still feels stronger for more complex software engineering challenges.