Right. Opus 4.5, 8 months ago, was good enough for agentic coding. How far behind that are open-weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?
GitHub Copilot in vscode has two ways to access Opus: the Copilot harness or the Claude Code Agent SDK within Copilot.
And that's if we assume that the vscode GHCP default Agent ("Local") is the same as the "Copilot CLI" one that is also selectable in vscode. I have not tried that one.
A few weeks ago the Claude Code Agent SDK was much better than the default Copilot Agent, but nowadays I am not sure.
I don't know about better, but it's certainly different. It's painfully slow through the claude code vscode extension compared to copilot, but maybe "smarter": using sonnet on both, I feel like I have to correct it less. I don't use opus much because of the cost, but coworkers say the difference between harnesses there is also pronounced.
I've tried Opus 4.6 in the Opencode harness through the GitHub Copilot API, and I've tried Opus 4.8 in Claude Code. I found I preferred Opus 4.6 in Opencode (and in general, I like Opencode much more since it hid less from me). I found both to be pretty similar in efficacy (I was surprised that Opus 4.8 felt like such a minor improvement over 4.6).
GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but a small business could run it quite easily (or use something like OpenRouter).
Opus also has a deeply ingrained personality that always de-rails sneakily into what it's taught, not what the user intends. This is good if the user doesn't know the details of the work they need performed and a huge time waste when the user knows exactly how something needs to be implemented.
I have found claude models, especially fable, to be impossible to work with when the work requires reading papers from days ago and reasoning on top of the findings in them. I have multiple long sessions with opus (not as many with fable as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after I've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right.". If your workflow is using the exact tools, frameworks, git layouts that claude expects, it can be magical, yes. But it is very heavily optimized to never say 'I am not sure' (as that gives 'bad vibes') and instead lean on its (nowadays with the speed of things DOE) knowledge to formulate a reasonable sounding answer, dissectible only if you already know the answer beforehand (which defeats the purpose of using it in the first place).
Qwen3.6 27B (the only <100B model worth looking at in my experience) is dumb, knows it, and will fight tooth and nail to complete the task it was given, gaining the needed context (online or file-wise) in the meantime. If you mention it should read papers, it goes and reads a pile of papers. If you tell it 'implement MCP in my app', the result will (probably) be catastrophic. If you instead describe where the feature should sit, how it should handle edge cases, what use cases it needs to attend to, and to first look online for reference implementations, it does it and does it well.
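To make the "describe it properly" point concrete, here is a rough sketch of the kind of task brief meant above. Everything in it is hypothetical (the `TaskBrief` name, the fields, the example values); it is just one way to force yourself to spell out location, use cases, edge cases, and the "research first" step before handing work to a small model.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical container for the details a small model needs up front."""
    goal: str
    location: str  # where in the codebase the feature should sit
    use_cases: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
    research_first: bool = True  # ask for reference implementations first

    def render(self) -> str:
        """Turn the brief into a plain-text prompt."""
        lines = [f"Goal: {self.goal}", f"Where it lives: {self.location}"]
        if self.use_cases:
            lines.append("Use cases:")
            lines += [f"- {u}" for u in self.use_cases]
        if self.edge_cases:
            lines.append("Edge cases:")
            lines += [f"- {e}" for e in self.edge_cases]
        if self.research_first:
            lines.append(
                "Before writing code, look online for reference "
                "implementations and summarize what you found."
            )
        return "\n".join(lines)


# Example values are made up for illustration.
brief = TaskBrief(
    goal="Add MCP support to the app",
    location="src/integrations/mcp/ (new module, no changes to core)",
    use_cases=["list available tools", "call a tool with JSON arguments"],
    edge_cases=["server unreachable", "tool returns malformed JSON"],
)
prompt = brief.render()
print(prompt)
```

The point is not the class itself but that an empty `edge_cases` list is visible to you before the model ever sees the prompt.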
Knowing what is in context, what should and shouldn't be there, and how to manage it for the specific model you are using (as every model, even in the same family, behaves differently to differently worded prompts) is what makes or breaks them. They are just auto-complete: they complete text based on what is already there. It's not magic.
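As a very rough illustration of "managing what is in context", a minimal sketch that keeps the system prompt plus only the most recent messages that fit a budget. The function names and the word-count token estimate are assumptions for the example; a real setup would use the model's own tokenizer and probably smarter selection than "most recent wins".

```python
def approx_tokens(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def trim_context(system: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest messages that fit in `budget`."""
    remaining = budget - approx_tokens(system)
    kept: list[str] = []
    # Walk from newest to oldest and stop at the first message that won't fit.
    for msg in reversed(messages):
        cost = approx_tokens(msg)
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    # Restore chronological order after the system prompt.
    return [system] + kept[::-1]


history = ["old long message here", "recent note", "latest"]
print(trim_context("be terse", history, budget=5))
```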
So yes, while these small open-weights models are not opus 4.5, they're good precisely because of that: they are a good tool and a bad 'coworker replacement'. If you want the latter, kimi is already there; it has started to not believe the user and do what it was taught, just like claude models (which is helpful when you don't care about implementation specifics or performance/security). GLM models (mostly 5.1, I haven't tested 5.2 extensively yet) have fixed a lot of low-level programming issues I've had, ones where opus just walks in circles and writes reports that "it doesn't/can't work". That is to say, open-weights, in many cases, have already surpassed Opus. I can't comment on gpt 5.5, but while I used 5.4, it also performed a lot more tasks without being fussy than opus 4.6/4.7.
> I have multiple long sessions with opus (not as many with fable as it got taken down quickly) where it keeps fighting me on problems, saying "that's not how it works" / "that is not possible", followed by me linking the paper (after I've told it to actually read up on the latest research in this field), and it hits me with the usual "You were right.".
I genuinely do not understand why people not only just put up with this but also pay _a lot of money_ for the _privilege_ of doing so.
It's like having _the worst_ colleague but you actually go out of your way to talk with the guy. Why.
I agree 100%. All of the models do it to some extent after the context gets tired, but opus is the worst and the sneakiest. And even when you do coerce it into doing what you want it feels like something out of r/maliciouscompliance. Much more so than most non-anthropic models. Way more so than codex/gpt or even gemini.