So how do openai and anthropic plan to keep customers when GLM-5.1 is just as go...

peder · 2026-05-27T18:15:07 1779905707

> I don't see the business model working.

Same. It's a nightmare from a Porter's Five Forces perspective.

There will be a ton of businesses competing in this space, and there will be something of a moat due to how capital intensive the business can be, but there will still basically be infinite competitors.

Great for consumers.

ex-aws-dude · 2026-05-27T19:53:37 1779911617

Well in reality AWS will just host one of them and most companies will use that

Like how snapchat kind of fell off because the feature could just be a subset of instagram

It seems like it would just become a commodity like EC2

bakies · 2026-05-28T01:05:38 1779930338

Snapchat is huge and growing

ex-aws-dude · 2026-05-28T16:13:11 1779984791

Totally, the stock is doing so well

beanard · 2026-05-28T16:55:15 1779987315

i can't wait for their foray into AR!

mesmertech · 2026-05-27T17:29:56 1779902996

For coding you always want to go with the best model in the category, not something that would be the best model if we went 1 year back which GLM 5.1 is, and I'm saying that as a big fan of GLM cause I run a translation site where GLM is good enough for the price.

Most of the money right now is in coding. Openai and Anthropic just have to be 6 months ahead of SOTA open source models and they'll capture most of the enterprise and dev market

binary0010 · 2026-05-27T17:36:49 1779903409

Yes I'm an engineer (20 years most in games/graphics industry) and only use it for code. I've been using glm 5.1 this week a lot. I went in expecting another "decent" but not really "up to standard" open source model.

I highly doubt I'll ever use Claude again.

I think you are wrong about Claude being any significant level better

cassianoleal · 2026-05-27T18:02:48 1779904968

I've been mostly coding with GLM-5.1 as well and I agree with you. DeepSeek V4 Flash is another very good surprise. Incredibly cheap, fast and effective.

MaKey · 2026-05-28T11:06:55 1779966415

I've been using DeepSeek v4 Flash with OpenCode for the whole week to refactor a Terraform code base I inherited and it worked surprisingly well.

aspenmartin · 2026-05-28T00:15:18 1779927318

Well I think there are a multitude of harder measurements that would disagree with you, but ultimately there is absolutely a use case for cheaper open models (or even cheaper tiers of proprietary models) and in fact the unsolved optimization everyone is trying to get to is how much spend to use for a given task. But there will always be a market, especially in enterprise, for the best performance there is to offer

ggttk · 2026-05-28T03:23:51 1779938631

Why are you boosting so hard? Lmao either you’re a paid poster or you own stock in a frontier firm. Which one?

aspenmartin · 2026-05-28T03:56:03 1779940563

Sadly neither :(

RevEng · 2026-05-28T03:01:53 1779937313

I strongly disagree. I'm an engineer - I'm all about the fastest, cheapest thing that meets the requirements. I don't need Opus 4.7, even for my complex programming tasks. It costs over 10x other models available that still give good enough answers. Those smaller models are also a lot faster to output tokens, which saves me time.

Once the model gets good enough, the returns on bigger models diminish quickly. I don't want to spend 10x the money and wait 5x the time to get answers that are equivalent.

yokoprime · 2026-05-28T06:08:08 1779948488

Same here, i can't say i've seen any difference in 4.6 vs 4.7 other than price

odie5533 · 2026-05-27T19:35:38 1779910538

If I generate code with Claude, ChatGPT, and GLM 5.1, I can't say which model is which reliably. I exclusively use Claude more out of superstition than reason.

lunar_mycroft · 2026-05-27T22:45:49 1779921949

> For coding you always want to go with the best model in the category

This is transparently false, because the best "model" is still competent human developers. They're just more expensive. If you're willing to use current LLMs at all, it means you're willing to sacrifice quality for a better price, and your disagreement with the comment you were replying to is entirely about what the optimum tradeoff is.

aspenmartin · 2026-05-28T00:17:55 1779927475

Well it may be false that you always want the best model, but the point is performance of you+<agent> is far more cost effective than you+someone else

lunar_mycroft · 2026-05-28T12:50:13 1779972613

Maybe, but that's a different claim than the one I was responding to. It also raises the question: "if the lower quality but cheaper output of frontier models is more cost effective than humans, is the even lower quality but even cheaper output of OSS models more cost effective still?" With an absolute rule like GP suggested ("no, you always want the best code generator") the answer is clear, but it gets much murkier if you reject such rules (as you have to in order to be an LLM coding proponent)

aspenmartin · 2026-05-28T15:13:50 1779981230

I think that’s a fair and good q and point.

noname120 · 2026-05-28T07:53:37 1779954817

It was true 6 months ago, not anymore. Frontier models now outperform developers on many tasks, be it on quality/readability/maintainability, and let’s not talk about speed…

lunar_mycroft · 2026-05-28T12:40:00 1779972000

I've seen the code they produce without extensive help from human developers, this is clearly false.

Good to see the classic "yeah the models weren't good enough six months ago, but this time they actually are, promise! Please forget you were hearing the exact same thing six months ago!" is alive and well though.

aspenmartin · 2026-05-28T15:17:00 1779981420

Are you aware of performance trends though? You’re painting a picture that seems to ignore how things have consistently trended for many years now, even pre ChatGPT. It is absolutely data driven to say “an inflection point has happened within the last 6 months”. And that was also true 6 months ago (where people started using coding agents fairly consistently since sonnet 4). And it was true 6 months before that. It’s not like people are like “we’ve fixed all the bugs!” And then nothing has changed. I don’t necessarily agree with the parent poster that agents are better than humans but they are certainly much better at many tasks.

lunar_mycroft · 2026-05-28T16:13:38 1779984818

> Are you aware of performance trends though? You’re painting a picture that seems to ignore how things have consistently trended for many years now, even pre ChatGPT.

Models have been getting better, but all that follows from that is that newer models tend to be better than older ones. It doesn't follow that they have (or even will in the future) gotten better than anything else, be that human developers, a given definition of good enough, etc.

> It is absolutely data driven to say “an inflection point has happened within the last 6 months”.

With all due respect to OP (who I think is responsible for popularizing that way of phrasing it), I don't think it is when you consider the actual definition of "inflection point". At best I think you can say that models crossed a lot of developers' definition of good enough around then, which is a different thing. The problem I have with that is that, as a (mostly) outsider looking in, it doesn't seem like they're right.

aspenmartin · 2026-05-28T16:19:06 1779985146

> Models have been getting better, but all that follows from that is that newer models tend to be better than older ones. It doesn't follow that they have (or even will in the future) gotten better than anything else, be that human developers, a given definition of good enough, etc.

But this is not true, you’re saying we only have relative performance numbers and not absolute measures of capabilities and reliability but that’s simply not true. OSS benchmarks as well as the internal flywheels of these companies are good complementary measurements.

> At best I think you can say that models crossed a lot of developers' definition of good enough around then, which is a different thing

That’s the inflection point. Implication is a massive jump in adoption. We’re not like pulling this out of a hat, there are a number of compelling datapoints. The onus is on people to bring actual evidence that contradicts all of the data and observations we have.

lunar_mycroft · 2026-05-28T16:48:34 1779986914

> you’re saying we only have relative performance numbers and not absolute measures of capabilities and reliability but that’s simply not true.

No, I'm saying that the claim you were making ("current models are better than some non-model based standard X") does not follow from your premise ("current models are better than past models"). It's possible that your claim is still true (although I don't think it is for most of the values of X that matter), but that wouldn't change the fact that the argument made is invalid.

As stated, your argument was basically the classic "my 3-month-old is now twice the size he was when he was born" meme, except if the tweet claimed that the kid currently out weighed an elephant.

> That’s the inflection point.

No, it isn't. An inflection point is when the direction of curvature changes. If we crossed over into the diminishing returns part of the logistic function, that would be an inflection point (as would the case where we had been in the diminishing returns regime, but then progress went back to speeding up).
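To make the definition concrete, here is a short worked version for the logistic curve mentioned above. The symbols $L$ (ceiling), $k$ (growth rate) and $t_0$ (midpoint) are illustrative assumptions, not anything measured from model benchmarks:

```latex
% Logistic curve with ceiling L > 0, rate k > 0, midpoint t_0 (all assumed symbols)
f(t) = \frac{L}{1 + e^{-k(t - t_0)}}

% Second derivative, writing u = e^{-k(t - t_0)}:
f''(t) = \frac{L k^2 \, u \, (u - 1)}{(1 + u)^3}

% Sign of the curvature:
%   t < t_0  =>  u > 1  =>  f''(t) > 0   (progress is speeding up)
%   t = t_0  =>  u = 1  =>  f''(t) = 0   (the inflection point)
%   t > t_0  =>  u < 1  =>  f''(t) < 0   (diminishing returns)
```

So on this curve the inflection point is where growth stops accelerating, not where the curve crosses some "good enough" threshold.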

> Implication is a massive jump in adoption.

The point I made was that "a massive jump in adoption" doesn't actually imply "the models are actually good enough now", only that a lot more people think they are.

aspenmartin · 2026-05-28T17:42:23 1779990143

OK, I am having the wrong conversation, you are right. To recap the thread:

- best model is still a human: this I SORT of agree with, but like I say it's uneven

- response is: "this was true 6 months ago but is now false" -- that is sort of a mixed bag; if it's saying we can now replace SWEs that's demonstrably wrong, if it's saying it now clearly has superior abilities in many parts of the SWE workflow then it's demonstrably right. I would argue this has been true for longer than 6 months.

- you say: "I've seen the code they produce without extensive help from human developers, this is clearly false." -- I agree with you that you need to help coding agents substantially, but I think at this point the convo is unclear what anyone is actually addressing or responding to

> No, I'm saying that the claim you were making ("current models are better than some non-model based standard X") does not follow from your premise ("current models are better than past models").

that isn't my premise though, but I admit I misread you. I think you are saying "people keep saying it's good enough to replace SWEs(?)" every six months and they are wrong every time. I don't disagree that we have not gotten to a "we don't need SWEs anymore" point, but I think it's a bit of a strawman: who is making the claim you are addressing?

> As stated, your argument was basically the classic "my 3-month-old is now twice the size he was when he was born" meme, except if the tweet claimed that the kid currently out weighed an elephant.

No no, I'm talking about performance in absolute terms. These are strong proxies (SWE-bench, etc) though they have serious limitations. The "yea but when is failure rate low enough for us to replace an entire tranche of processes" question is harder to answer, but the strong proxy for that is adoption.

> The point I made was that "a massive jump in adoption" doesn't actually imply "the models are actually good enough now", only that a lot more people think they are.

No but then the point I'm making is we're drifting further and further away from Occam's razor.

> No, it isn't. An inflection point is when the direction of curvature changes. If we crossed over into the diminishing returns part of the logistic function, that would be an inflection point (as would the case where we had been in the diminishing returns regime, but then progress went back to speeding up).

I admit inflection point may be the wrong term here, but I hope you know at least what I'm trying to say; maybe like regime change or something. But plenty of data supports a major change around Nov to ~Jan: revenue, weekly active users, business subscriptions, GitHub commit estimates. You pick which is your favorite data source, but they are all complementary and all point to the same thing.

lunar_mycroft · 2026-05-28T18:57:37 1779994657

First, to clarify my own position here: I use LLMs for code review, to help with some planning, for the occasional throwaway prototype, and as a more advanced rubber duck, but I do not let LLMs write code I care about, even with human review (because human review is imperfect).

> it's uneven... if it's saying it now clearly has superior abilities in many parts of the SWE workflow then it's demonstrably right.

In what ways are current LLMs better excluding speed and cost (because getting those things by relaxing constraints on quality has always been trivially possible)? Even the fabled (heh) Mythos seems to be at best roughly equivalent to a competent human security researcher.

> I think you are saying "people keep saying it's good enough to replace SWEs(?)" every six months and they are wrong every time. I don't disagree that we have not gotten to a "we don't need SWEs anymore" point, but I think it's a bit of a strawman: who is making the claim you are addressing?

Most of them aren't saying that the models are good enough to fully replace developers, but this definitely isn't a strawman. I've been seeing the same basic claim for at least 18 months at this point.

> No no, I'm talking about performance in absolute terms. These are strong proxies (SWE-bench, etc)

Unless you have some non-LLM scores to compare to, those are still relative measures. They show/suggest that LLMs are getting better (at least in some ways), but without a definition of "good enough" in the same metric, that isn't sufficient to say whether or not they are.

> No but then the point I'm making is we're drifting further and further away from Occam's razor.

Both sides of the debate have to explain the fact that a lot of developers disagree with them, so I don't think this argument really works.

aspenmartin · 2026-05-28T20:28:12 1780000092

> I do not let LLMs write code I care about, even with human review (because human review is imperfect).

That's fine but you are in the quickly vanishing minority.

> In what ways are current LLMs better excluding speed and cost (because getting those things by relaxing constraints on quality has always been trivially possible)? Even the fabled (heh) Mythos seems to be at best roughly equivalent to a competent human security researcher.

Well this is what I mean by benchmarks and measurement efforts. Lots of gaps in capabilities but we've had say superhuman competitive programming performance for awhile (including on fresh tasks not in training sets), extremely strong performance (super-p90-engineer) on say language-to-language porting, RE-bench (ML research engineering benchmark from METR) is already clearly above human perf, Mythos clearly (unless you believe this is all a massive fraud) has superior cyber capabilities, etc. Also, why do you discount speed and cost so much?

> Most of them aren't saying that the models are good enough to fully replace developers, but this definitely isn't a strawman. I've been seeing the same basic claim for at least 18 months at this point.

Yea but what's the basic claim you're referring to here? Every model iteration is a significant bump up in performance according to a lot of complementary and principled measurements. What's been the thing that hasn't been true?

> Unless you have some non-LLM scores to compare to, those are still relative measures. They show/suggest that LLMs are getting better (at least in some ways), but without a definition of "good enough" in the same metric, that isn't sufficient to say whether or not they are.

There are human baselines in plenty of these benchmarks number one, and number two while no one is going to be able to tell you "once SWE-Bench Pro perf numbers get to X we can then refactor our existing process to completely offload task Y to agentic frameworks" thats a bit of a crazy ask. These numbers are pretty interpretable and many are pretty robust to things like training set leakage. What would you want to see here?

> Both sides of the debate have to explain the fact that a lot of developers disagree with them, so I don't think this argument really works.

Yet one side has a mountain of hard evidence and the other side has...an outdated n < 20 METR study using Sonnet 3?

lunar_mycroft · 2026-05-29T15:20:17 1780068017

> we've had say superhuman competitive programming performance for awhile

Fair. Question though, is this when compared to competitive programmers, or developers in general?

> extremely strong performance (super-p90-engineer) on say language-to-language porting

I'd need to see the methodology here and could easily be wrong, but I suspect this is largely down to "faster" and "willing to do a lot more of it without complaining"

> RE-bench (ML research engineering benchmark from METR) is already clearly above human perf

This pretty much has to be "relative to devs who don't specialize in that area", because if it wasn't the frontier labs wouldn't be paying a fortune to hire ML researchers.

> Mythos clearly has superior cyber capabilities

Based on Daniel Stenberg's experience with it [0], it seems like it's at best roughly on par with human experts. Its advantage is cost/speed.

> Also, why do you discount speed and cost so much?

Because in all the domains LLMs are applicable to, getting something cheaper/faster at the expense of quality isn't new or particularly interesting.

> Every model iteration is a significant bump up in performance according to a lot of complementary and principled measurements. What's been the thing that hasn't been true?

That they were good enough. To reuse the baby analogy, if every week your friend told you that their infant child was now heavier than an elephant (while acknowledging that the baby was lighter than one the previous week), and every week that turned out not to be true, it wouldn't be a defense of your friend to argue "ah, but the baby was heavier every week than the week before".

Also worth noting that as of ~8 months ago, while benchmark scores were steadily increasing, merge rates (aka whether the code was "good enough") were not [1].

> thats a bit of a crazy ask.

Why? If you use LLMs to do anything you're basically doing that already, it's just that the scope of your Y is smaller. Either the benchmarks are irrelevant and you're using something else to determine when that's appropriate for a given Y, or you do in fact have a value of X for the Y's you've handed over to LLMs.

> Yet one side has a mountain of hard evidence and the other side has...an outdated n < 20 METR study using Sonnet 3?

There's a lot of irony here, because by far the most common pro-LLM coding argument is "I feel like I'm producing good code faster with them", followed by "this other person feels like they're producing good code faster with them".

Also note that the most important part of the METR study you reference wasn't the slowdown they observed, it was the dramatic disagreement between what the participants thought the impact of AI was vs what it actually was. That isn't dependent on the model.

[0] https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-v...

[1] https://entropicthoughts.com/no-swe-bench-improvement

suddenlybananas · 2026-05-28T08:23:42 1779956622

Why is anthropic hiring software developers then?

aspenmartin · 2026-05-28T15:17:25 1779981445

Because they still need them?

suddenlybananas · 2026-05-28T15:37:29 1779982649

Why would they still need them if "[f]rontier models now outperform developers on many tasks, be it on quality/readability/maintainability, and let’s not talk about speed"

aspenmartin · 2026-05-28T15:39:54 1779982794

Because to replace a SWE you need them to reliably outperform developers on ALL tasks

suddenlybananas · 2026-05-28T16:02:50 1779984170

But anthropic already had plenty of developers. Why would they actively need to hire more if the workload is all being automated?

aspenmartin · 2026-05-28T16:07:10 1779984430

Because it’s a force multiplier at this stage in the capability ladder. ROI of developers is arguably higher or will become higher in a matter of months.

hajile · 2026-05-28T16:38:29 1779986309

That's some really fast goalpost moving.

If AI could outperform humans, Anthropic would NEVER release that model. Instead, they'd use it to create a new google, photoshop, office, windows, etc for cheap, then undercut all those companies and take over the entire software industry.

aspenmartin · 2026-05-28T17:12:44 1779988364

It can outperform humans, just unevenly. You’ll see a lot of the same dynamics as you see with Mythos which, tbh is kind of refreshing. I get the sense that Dario while of course forced to ruthlessly run a company is genuinely interested in figuring out how to roll this out as ethically as he can.

lunar_mycroft · 2026-05-28T16:52:38 1779987158

Worth noting that the person you're replying to is not the same as the one who said frontier models now outperform developers.

eikenberry · 2026-05-27T19:51:39 1779911499

> For coding you always want to go with the best model in the category [..]

And this is why many companies go out of business. You always want the best bang for your buck, sometimes this is the "best model" and sometimes it is not.

kgwgk · 2026-05-27T17:34:46 1779903286

For coding, like for everything else in life, cost is a factor.

mesmertech · 2026-05-27T18:14:23 1779905663

Cost for the value delivered. Like if you offered the current SOTA open source models at $0.1/M, I still think I'd be using Opus or 5.5 at $30/M. Or say GPT 5, which was released Aug '25: I don't think I'd use it for coding even at $0.1. I'd def find other uses for it (translations, agentic workflows, prompt guards etc), but for coding I don't think I'd ever completely switch to a SOTA open model

Unless ofc there was an actual speed difference: the only reason I'd be willing to go with a model a couple of percent worse than the current best is if the speed was at least 5x higher. Looking forward to kimi k2.6 offered publicly by Cerebras

kgwgk · 2026-05-27T18:33:23 1779906803

> I still think I'd be using

That's fine. Other people may not want to pay 300x more and will rather make do with last year's SOTA.

> For coding you always want to go with the best model

Maybe you meant "For coding I always want to go with the best model"?

mesmertech · 2026-05-27T21:40:35 1779918035

Based on current market for LLMs I'd say my use of "you" in the general is fine. Even openrouter which doesn't capture all of the SOTA closed models but nearly all of opensource model usage has Opus as 1st(on last week) on "Programming" category and 3rd in overall rankings

https://openrouter.ai/rankings

simonw · 2026-05-27T21:44:32 1779918272

I'd trust the OpenRouter rankings a lot more if they exposed the number of unique users for each model, as opposed to just a token count.

Currently I have no way of telling if big changes in their rankings are caused by a single "whale" switching providers, or if it's a more meaningful trend.

mesmertech · 2026-05-27T21:52:42 1779918762

My point was that even openrouter, the one place people who are looking for open source SOTA models go to, doesn't definitively have opensource models at the top. Esp considering quite a lot of the closed models' usage is through AWS, GCP, Azure etc, probably dwarfing the usage on openrouter by a huge factor

danny_codes · 2026-05-28T05:02:18 1779944538

Why? If it's good enough, it's good enough. Though I read the code that gets vibed so maybe my use-case is different.

yokoprime · 2026-05-28T06:06:39 1779948399

It's driven a lot by the harness too. If you're using claude code, you're actively being pushed towards newer models, even though older ones work perfectly fine for your use cases

danny_codes · 2026-06-05T03:41:45 1780630905

Yeah wouldn't touch ClaudeCode when there are so many better harnesses that are free and portable. Seems like a waste of time to learn a proprietary tool when the FOSS ones are better.

solomatov · 2026-05-27T23:24:47 1779924287

> For coding you always want to go with the best model in the category, not something that would be the best model if we went 1 year back which GLM 5.1 is, and I'm saying that as a big fan of GLM cause I run a translation site where GLM is good enough for the price.

Currently, the difference is substantial, but what happens if capabilities saturate?

aspenmartin · 2026-05-28T00:16:51 1779927411

Then the house of cards comes crumbling down, but there is so much evidence to point to this not happening that it requires a bit of a theory for how that may happen

solomatov · 2026-05-28T00:53:51 1779929631

> but there is so much evidence to point to this not happening

Could you explain this?

aspenmartin · 2026-05-28T04:22:50 1779942170

Well I think there are several fairly stable trends that paint a pretty compelling picture:

- performance scales with compute very very reliably. We have “scaling laws” (and have for years) and they are almost miraculously stable and show no sign of being invalidated at all even at the very largest scales. There are some theoretical bases for this though I’m not as familiar with the details

- these scaling laws are on an unintuitive quantity (validation loss on pretraining datasets), so we can look at downstream performance. Benchmarks are a minefield of junk but there are many decent ones and enough variety of techniques and data sources and scoring methods etc that in aggregate they are useful. The single number that I think is the best summary statistic across the crazy (O(100k)) number of benchmarks is the “epoch capability index” (just some branding over a reasonably standard statistical model that was really well thought out and a great idea). The trends in this are extremely stable. Eyeballing the trend over time on their graph we’re getting basically a GPT-4 to GPT-5 level capability improvement every ~18 months

- coding agents are not limited by the quality of the human training data they’re trained on, this is such a massive misconception: human data is only a bootstrap to a reinforcement learning phase. This combined with the fact that we have verifiable rewards means it’s just a matter of when not if for any given level of reliability.

- the massive compute investment implies that the compute that we’re building over the next 2-3 years will 10x the effective compute for training models. That combined with various R&D contributions (historically which have been very significant and there is no shortage of wins here), better data curation and flywheels, richer data (wait until conversation capability gets good) means we have several orders of magnitude of runway that we know of, today.

In short I don’t see any compelling evidence to suggest all of the trends we observe in many different ways will end any time soon.

EGreg · 2026-05-27T17:32:10 1779903130

Most work is not coding.

And also, people have it wrong… their models are not the main problem anymore. It’s the RAG

tomrod · 2026-05-27T19:09:14 1779908954

Would love to hear more about your thought about the RAG.

simonw · 2026-05-27T19:14:06 1779909246

I think RAG is a mostly outdated concept now, it's been subsumed by the idea of an "agent harness" which is exactly what Claude Code and Claude Cowork and OpenAI Codex and Claude.ai and ChatGPT themselves have now become.

An agent harness with access to a good search tool is a much more interesting thing than 2024-era RAG systems.

tomrod · 2026-05-29T02:26:15 1780021575

I appreciate where you are coming from, as you have surfed the front of the wave of GenAI for years. From my point of view, there is interesting because something is SOTA, and there is interesting because there is still more to build. I definitely understand state of RAG tech. I also view it as barely utilized versus what we can do with it, hence my question.

Agent harnesses integrated into good search tools are definitely interesting. Knowledgebasing with partitions and similar structure also remains fruitful for applications, above and beyond standard ElasticSearch on a cache.

Traveler42 · 2026-05-30T23:42:18 1780184538

I generally agree with this, but would note that it assumes that the data is accessible from a web search. Some data sources will be private.

simonw · 2026-05-31T00:27:58 1780187278

You can configure extra search tools that search private data.

EGreg · 2026-05-28T21:34:36 1780004076

And how exactly does the agent harness surface ALL the right places that need to be updated, and reason about functions and APIs?

obsidianbases1 · 2026-05-27T17:35:13 1779903313

Depending on RAG is a workflow problem, not an AI problem

Andrex · 2026-05-27T18:53:13 1779907993

> For coding you always want to go with the best model in the category

Will this always be true? There will never be an event horizon/point of diminishing returns where something not-bleeding-edge is "good enough" for 51%+ of users?

mesmertech · 2026-05-27T21:48:10 1779918490

As long as closed source is 6 months ahead in terms of current difference. Although this is hard to figure out using simple percent based coding benchmarks, you def. notice it when you're actually trying to do a long task. Even simple things like UI "taste" is enough for me to use opus instead of 5.5 though even though 5.5 is strictly better for anything that doesn't have a UI, ie backend, scripts, making agent workflows etc

blackjack_ · 2026-05-27T18:49:10 1779907750

This is a silly take. There is a line of "good enough" for most coding (most CRUD apps and APIs are nothing special), and once we are past that, nobody will care about having the "newest, best" model except extreme outliers. And this base "good enough" model will become an ultra cheap commodity as we already see with GLM, deepseek, etc.

mesmertech · 2026-05-27T21:44:43 1779918283

As long as closed models are 6 months ahead I won't be switching from them to prev. 6 month SOTA open source models. Maybe it's just a different calculation if you're in a job, but as an indiehacker I'll take any edge I can get

Ofc again, I can be convinced to switch if there's a clear speed difference, like 5x+ for an open source model, even if it was SOTA 6 months ago

vidarh · 2026-05-28T12:33:27 1779971607

I have stats from a harness that tells me glm5.1 is far more cost effective for us than Opus with the rate of defects and rework taken into account. In fact, with a decent harness I'm now increasingly favouring eHaiku over Opus for execution too. Opus is still worth it for planning, though, and far better at one-shotting things.
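For anyone wondering what "cost effective with the rate of defects and rework taken into account" could look like as arithmetic, here is a minimal sketch. It is not the harness mentioned above: the function name, the retry-until-accepted model, and every number below are hypothetical and only there to show the shape of the tradeoff.

```python
def expected_cost_per_accepted_task(price_per_attempt, success_rate, rework_cost):
    """Expected spend to get one accepted result, retrying until one passes.

    Assumes each attempt succeeds independently with probability success_rate,
    and each failed attempt costs rework_cost (human time to catch and redo it)
    on top of the price of the attempt itself.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1 / success_rate        # mean of a geometric distribution
    expected_failures = expected_attempts - 1   # every attempt but the last one fails
    return expected_attempts * price_per_attempt + expected_failures * rework_cost


# Hypothetical inputs: (price per attempt, success rate). Not measured data.
CHEAP_MODEL = (0.10, 0.70)
PRICEY_MODEL = (1.50, 0.90)

for rework_cost in (2.00, 10.00):
    cheap = expected_cost_per_accepted_task(*CHEAP_MODEL, rework_cost)
    pricey = expected_cost_per_accepted_task(*PRICEY_MODEL, rework_cost)
    winner = "cheap" if cheap < pricey else "pricey"
    print(f"rework={rework_cost:5.2f}  cheap={cheap:.2f}  pricey={pricey:.2f}  -> {winner}")
```

With these made-up inputs the cheaper model wins when a failure is cheap to fix and loses when rework is expensive, so the answer depends on your own defect and rework numbers rather than on list price alone.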

Perz1val · 2026-05-28T08:05:52 1779955552

And you propose the same companies that have been cost cutting and avoiding buying you a chair for ever won't start objecting to a $200/dev/month subscription? The finance department won't have a say?

r0b05 · 2026-05-28T18:20:06 1779992406

Why do you need to go with the best model for coding?

dogleash · 2026-05-27T18:58:30 1779908310

> For XXX you always want to go with XXX, not XXX

Oh, hey, I recognize you. Thank you for the very forward and thorough orbital sander recommendation at Home Depot. That's exactly what I wanted to deal with on my holiday weekend. You just know so much about this and the rest of us are simple passersby.

mesmertech · 2026-05-27T22:01:12 1779919272

Yep sorry was just pulling it out my rear, not like a market trend that nearly every enterprise uses Anthropic or Openai models for coding or that Anthropic has had such ridiculous growth that they're 10x-ing year over year

dogleash · 2026-05-28T15:02:59 1779980579

I'm ribbing you for writing like a condescending guru that invalidates the evaluatory capability of your peers. Not the meat of your evaluation (not to say that it's any good either, just that it's irrelevant).

smokel · 2026-05-27T17:37:49 1779903469

For coding assistance, I have tried OpenCode with several large open models through OpenRouter. All were fairly bad compared to Claude Opus. Could you provide some hints on how I should be holding these open models so that I might get more value out of them?

I agree with the common trope that open models lag behind by about a year, but something magical happened just around a year ago when the state of the art models became extremely useful. By this reasoning we're about to see open models perform well, but I'm afraid there is more to it than just waiting for another revolution around the sun.

Note, my application is coding assistance. Open models can be great for other purposes.

tariky · 2026-05-27T18:43:21 1779907401

I tried almost all OS models on OpenCode; none of them is on the same level as Opus 4.7.

In my latest experiment I used Opus for the implementation plan, then used Cursor Composer 2.5 for execution.

I must say that combo is really good. The main drawback of Claude Code is that it is super slow. So when paired with Composer, which is super fast, it flies.

cainxinth · 2026-05-27T19:24:42 1779909882

No one is claiming that OS is as good. They are saying it isn't that far behind SOTA commercial products. So why pay exorbitantly just to get something only a few percent better than the free option?

But there have been very good open source office apps for decades and few enterprises use them, so perhaps this is just the nature of B2B purchasing committees and 'nobody getting fired for buying IBM.'

Alex-Programs · 2026-05-28T15:37:03 1779982623

Because failures compound. My productivity has substantially improved since I switched from open models to a Codex subscription, because it doesn't need hand holding, and it doesn't pull stupid tricks occasionally.

slopinthebag · 2026-05-27T19:09:56 1779908996

Do more planning yourself, be smart about the context, break down tasks into smaller components, give it more guidance. You can't just lazily prompt it to complete large features autonomously and expect good results.

amilios · 2026-05-27T19:26:48 1779910008

But if the closed-source models can do this without the additional effort, that's a significant gap, no?

bigfishrunning · 2026-05-27T19:42:00 1779910920

See that's the thing, they can't. Every model needs hand holding and guidance.

amilios · 2026-05-27T19:57:23 1779911843

some require less hand-holding than others though

myaccountonhn · 2026-05-28T10:28:39 1779964119

No one is trying to argue that OS models are better than Opus 4.7. It's simply that they're good enough and cheaper.

10000truths · 2026-05-27T19:40:03 1779910803

The point is that the price gap is so much larger than the capability gap, that even with the extra compute needed to make up for the lack of capability, you can still come out ahead in terms of amortized $/work done.
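A back-of-the-envelope sketch of that amortized cost, assuming each attempt succeeds independently with a fixed probability, so the expected number of attempts is 1/p. The function name, prices and success rates are made-up placeholders for illustration, not measurements of any model.

```python
def cost_per_completed_task(price_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one successful result, retrying until success.

    Assumes independent attempts with a fixed success probability
    (geometric distribution), so expected attempts = 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_attempt / success_rate


# Hypothetical numbers, chosen only to illustrate the shape of the argument.
frontier = cost_per_completed_task(price_per_attempt=1.00, success_rate=0.90)
open_model = cost_per_completed_task(price_per_attempt=0.10, success_rate=0.60)

print(f"frontier:   ${frontier:.2f} per completed task")
print(f"open model: ${open_model:.2f} per completed task")
```

With these placeholder inputs the cheaper model comes out ahead despite needing more retries. The sketch prices only compute; it leaves out the human time spent catching and reviewing failed attempts, which can change the result.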

flexagoon · 2026-05-27T20:04:37 1779912277

Is it really when they are hundreds of times more expensive?

eikenberry · 2026-05-27T19:44:28 1779911068

That is the 3-6 month sota-open gap people talk about, a time-window that continues to move as new models are released on both sides.

grttq · 2026-05-27T22:39:48 1779921588

Do you know what economic trade offs are?

Both implicit and explicit..?

eikenberry · 2026-05-27T19:38:44 1779910724

+1 .. just wanted to reiterate that this is the answer. The open models work great if you just do a little more of the design/architectural work up front and organize your work appropriately.

aniceperson · 2026-05-27T20:52:51 1779915171

a good harness is supposed to do what you are describing. sonnet on pi.dev is pretty terrible but fast. Claude Code has ridiculous amounts of prompt engineering at the system prompt level and sub-session spawning combined with low temperature, to provide the predictable results people like. CC screws up and you never see it, because the harness auto-corrects, while with OSS you see everything, and it does not come with that level of monitoring by default.

doug_durham · 2026-05-27T19:53:24 1779911604

GLM-5.1 isn't just as good. It is no match for Opus running in Claude Code. Please try it yourself. Open source models are about a year behind at least.

jeremyjh · 2026-05-27T23:22:17 1779924137

This is profoundly misinformed. I use all three of those models regularly and the difference is just not that big anymore. GLM 5.1 is at least as good as Opus 4.5 - when it’s my dime it’s the primary model I use and switch to GPT 5.5 for planning and review but it’s also very capable at those things. If I had to pay API rates for everything there is no question I would only use GLM 5.1 (and Minimax for exploration tasks).

At work I mostly use Claude Code and a bit of Codex; personal projects are OpenCode and honestly I prefer it.

listless · 2026-05-28T03:27:34 1779938854

I would agree here. And in my experience Qwen 27B and Deepseek v4 are also extremely good.

None of them are quite opus, but they are damned close and a no brainer if you care at all about cost.

clhodapp · 2026-05-28T01:26:39 1779931599

In the second half of last year, I found that agentic coding with proprietary models (≈ vibe coding) reached the point where it actually speeds up my ability to deliver useful code at work. Before that, AI-based autocomplete definitely helped, but (despite the claims of the people selling AI coding tools) letting an agent author more than a file or so at a time (often a function or so at a time) required a very intricate plan or it would create a mess. Creating that plan or cleaning up the mess would take longer than just doing everything myself.

For me, it feels like widely available open models have recently crossed that same canyon. Are they as good as e.g. late-model Claude Opus? I don't think so. But they have absolutely gotten past the point where they are beneficial. This means that, for me, they are about six months behind.

jeremyjh · 2026-05-28T01:48:01 1779932881

Exactly this. GLM 5.1 is the first open model that I thought "actually worked" for agentic coding, which puts it in the same tier as Opus 4.5 - which was where I flipped.

osti · 2026-05-27T20:20:30 1779913230

For coding I wouldn't say a year: this time last year Claude or GPT definitely weren't able to do what GLM is able to do today. But easily 6 months, I'd say.

Not sure about other domains though.

_pdp_ · 2026-05-27T22:04:52 1779919492

A year behind is still very very very good at this price. ;)

RevEng · 2026-05-28T02:56:55 1779937015

I use composer-2 daily for complex programming tasks. It's a fine tuned Kimi 2.5 - nothing groundbreaking. I've even had reasonable success using Qwen 3.5 on my desktop GPU. Opus might be better, but it's certainly not necessary to get good results.

IAmGraydon · 2026-05-27T19:59:57 1779911997

The only way I see it working out for them is if some legislation is passed that eliminates the competition by making it illegal to run local models. They could claim that the models are dangerous and could be weaponized without oversight, or something along those lines.

locusofself · 2026-05-27T21:42:55 1779918175

Don't you need to spend 5-10 thousand USD to run these models that are "as good" as frontier models from 6-12 months ago? I haven't seen a convincing breakdown for ROI of running your own coding models. Especially against a $20 or even $200 plan
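A rough sketch of that break-even, using only the figures in this comment ($5-10k of hardware against a $20 or $200 monthly plan). It assumes the hardware fully replaces the plan and ignores electricity, depreciation, resale value and setup time; the function name is hypothetical.

```python
import math


def break_even_months(hardware_cost: float, monthly_plan: float) -> int:
    """Months of subscription fees that add up to the hardware price.

    Deliberately simplistic: no power, depreciation or maintenance costs.
    """
    if monthly_plan <= 0:
        raise ValueError("monthly_plan must be positive")
    return math.ceil(hardware_cost / monthly_plan)


# Dollar figures taken from the comment above.
for hardware in (5_000, 10_000):
    for plan in (20, 200):
        months = break_even_months(hardware, plan)
        print(f"${hardware} rig vs ${plan}/month plan: {months} months")
```

Under these assumptions, even the most favourable pairing ($5k of hardware against the $200 plan) takes about two years to pay back.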

IshKebab · 2026-05-27T22:00:58 1779919258

I assume you can run them in the cloud. $5-10k doesn't sound like remotely enough to run a not-shit model locally based on my experience.

ruben81ad · 2026-05-28T07:54:54 1779954894

I have 26 years of experience. I code using GLM-5.1. From time to time I switch to Codex / Claude, and honestly I don't understand why people use Claude or Codex. With the right prompting, GLM is awesome.

mock-possum · 2026-05-28T05:43:47 1779947027

Do you have a good source to refer to, to map out migration from Claude code to a cheap setup using small open source models like you’re describing? I’d certainly like to experience how good they’ve gotten.

arcanemachiner · 2026-05-28T18:12:58 1779991978

Download OpenCode and try OpenCode Go for a month. It's $5 USD for the first month. This will give you a taste of what the open model experience is like.

If you want to try the smaller models, just use them on an API service first. There is no model you can run locally (for less than $100k) that compares to GLM 5.1 though.

e2e4 · 2026-05-27T20:12:45 1779912765

Agree. Also reasonix with deepseek is super cheap and quality is only slightly worse (in my experience)

csomar · 2026-05-27T19:45:44 1779911144

They are both (and also spacex) sprinting for IPOs. They know that the opportunity window is closing fast and that advancement in model quality has largely plateaued in the last year. Take as much investor money as you can get away with for now.