I'm more curious about the caching: > (2) For all models, the input cache hit pr...

Palmik · 2026-05-23T08:25:06 1779524706

DeepSeek V4's KV cache is very efficient due to its heavily compressed and sparse attention architecture.

DeepSeek V3.2, which uses DSA only (sparse attention, but without the compression from HCA and CSA), is a smaller model but uses 10x more memory at a 1M context window compared to DS V4 Pro.

Also, I have to say, DeepSeek's API has a very good cache hit rate. With the same workload, I see a ~80% KV cache hit rate with the DS API vs ~50% with the major western inference providers for open-weight models.

maxdo · 2026-05-22T21:58:58 1779487138

Flash on its own is not a very competitive model; its pricing is within the range of everything else on the market.

Probably the most direct competitors of the Flash model:

GPT 5.4 mini

Cache Read $0.075 /M tokens

Gemini 3 flash :

Cache Read $0.05 /M tokens

i.e., nothing very magical or groundbreaking.

freehorse · 2026-05-22T22:37:39 1779489459

Cache read for dp4-flash is $0.0028 /M tokens, which is more than 10 times cheaper (and also much cheaper for cache miss and output tokens).
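To put numbers on "more than 10 times cheaper", here is a quick sketch of the ratios, using only the cache-read prices quoted in this thread. The dictionary keys and the `price_ratios` helper are just illustrative names, and the prices are taken as posted rather than checked against any price list.

```python
# Cache-read prices in USD per million tokens, as quoted in this thread.
CACHE_READ_USD_PER_MTOK = {
    "dp4-flash": 0.0028,
    "GPT 5.4 mini": 0.075,
    "Gemini 3 flash": 0.05,
}


def price_ratios(prices, baseline):
    """Return how many times more expensive each entry is than the baseline."""
    base = prices[baseline]
    return {
        name: price / base
        for name, price in prices.items()
        if name != baseline
    }


ratios = price_ratios(CACHE_READ_USD_PER_MTOK, "dp4-flash")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the dp4-flash cache-read price")
```

With the quoted prices that works out to roughly 27x for GPT 5.4 mini and roughly 18x for Gemini 3 flash.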

Have not actually compared it to other models, but I would not consider it in the same price range.

maxdo · 2026-05-23T02:47:13 1779504433

This price is only available if you are OK with sending your data to Beijing Volcano Engine Technology Co. For the rest of the OpenRouter vendors it is not the same.

csunoser · 2026-05-24T19:40:05 1779651605

Not sure why you are downvoted; this is essentially correct (assuming Volcano Engine tech refers to DeepSeek as the provider).

wolttam · 2026-05-23T03:54:04 1779508444

A big point of DeepSeek V4 is the significantly reduced KV cache size.

maxdo · 2026-05-22T21:51:16 1779486676

Sonnet : Cache Read $0.30

Gemini 3.5 flash : Cache Read $0.15

minimaxir · 2026-05-22T21:59:43 1779487183

For Sonnet, that's 10% of input cost (and requires paying for the cache)

For Gemini 3.5 Flash, it's also 10% of input cost.

Which is why 2%/0.8% change the economics in a meaningful way, given the input/cache-heavy way agents operate.
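A rough sketch of why the ratio matters so much for agents: when nearly all input tokens are cache reads, the input-side bill scales almost directly with the cache-read ratio. The workload below (1M uncached input tokens, 100M cache-read tokens) is a made-up illustration, not a measurement, and the cost is expressed in units of the uncached input price so no specific price list is assumed. Output tokens and any cache-write fees are ignored.

```python
def relative_input_cost(uncached_mtok, cached_mtok, cache_read_ratio):
    """Input-side cost in units of the uncached price per million tokens.

    cache_read_ratio is the cache-read price divided by the uncached
    input price (e.g. 0.10 for "10% of input cost").
    """
    return uncached_mtok + cached_mtok * cache_read_ratio


# Hypothetical agent workload, in millions of tokens.
UNCACHED_MTOK = 1.0
CACHED_MTOK = 100.0

costs = {
    ratio: relative_input_cost(UNCACHED_MTOK, CACHED_MTOK, ratio)
    for ratio in (0.10, 0.02, 0.008)
}
for ratio, cost in costs.items():
    print(f"cache read at {ratio:.1%} of input price -> {cost:.1f} units")
```

Under these assumptions the 10% case costs 11 units, the 2% case 3 units, and the 0.8% case 1.8 units, so the cache-read ratio ends up dominating the input bill.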

throwdbaaway · 2026-05-23T01:28:45 1779499725

And their disk-based caching is amazing. I had a long 700k-context session spanning more than a week, with pauses in between that were longer than a day, and some rewinds mixed in as well.

Stats from pi:

↑400k ↓438k R432M 71.9%/1.0M

Half a billion tokens, $2.12

kingstnap · 2026-05-23T01:14:19 1779498859

Anthropic's caching requires you to pay $0.75/MTok for Sonnet and $1.25/MTok for Opus as a surcharge on top of the original input token cost. It's not even automatic.

If you are reading ~8 times (8 total back-and-forth tool calls), that means cache reads in some sense cost ~$0.4/MTok (amortizing the write surcharge over all reads).

It's really quite ridiculously expensive considering that what you are paying for is some residence in VRAM that sometimes gets offloaded to NVMe.
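The ~$0.4 figure can be reproduced with a small sketch. It assumes the Sonnet cache-read price of $0.30 per million tokens quoted elsewhere in this thread, the $0.75 write surcharge above, and that one cache write is followed by 8 reads; `amortized_read_cost` is just an illustrative helper name.

```python
def amortized_read_cost(read_price, write_surcharge, reads):
    """Effective USD per million tokens for each cache read.

    Spreads a single cache-write surcharge evenly over `reads` reads
    and adds it to the per-read price.
    """
    if reads <= 0:
        raise ValueError("reads must be positive")
    return read_price + write_surcharge / reads


# Prices as quoted in the thread; 8 reads per write is the assumed workload.
sonnet_effective = amortized_read_cost(
    read_price=0.30, write_surcharge=0.75, reads=8
)
print(f"${sonnet_effective:.3f} per million tokens")
```

That gives about $0.394 per million tokens per read. Fewer reads per write push the effective price up quickly: with a single read it would be $1.05.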

maxdo · 2026-05-22T21:53:51 1779486831

GPT 5.4 Cache Read ≤272K $0.25

And it's multimodal, and available at whatever rate limits you might imagine.