DeepSeek Introduces Vision

jiehong · 2026-06-18T07:39:33 1781768373

For those not trying, this allows Deepseek to understand a picture (instead of just extracting text from it), and it can describe what's in the picture, but this is not an image generation system, so you can't ask it to modify an image.

Personally, I'm a bit surprised the DS chat app still doesn't offer its own text to speech and speech to text features (I know DS doesn't have any ASR model for example, but there are quite a few in the open).

paulluuk · 2026-06-18T08:17:15 1781770635

Can you explain what the benefits are of actually "talking" with the bot instead of typing and reading?

As someone who would rather send a slack message to a coworker rather than actually walking over and talk to them, the idea of having to talk with my laptop is not appealing at all, haha.

weitendorf · 2026-06-18T10:05:57 1781777157

I thought this way until I tried it, and the main difference is that when I'm managing tons of agents at once or just reviewing some plan / approving next steps, or need to give quick feedback/ask a simple followup, the voice interface makes me much faster and more likely to continue because it's lower friction (and in many cases that's good, though not all) and can be hands-free.

Actually, my thoughts on this matter changed so much that it inspired me to get much more into voice controls because I realized how this same problem was basically why some people sucked at remote work or weren't able to properly use tools like claude code, because it was essentially the same problem but worse (typing / messaging feeling too high-friction or raising the barrier for participation). I have a way to let Claude call me now to tell me stuff when I have a bunch of instances out doing stuff and then leave to go home.

I'm trying to get that better integrated in my devloop because I think it makes managing >4 agents simultaneously much more feasible and natural for some people (I used to play Starcraft a lot so I'm used to the multitasking, but it still takes sustained willpower to be constantly "driving" or monitoring things, or to field questions), especially ones who have never served as TLs or people managers before. IMO it's a big performance roadblock for a lot of developers to be treat directing multiple agents simultaneously as some kind of high-stakes/high-cost thing. The kind of developer who would not say anything in a team meeting unless prompted or who thinks everything is stupid by default (because they are afraid of making decisions / being wrong even if only briefly) is both very common and reluctant to work this way, but also really probably needs it to be as productive as more skilled developers.

itake · 2026-06-18T09:15:35 1781774135

I am someone that prefers a slack message to a coworker than talking to them and I use AI.

My current flow is: Google Eloquent to capture 127WPM (my typing is best case is 65wpm). This lets me get the thoughts out without thinking too much about structure or flow, the same way I would brain-dump type it.

Next I use AI to compress, summarize, and restructure to create a clear coherent message for my peer to read (which is way faster for them).

When communicating with AI, its the same thing, except I skip the second step since AI does a good job at understanding my ramblings.

----

It drives me crazy that some cultures only send voice messages to each other. It drives me crazy they can't be respectful of my time and use STT+AI to convert their 90 second monologue to a few written sentences.

cicko · 2026-06-18T09:24:14 1781774654

If you spend your life sitting in a chair, that's fine. I tend to get all kinds of ideas, questions, and research needs while I'm walking around. Typing a paragraph or two or context takes too much time and is very risky. Especially when driving. But also just walking, cooking, cleaning, etc. Sometimes it's just not practical - winter, carrying stuff... I mostly feel privileged if I can just sit at a computer and type my question and have the time to read the answer.

QuantumNomad_ · 2026-06-18T09:22:05 1781774525

When I was still using OpenAI, I used it among other things to translate from English to Spanish while talking to Spanish-speaking people in person.

I understand a bit Spanish but I don’t speak Spanish yet, and they don’t speak English.

I speak English to the AI and end with “translate to Spanish, translation only”, and then the AI says the thing I was saying in Spanish (not perfect but good enough, and also it has a slightly weird accent that might be it using English or English influenced text to speech even when speaking Spanish sentences?).

perching_aix · 2026-06-18T09:40:59 1781775659

It's easier, faster, and more natural to talk than to type for the vast, vast majority of people.

This trivial fact of life is observed every day by e.g.:

- students taking notes and finding it necessary to only jot down key facts so that they can keep up,

- stenographers who require special training and equipment to keep up verbatim with live speech in the courtroom,

- annoying colleagues who insist on "hopping on a quick call" or arranging big, wasteful, and disruptive meetings instead of just writing down their problem / sending a message or email,

- friends who insist on sending short voice messages in DMs instead of typing, because it's more "personal" that way (which to be fair it is, but not to the extent proclaimed).

arcanemachiner · 2026-06-18T09:10:35 1781773835

Much faster and better flow. Don't knock it til you've tried it.

throawayonthe · 2026-06-18T08:59:49 1781773189

it's very confusing. maaaybe if the stt is good and fast enough, speaking may be faster? english speakers can probably hit 150-180 wpm but seems like a hassle

stranded22 · 2026-06-18T08:56:23 1781772983

Accessibility.

rcMgD2BwE72F · 2026-06-18T06:58:48 1781765928

Points to https://chat.deepseek.com/sign_in for me, that's just a login screen. Anything page with some info?

RIshabh235 · 2026-06-18T07:09:35 1781766575

Not in official news yet, but works for me https://files.catbox.moe/hnnnlx.png

dude250711 · 2026-06-18T08:40:47 1781772047

OP made a mistake, confused HN and https://www.reddit.com/r/DeepSeek .

bjoli · 2026-06-18T07:00:48 1781766048

What has been going on with deepseek recently? I have gotten lots of replies in Chinese and even more frequently, reasoning in Chinese as well.

Is it a new silent update?

throwa356262 · 2026-06-18T09:57:28 1781776648

Happened to me with Claude, doesn't need to be a China thing.

Shank · 2026-06-18T07:04:42 1781766282

Well, it is a Chinese model, maybe it thinks better in Chinese?

bogdan · 2026-06-18T08:14:50 1781770490

Hànzì can use 30%-40% fewer tokens than English. So, yes, it probably thinks better in Chinese.

Razengan · 2026-06-18T08:39:10 1781771950

If so, would other models like ChatGPT benefit from translating the user's prompt to Chinese/Japanese and thinking in Hanzi/Kanji and then converting the response back to the user's language before displaying it?

cocoflunchy · 2026-06-18T08:46:51 1781772411

I believe that most reasoning models actually think in their own "language" which is not really understandable by humans. The thinking traces that are shown in the UI are actually summaries generated by a smaller model in plain english (or user language). Sometimes this leaks through and you see some chinese/japanese characters in e.g. Claude's reasoning.

dryarzeg · 2026-06-18T09:19:53 1781774393

As far as I'm aware, it's not true for models like DeepSeek or other Chinese open-weight models (at least those that I have seen); their reasoning traces are fully composed from some human language, be it English, Chinese or another one; by the way, most of them can adapt their reasoning based on user language, for example, if user speaks English the reasoning more likely will be in English.

I think that for DeepSeek problem (thinking and replying in Chinese) everything is kinda simpler: in their official chat, they're probably using some kind of system prompt which is (probably) written in Chinese, so that's why model may prefer Chinese in it's output.

kgeist · 2026-06-18T09:27:13 1781774833

Summaries by different smaller models are usually made by closed proprietary models like Claude as a way to combat the distillation of real reasoning traces by competitors. Open weight models show the real reasoning traces. Reasoning traces operate in the same space as the non-reasoning output. It's all just one large text for an LLM. Internally, reasoning is just ordinary chat completion between <think></think> tags.

seydor · 2026-06-18T09:01:51 1781773311

> summaries generated

Or hallucinated

bogdan · 2026-06-18T08:47:46 1781772466

There are other even more efficient ways of doing this, i.e. using images instead of raw text https://xcancel.com/karpathy/status/1980397031542989305?lang...

grogg · 2026-06-18T08:48:51 1781772531

Yeah, it’s why the Caveman skill includes a Wenyan mode.

https://github.com/JuliusBrussee/caveman

k__ · 2026-06-18T09:23:29 1781774609

Maybe, you could pipe it through T5 or something.

cicko · 2026-06-18T09:25:47 1781774747

it's a hint that you should start learning the new Lingua Franca.

serf · 2026-06-18T08:03:00 1781769780

This happens to me a lot when I ask a qwen3.6 model to respond to a question in JSON. No clue why.

surgical_fire · 2026-06-18T08:00:40 1781769640

I use DeepSeek daily, never happened to me.

I use the API however, not the chat interface.

abyssin · 2026-06-18T07:05:38 1781766338

It doesn’t seem that recent to me, at least been like that for six months.

RIshabh235 · 2026-06-18T07:06:44 1781766404

yes, kind of silent update plus they might have better chinese datasets and user data for their training, that might be leading to chinese preference.

alfiedotwtf · 2026-06-18T07:39:26 1781768366

Are you running out of context? I’ve found that tooling and giberish most of the time happens when I’m butting up against the high watermark of my context window. One other thing it could be, I’ve read that lower quanta like Q1 and Q2 for smaller models can leak Chinese

epolanski · 2026-06-18T07:18:34 1781767114

It never happened to me with Deepseek, but it happened multiple times with Kimi 2.6.

It also happened a handful of times with Anthropic models.

throwaw12 · 2026-06-18T08:06:55 1781770015

I wish they published a post where we read about capabilities, quality, accuracy and other parameters

tornikeo · 2026-06-18T07:27:36 1781767656

I really need this as an API.

Turns out, to use Claude Agents SDK, you need to have a vision enabled API. If Deepseek API could see, it can fully drive Claude Code and Claude Agents SDK. A project I'm working on relies on a Claude-in-CloudflareWorker setup and I've been relying on Qwen and gemini flash lite, both more expensive than Deepseek.

Can't wait to have it available on deepseek.

5701652400 · 2026-06-18T09:15:36 1781774136

same here. I am using Gemini 2.5 Flash as VSCode "vision proivder" for Deepseek V4 Pro, but it is expensive and not accurate. can't wait for native Deepseek vision.

petesergeant · 2026-06-18T07:50:04 1781769004

Have you looked at MiniMax or MiMo? Available today via OpenRouter, and it’ll make the path to porting to DeepSeek a line change https://openrouter.ai/collections/vision-models

insumanth · 2026-06-18T09:21:56 1781774516

Multi-Modal is the way to go. Deepmind nailed this a long back.

Zababa · 2026-06-18T09:40:44 1781775644

Deepmind hasn't produced any frontier model since Gemini 3.0 pro though.

arjie · 2026-06-18T07:20:55 1781767255

If they'd do one of those little extraneous additions like Qwen does, so that I can have DS4 Flash with Vision that would be great. I've got to run a separate model entirely so that I can get vision and I'd prefer to just put it all in one space.

RIshabh235 · 2026-06-18T08:23:05 1781770985

Maybe they will do now as they got huge funding.

earth2mars · 2026-06-18T07:02:48 1781766168

And it's really good and fast. Have tested with bunch of odd photos on what is happening. Overall the training set seems large enough to know what's what and where

RIshabh235 · 2026-06-18T07:08:02 1781766482

yes and I hope their rate of shipping increases after recent funding.

crvdgc · 2026-06-18T06:57:37 1781765857

Vision has been in A/B testing for a while now (at least in China). Is there an official announcement that this will be available for everyone?

RIshabh235 · 2026-06-18T07:05:04 1781766304

I haven't seen any official announcement yet, works for me though.

innis226 · 2026-06-18T06:57:06 1781765826

Nice, is this available in the API now as well?

naseemali925 · 2026-06-18T07:14:52 1781766892

I am also waiting on the vision support in API. Its the only thing blocking me from buying their subscription.

dakolli · 2026-06-18T07:59:45 1781769585

What subscription?

naseemali925 · 2026-06-18T08:53:47 1781772827

I mean't topup. They don't have subsciptions.

RIshabh235 · 2026-06-18T07:02:22 1781766142

Not in the api yet.

alexwwang · 2026-06-18T08:47:48 1781772468

Does the api support vision yet?

RIshabh235 · 2026-06-18T08:57:06 1781773026

No announcements about it yet.

alexwwang · 2026-06-18T08:59:34 1781773174

That makes sense. I haven’t found it work in api yet.

tw1984 · 2026-06-18T08:35:44 1781771744

what is more interesting to me is why it takes so long for them to support vision.

does it implies that Liang believes vision/voice is less important on its way to AGI?

RIshabh235 · 2026-06-18T09:08:36 1781773716

Might be compute bottleneck due to the US chips act and migrating to Huawei ecosystem.

thiago_fm · 2026-06-18T09:17:39 1781774259

Just wait until they release their coding model. Once they do an Opus-level coding model, the sandcastle of the AI economy in the US will fall

el_io · 2026-06-18T09:32:15 1781775135

They had deepseek-coder.

andrewstuart · 2026-06-18T07:12:30 1781766750

OpenAI and Anthropic need to get this free foreign competition banned.

0xpgm · 2026-06-18T09:01:04 1781773264

Is that before or after the OpenAI and Anthropic pay off all the people and companies who's copyrights were violated when they used their works for free to train their models?

At least DeepSeek freely gives back the benefits.

epolanski · 2026-06-18T07:19:18 1781767158

Care to expand on why? Or did you forgot the /s at the end?

dudisubekti · 2026-06-18T07:29:02 1781767742

I feel like '/s' has ruined irony on the internet. Irony is at its best if left ambiguous, lol.

pjc50 · 2026-06-18T09:48:13 1781776093

Too many people have said too many stupid things entirely seriously.

cromka · 2026-06-18T08:02:31 1781769751

Nah, they're serious actually!

Weryj · 2026-06-18T07:41:02 1781768462

Wait, did that need a /s?

ReptileMan · 2026-06-18T07:47:17 1781768837

If everything goes to plan everyone involved with big US models will be trillionaire and everyone else will poor and unemployed. If there are open and cheap to run Chinese models (and please god silicon) the financial house of cards that we have build will fall, people involved with big US models will be poor and unemployed, and everyone else will be slightly less poor and unemployed than in the first scenario.

What is good for Dario is good for America.

andrewstuart · 2026-06-18T07:57:58 1781769478

Why do you think it’s free?

Any ideas, theories where they get their payoff?

cromka · 2026-06-18T08:01:54 1781769714

Yes, subscription options they sell on deepseek.com