
Mistral's MoE model has 8*7B parameters, so you should compare its training time to similarly sized models, not to 7B ones.


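Mixtral 8x7B actually has 46.7B total parameters, not 8*7B = 56B. That's because not all parameters are multiplied 8x: only the feed-forward (expert) blocks are replicated, while the attention layers and embeddings are shared.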

Also, it uses only about 12.9B parameters per token, since each token is routed to 2 of the 8 experts, so it's not quite comparable to 7B models.
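
If you want to check the arithmetic yourself, here is a rough back-of-the-envelope sketch in Python. The layer sizes are my assumption, taken from the publicly released Mixtral 8x7B config (hidden size 4096, FFN size 14336, 32 layers, 32 heads with 8 KV heads, vocab 32000, untied output head); the function name `param_count` is just for illustration, and this counts weight matrices and norm vectors only.

```python
# Rough parameter count for a Mixtral-style sparse MoE transformer.
# All sizes below are assumptions taken from the public Mixtral 8x7B config.
HIDDEN = 4096      # model (residual stream) width
FFN = 14336        # inner width of each expert's feed-forward block
LAYERS = 32
VOCAB = 32000
N_HEADS = 32
N_KV_HEADS = 8     # grouped-query attention: fewer K/V heads than Q heads
N_EXPERTS = 8
TOP_K = 2          # experts actually used per token


def param_count(experts_counted: int) -> int:
    """Count parameters, including `experts_counted` experts per layer."""
    head_dim = HIDDEN // N_HEADS
    kv_dim = N_KV_HEADS * head_dim
    # Attention: Q and O are HIDDEN x HIDDEN, K and V are HIDDEN x kv_dim.
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
    # Each expert has three projections (gate, up, down).
    expert = 3 * HIDDEN * FFN
    # The router is a single linear layer scoring every expert.
    router = HIDDEN * N_EXPERTS
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN
    per_layer = attn + experts_counted * expert + router + norms
    # Input embedding plus an untied output head, plus the final norm.
    shared = 2 * VOCAB * HIDDEN + HIDDEN
    return LAYERS * per_layer + shared


total = param_count(N_EXPERTS)   # every expert counted
active = param_count(TOP_K)      # only the experts a single token touches

print(f"total:  {total / 1e9:.1f}B")   # 46.7B
print(f"active: {active / 1e9:.1f}B")  # 12.9B
```

Only the `expert` term scales with the number of experts; attention, embeddings, router and norms are counted once, which is why the total lands well under 56B.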



