rolisz on Dec 15, 2023 | on: Do large language models need all those layers?

Mistral's MoE model has 8*7B parameters, so you should compare its training time to similarly sized models, not to 7B ones.

versteegen on Dec 15, 2023

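Mixtral 8x7B actually has 46.7B total parameters, not 8*7B = 56B. The reason is that not all parameters are multiplied 8x: only the feed-forward (expert) blocks are replicated, while the attention layers and embeddings are shared.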

Also, it uses only 12.9B parameters per token, since just 2 of the 8 experts are active for any given token, so it's not quite comparable to 7B models either.
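
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. The hyperparameters (hidden size 4096, FFN size 14336, 32 layers, 32 heads with 8 KV heads, 32000 vocab, 8 experts with 2 routed per token, untied output head) are assumptions taken from the published Mixtral 8x7B config, not from this thread, and `count_params` is just an illustrative helper name.

```python
# Rough parameter count for a Mixtral-style sparse MoE transformer.
# All default values are assumed from the published Mixtral 8x7B config.

def count_params(hidden=4096, ffn=14336, layers=32, heads=32, kv_heads=8,
                 vocab=32000, experts=8, experts_per_token=2):
    head_dim = hidden // heads
    # Attention is shared by all experts. Q and O are hidden x hidden;
    # K and V are smaller because of grouped-query attention.
    attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
    # One expert = gate, up and down projections of the feed-forward block.
    expert = 3 * hidden * ffn
    router = hidden * experts      # linear layer that scores the experts
    norms = 2 * hidden             # two RMSNorm weight vectors per layer
    shared_per_layer = attn + router + norms
    # Input embedding + untied output head + final norm.
    embed = 2 * vocab * hidden + hidden

    total = layers * (shared_per_layer + experts * expert) + embed
    active = layers * (shared_per_layer + experts_per_token * expert) + embed
    return total, active

total, active = count_params()
print(f"total:  {total / 1e9:.1f}B")   # all 8 experts counted
print(f"active: {active / 1e9:.1f}B")  # only the 2 routed experts counted
```

Under those assumptions this gives about 46.7B total and 12.9B active, matching the figures above. The gap to 56B is the attention and embedding weights, which exist once rather than 8 times.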