rolisz on Dec 15, 2023 | on: Do large language models need all those layers?
Mistral's MoE model has 8*7B parameters, so you should compare its training time to that of similarly sized models, not to 7B ones.
versteegen on Dec 15, 2023
Mixtral 8x7B actually has 46.7B total parameters, not 8*7B = 56B, because not all parameters are replicated 8x: only the feed-forward expert blocks are, while the attention layers and embeddings are shared.
Also, it uses 12.9B parameters per token, which is not quite comparable to 7B models.
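For anyone who wants to check the arithmetic, here is a rough sketch of where the 46.7B and 12.9B figures could come from. The hyperparameters are not stated in this thread; they are assumed from the published Mixtral 8x7B configuration (32 layers, model width 4096, expert FFN width 14336, 8 KV heads, 32000-token vocabulary, 8 experts with 2 routed per token), and the function name is purely illustrative.

```python
# Hyperparameters assumed from the published Mixtral 8x7B config
# (none of these numbers appear in the thread itself).
DIM = 4096          # model (residual stream) width
N_LAYERS = 32       # transformer blocks
FFN_HIDDEN = 14336  # hidden width of each expert's feed-forward block
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key/value heads (grouped-query attention)
HEAD_DIM = 128      # per-head dimension
VOCAB = 32000       # vocabulary size
N_EXPERTS = 8       # experts per layer
TOP_K = 2           # experts routed per token


def mixtral_param_counts():
    """Return (total, active_per_token) parameter counts (hypothetical helper)."""
    # Attention: Wq and Wo are DIM x (N_HEADS * HEAD_DIM),
    # Wk and Wv are DIM x (N_KV_HEADS * HEAD_DIM).
    attn = 2 * DIM * N_HEADS * HEAD_DIM + 2 * DIM * N_KV_HEADS * HEAD_DIM
    # Each expert is a gated FFN with three DIM x FFN_HIDDEN matrices.
    expert = 3 * DIM * FFN_HIDDEN
    # Router: one linear layer scoring the experts.
    router = DIM * N_EXPERTS
    # Two RMSNorm weight vectors per block.
    norms = 2 * DIM
    shared_per_layer = attn + router + norms
    # Input embedding, output head, and the final norm.
    embeddings = 2 * VOCAB * DIM + DIM

    # Only the expert FFNs are replicated; everything else is shared.
    total = N_LAYERS * (shared_per_layer + N_EXPERTS * expert) + embeddings
    # Per token, only TOP_K of the experts in each layer are evaluated.
    active = N_LAYERS * (shared_per_layer + TOP_K * expert) + embeddings
    return total, active


if __name__ == "__main__":
    total, active = mixtral_param_counts()
    print(f"total:  {total / 1e9:.1f}B")
    print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the shared parts (attention, embeddings, norms, router) are counted once, which is why the total lands well below a naive 8x multiple of a 7B model.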