
> Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.

So why do the larger models perform so much better…?



Here's an explanation I hit on some time ago:

More parameters make it easier to find low-energy solutions.

Suppose we have a product of two variables z = x * y. And now suppose that the 'correct' product is z = 2, and we're learning x and y. A very good analytical solution is x = 1, y = 2 (or vice versa), allowing us to eliminate either x or y from our learning problem. Taking energy to be the squared distance from the origin, the total energy of (x, y) in this case is 1^2 + 2^2 = 5.

However, another solution is x = y = sqrt(2), which has energy 2 + 2 = 4: this solution is closer to the origin. The extra variable means that we have a /surface/ of solutions instead of a unique solution, so we can home in on ones that are easier to get to using our optimizer.

As you add more variables, you can find lower and lower energy solutions.

Consider that we initialize neural networks 'near' zero, and then walk with gradient descent in some direction towards a solution. Then adding lots of extra variables - wiggle room - makes it much easier to find a solution within walking distance of the (noisy) origin.
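A rough sketch of the two-variable example above, assuming "energy" means squared distance from the origin and using plain gradient descent on the squared error (x*y - 2)^2. The function names, learning rate, step count, and starting point are all made up for illustration. Started near zero, the optimizer should land close to the balanced solution rather than at (1, 2).

```python
import math

def energy(x, y):
    # Assumed definition: squared distance from the origin.
    return x**2 + y**2

def descend(x, y, target=2.0, lr=0.01, steps=5000):
    # Plain gradient descent on the loss (x*y - target)**2.
    for _ in range(steps):
        r = x * y - target           # residual
        gx, gy = 2 * r * y, 2 * r * x  # partial derivatives of r**2
        x, y = x - lr * gx, y - lr * gy
    return x, y

# The two hand-picked solutions from the comment.
print("energy at (1, 2):", energy(1.0, 2.0))
print("energy at (sqrt2, sqrt2):", energy(math.sqrt(2), math.sqrt(2)))

# Start near the origin (hypothetical initialization) and see where we end up.
x, y = descend(0.1, 0.2)
print("found:", x, y, "product:", x * y, "energy:", energy(x, y))
```

This only illustrates the two-variable case. It does not by itself show that energy keeps dropping as more variables are added.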


Fits with my first intuitive guess: it’s implicit regularization (that works as you describe).

Would be interesting to try some explicit regularization. But unfortunately you need a million bucks to run an experiment on LLMs. :/


Do you know of any literature that looks into this? This is a pretty interesting hypothesis.


Because 70% of a big number is a lot more than 70% of a smaller number?

Not being facetious, I don't know the answer, but that's my best guess


lottery ticket hypothesis might be real


All LLMs are undertrained to some degree.

Assuming the models are identical except that one is bigger, the bigger model is better because 70% of a bigger number is larger than 70% of a smaller number.

Now if you train a smaller model much longer than the bigger model (on more tokens), you reduce its level of "under-trainedness" to some degree. At some point, you may have a smaller model that is better than the larger one.

70% of a bigger number may be larger than 70% of a smaller number, but there is no guarantee that 70% of a bigger number is larger than, say, 90% of a smaller number, and so on.


If they're all equally un-pruned, sounds like they still maintain their linear scale of performance.

Just like quantization!


How does this answer the question?


The question sort of implies you couldn't prune the smaller models and see the same thing. So, the answer given is to consider that in both cases, you sort of only use 30% of the model. Bigger is still bigger. The basic intuition of more parameters = better holds.


The question seems to be about the Chinchilla scaling law. We could train smaller models (than recommended by the law) on more training tokens and achieve the same loss. But we would need more compute for that.

So the question is, perhaps: Why are big models required for compute-efficient training?
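To make that trade-off concrete, here is a back-of-the-envelope sketch. It assumes the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta with the constants as I recall them from the Chinchilla paper's fit, plus the common approximation of about 6*N*D training FLOPs. The target loss of 2.0, the list of model sizes, and the function names are arbitrary choices for illustration, so treat the numbers as rough.

```python
# Assumed constants for L(N, D) = E + A / N**ALPHA + B / D**BETA
# (recalled from the Chinchilla parametric fit; not verified here).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(n_params, target_loss):
    # Solve the loss formula for D; None if this size can never reach the target.
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return None
    return (B / gap) ** (1.0 / BETA)

def train_flops(n_params, n_tokens):
    # Common rule of thumb: roughly 6 FLOPs per parameter per token.
    return 6.0 * n_params * n_tokens

def sweep(target_loss, sizes):
    # For each model size, the tokens and compute needed to hit the same loss.
    rows = []
    for n in sizes:
        d = tokens_for_loss(n, target_loss)
        if d is not None:
            rows.append((n, d, train_flops(n, d)))
    return rows

sizes = [2e9, 5e9, 1e10, 2e10, 5e10, 1e11, 2e11]  # hypothetical model sizes
rows = sweep(2.0, sizes)
best = min(rows, key=lambda r: r[2])  # cheapest way to reach the target loss

for n, d, c in rows:
    print(f"params={n:.0e} tokens={d:.2e} flops={c:.2e}")
print(f"cheapest size in this sweep: {best[0]:.0e} params")
```

Under these assumptions, the smallest model in the sweep can still reach the target loss, but only with far more tokens and far more total compute than the cheapest size.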


I don't think so. Pruning a large model and training a smaller model isn't the same thing. It might appear to be the same thing, but it's not.


Do you expect a model which was overtrained (relative to the Chinchilla law) to be no more affected by pruning than a model of the same size that wasn't overtrained?


Can you reformulate this question? It's hard to know what you mean when you say "no more affected". How are you defining "more" ?


I mean stronger impact on loss or benchmark results.


I mean, by some relative amount (like 10%) or some absolute amount? I think I'd expect the "more trained" model to drop in performance by less as a percentage (which is hard to define here) but by more in an absolute sense. That's basically impossible to measure, and even if it were measurable, I don't feel confident about that prediction; it's speculation.


I don’t follow…


The article mentions that models have a lot of extra information that is unnecessary. You asked why the large ones still outperform small ones. Presumably they all have that inefficiency. But the large ones are still better. 30% of a big number is still bigger than 30% of a small number.


Undertrained does not mean bad. It means it could be better.

But I also disagree with the takeaway.



