> Finding that 70% of attention heads and 20% of feed-forward networks can be ex...

sdenton4 · on Dec 15, 2023

Here's an explanation I hit on some time ago:

More parameters make it easier to find low-energy solutions.

Suppose we have a product of two variables z = x * y. And now suppose that the 'correct' product is z=2, and we're learning x and y. A very good analytical solution is x=1, y=2 (or vice versa) allowing us to eliminate either x or y from our learning problem. The total energy of (x, y) in this case, taking energy to be the squared distance from the origin, is 1^2 + 2^2 = 5.

However, another solution is x = y = sqrt(2), which has energy 2 + 2 = 4: this solution is closer to the origin. The extra variable means that we have a /surface/ of solutions instead of a unique solution, so we can home in on ones that are easier to get to using our optimizer.
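A minimal sketch of the two-variable example, assuming "energy" means the squared norm x^2 + y^2 and assuming plain gradient descent on the squared error of x * y against the target. The starting point (0.1, 0.2), the learning rate, and the step count are arbitrary illustrative choices, not anything from the comment.

```python
def energy(x, y):
    # Assumed definition: squared distance from the origin.
    return x**2 + y**2

def descend(x, y, target=2.0, lr=0.01, steps=5000):
    # Plain gradient descent on 0.5 * (x*y - target)^2.
    for _ in range(steps):
        err = x * y - target
        grad_x, grad_y = err * y, err * x
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y

# Start near (but not at) the origin, as a stand-in for a small random init.
x, y = descend(0.1, 0.2)

print("analytic (1, 2):      energy", energy(1, 2))
print("gradient descent:     x, y =", x, y)
print("gradient descent:     energy", energy(x, y))
```

Under these assumptions the optimizer lands close to the balanced point on the solution curve x * y = 2 rather than at (1, 2), which is the behaviour the comment describes.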

As you add more variables, you can find lower and lower energy solutions.

Consider that we initialize neural networks 'near' zero, and then walk with gradient descent in some direction towards a solution. Then adding lots of extra variables - wiggle room - makes it much easier to find a solution within walking distance of the (noisy) origin.
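A rough sketch of the "more variables, lower energy" idea in a setting where it can be checked by hand. It assumes a simpler additive model than the product above: n weights whose sum should equal 2, all starting exactly at zero (no noise), trained with plain gradient descent. The function names and hyperparameters are made up for illustration, and whether the same trend holds for other parameterizations is not something this sketch shows.

```python
def descend_sum(n, target=2.0, lr=0.01, steps=5000):
    # n weights, all initialized at the origin.
    w = [0.0] * n
    for _ in range(steps):
        err = sum(w) - target
        # Gradient of 0.5 * (sum(w) - target)^2 is err for every weight.
        w = [wi - lr * err for wi in w]
    return w

def squared_norm(w):
    # Assumed definition of "energy": squared distance from the origin.
    return sum(wi * wi for wi in w)

energies = {n: squared_norm(descend_sum(n)) for n in (1, 2, 4, 8)}
for n, e in energies.items():
    print(n, "weights -> energy", e)
```

In this toy case every weight ends up at 2/n, so the energy of the solution found is 4/n and shrinks as weights are added.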

bjornsing · on Dec 15, 2023

Fits with my first intuitive guess: it’s implicit regularization (that works as you describe).

Would be interesting to try some explicit regularization. But unfortunately you need a million bucks to run an experiment on LLMs. :/

gessha · on Dec 15, 2023

Do you know of any literature that looks into this? This is a pretty interesting hypothesis.

youngNed · on Dec 15, 2023

Because 70% of a big number is a lot more than 70% of a smaller number?

Not being facetious, I don't know the answer, but that's my best guess

danielmarkbruce · on Dec 15, 2023

lottery ticket hypothesis might be real

famouswaffles · on Dec 15, 2023

all LLMs are undertrained to some degree.

Assuming the models are identical except that one is bigger, the bigger model is better because 70% of a bigger number is larger than 70% of a smaller number.

Now if you train a smaller model much longer than the bigger model (more tokens) then you are reducing the level of "under-trainedness" to some degree. At some point, you may have a smaller model that is better than that larger model.

70% of a bigger number may be larger than 70% of a smaller number, but there's no guarantee that 70% of a bigger number is larger than, say, 90% of a smaller number, and so on.

sodality2 · on Dec 15, 2023

If they're all equally un-pruned, sounds like they still maintain their linear scale of performance.

Just like quantization!

cubefox · on Dec 15, 2023

How does this answer the question?

danielmarkbruce · on Dec 15, 2023

The question sort of implies you couldn't prune the smaller models and see the same thing. So, the answer given is to consider that in both cases, you sort of only use 30% of the model. Bigger is still bigger. The basic intuition of more parameters = better holds.

cubefox · on Dec 15, 2023

The question seems to be about the Chinchilla scaling law. We could train smaller models (than recommended by the law) but with more training tokens, and achieve the same loss. But we would need more compute for that.

So the question is, perhaps: Why are big models required for compute-efficient training?
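A hedged sketch of that trade-off, assuming a parametric loss of the form L(N, D) = E + A / N^alpha + B / D^beta and the common approximation that training compute is about 6 * N * D FLOPs. The constants below are illustrative placeholders of the kind reported in published fits, and the 1e23 budget and the "half the parameters" comparison are arbitrary choices; the helper names are made up.

```python
# Illustrative constants for L(N, D) = E + A / N**ALPHA + B / D**BETA (assumed).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n, d):
    return E + A / n**ALPHA + B / d**BETA

def optimal_params(compute):
    # Minimizes loss subject to 6 * N * D = compute (closed form for this loss).
    k = compute / 6.0
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    return g * k ** (BETA / (ALPHA + BETA))

def tokens_for_loss(n, target):
    # Tokens a model of size n needs to reach the target loss, if reachable.
    remainder = target - E - A / n**ALPHA
    if remainder <= 0:
        raise ValueError("model too small to reach this loss at any token count")
    return (B / remainder) ** (1.0 / BETA)

budget = 1e23                        # FLOPs, arbitrary
n_opt = optimal_params(budget)
d_opt = budget / (6.0 * n_opt)
target = loss(n_opt, d_opt)

n_small = n_opt / 2                  # half the parameters
d_small = tokens_for_loss(n_small, target)
compute_small = 6.0 * n_small * d_small

print("same loss, compute ratio small/optimal:", compute_small / budget)
```

Under this form of the loss, the half-size model can match the target loss only by training on more tokens, and the extra tokens cost more compute than the smaller size saves.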

danielmarkbruce · on Dec 16, 2023

I don't think so. Pruning a large model and training a smaller model isn't the same thing. It might appear to be the same thing, but it's not.

cubefox · on Dec 16, 2023

Do you expect a model which was overtrained (relative to the Chinchilla law) to be no more affected by pruning than a model of the same size that wasn't overtrained?

danielmarkbruce · on Dec 16, 2023

Can you reformulate this question? It's hard to know what you mean when you say "no more affected". How are you defining "more" ?

cubefox · on Dec 16, 2023

I mean stronger impact on loss or benchmark results.

danielmarkbruce · on Dec 16, 2023

I mean, in some relative (like 10%) or some absolute amount? I think I'd expect the "more trained" model to drop performance by less (as a %, which is hard to define here) but more in an absolute sense. Which is basically impossible to measure, but even if it were measurable... I don't feel confident about that prediction, it's speculation.

bjornsing · on Dec 15, 2023

I don’t follow…

sodality2 · on Dec 15, 2023

The article mentions that models have a lot of extra information that is unnecessary. You asked why the large ones still outperform small ones. Presumably they all have that inefficiency. But the large ones are still better. 30% of a big number is still bigger than 30% of a small number.

jncfhnb · on Dec 15, 2023

Undertrained does not mean bad. It means it could be better.

But I also disagree with the takeaway.