
>Model merging is, to me, one of the most counterintuitive empirical results in modern deep learning. It turns out that you can actually more-or-less naively merge the weights of two different models and produce a new one that captures some or all of the abilities of its parents!

I would hope the article gives some more details on model merging. Is it merging two different fine-tuned models, one fine-tuned on dogs, another fine-tuned on cats, so that the merged model is good at both cats and dogs, as if by magic?

Like fine-tune one model just on Python and test it thoroughly, fine-tune another on Java and test it thoroughly, and then, if a project comes along that uses both Java and Python, merge the two together and use that. If there is no need for Java, use the one fine-tuned just on Python.

Pretty magical indeed! Let alone the fact that a separate, smaller model of half a billion parameters could figure out how to merge the two together. If the cost of LMs can be reduced by a factor of 100, why not by a factor of 1000?




This is not so surprising if you consider that fine-tuning is extremely sparse and imparts barely any new knowledge to the model. The paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch"[1] made this clear:

> We initially demonstrate that SFT LM (either encoder- or decoder-based) always tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE, when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank structures akin to LoRA [25]

Insofar as those adaptations are mostly distinct, you can simply preserve both sets, and I'd guess that's what explains the successes of merging.

1. https://arxiv.org/abs/2311.03099
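
The drop-and-rescale step the quote describes is simple enough to sketch in a few lines. A minimal illustration (not the paper's code), assuming models are plain dicts of numpy arrays with matching shapes:

```python
import numpy as np

def dare_merge(base, finetuned, p=0.9, seed=0):
    """DARE sketch: zero out delta parameters at drop rate p,
    rescale the survivors by 1/(1 - p), and add them back to the base.
    `base` and `finetuned` map parameter names to numpy arrays."""
    rng = np.random.default_rng(seed)
    merged = {}
    for name, w in base.items():
        delta = finetuned[name] - w          # what fine-tuning changed
        keep = rng.random(delta.shape) >= p  # keep roughly (1 - p) of deltas
        merged[name] = w + (delta * keep) / (1.0 - p)
    return merged
```

The rescaling keeps the expected value of each delta unchanged, which is why such aggressive dropping (p up to 0.99 in the paper) costs so little performance.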


Funnily enough, and not so coincidentally, this has been well known in practice by...drumroll please...America's greatest innovators, the Adult Entertainment Hobbyists.

It doesn't yield order-of-magnitude benefits, or I'd even wager 50%, in enabling smaller models. But you nailed it exactly: fine-tune on dogs, fine-tune on cats, then...just...average the weights. You get something better than the original, with minimal loss from fine-tuning.
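
"Just average the weights" really is the whole trick. A minimal sketch, again assuming both fine-tunes share the same base architecture and are represented as dicts of numpy arrays:

```python
import numpy as np

def average_merge(model_a, model_b, alpha=0.5):
    """Naive weight-space merge: elementwise interpolation of two
    fine-tunes of the same base model. alpha=0.5 is the plain average."""
    return {name: alpha * model_a[name] + (1.0 - alpha) * model_b[name]
            for name in model_a}
```

This only works when the two models descend from the same base checkpoint; averaging two independently trained networks generally does not land in a good region of weight space.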

LoRAs end up being more popular for that use case because they're easier to mix, match, and scale. Model merging is still a key technique for a successful base model.
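
The mix-and-match part is easy to see in code. A hypothetical sketch of stacking several LoRA adapters on one base weight matrix, each with its own scale (shapes follow the usual LoRA convention; the function name is made up for illustration):

```python
import numpy as np

def apply_loras(W, adapters, scales):
    """Combine LoRA adapters on a base weight W by a scaled sum of their
    low-rank updates: W + sum_i s_i * (B_i @ A_i).
    W is (d_out, d_in); each adapter is a pair (A, B) with
    A of shape (r, d_in) and B of shape (d_out, r)."""
    W_eff = W.copy()
    for (A, B), s in zip(adapters, scales):
        W_eff += s * (B @ A)  # rank-r update, scaled per adapter
    return W_eff
```

Because each adapter is just an additive low-rank term, turning one off, dialing its strength up or down, or swapping it for another is a matter of changing the scale list, which is exactly why they are more convenient than full-weight merges for this kind of mixing.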



