
BiLLM is about post-training quantization, while BitNet trains models from scratch. You do realize that one of those is going to give significantly worse results and the other is going to be significantly more expensive, well into the millions of dollars?



There's no mathematical reason why quantizing after training would be worse than training quantized from scratch. That's nonsense.

There's also no practical reason. Training quantized networks is often harder! This is why people quantize after the fact or do distillation.
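
To make that concrete, here's a rough sketch in PyTorch (illustrative only, not code from either paper) of the two regimes: post-training quantization is a single pass over a finished checkpoint, while quantization-aware training has to fake a gradient for round(), whose true derivative is zero almost everywhere, typically with a straight-through estimator.

    import torch

    def uniform_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
        # One-shot post-training quantization of an already-trained weight tensor.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    class RoundSTE(torch.autograd.Function):
        # Round in the forward pass; pass the gradient straight through in backward.
        @staticmethod
        def forward(ctx, x):
            return torch.round(x)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    def qat_weight(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
        # Weight as seen during quantization-aware training: quantized forward pass,
        # full-precision gradient via the straight-through estimator.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return RoundSTE.apply(w / scale).clamp(-qmax, qmax) * scale

Getting the second version to converge is the hard part, which is the practical reason people reach for post-hoc quantization or distillation.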

Nor is there any reason to think we won't find some projection of pretrained weights onto the BitNet manifold.
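
For what that projection might look like: BitNet b1.58 constrains weights to {-1, 0, +1} times a per-tensor scale, so the naive projection of pretrained weights is just an absmean ternary quantizer applied once. The sketch below is illustrative, and whether the projected model keeps its quality is exactly the experiment I'm asking for.

    import torch

    def project_to_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        # Nearest point in {-scale, 0, +scale} per element, with an absmean scale.
        scale = w.abs().mean().clamp(min=eps)
        return torch.round(w / scale).clamp(-1, 1) * scale

    # Example: project one pretrained linear layer in place.
    layer = torch.nn.Linear(4096, 4096)
    with torch.no_grad():
        layer.weight.copy_(project_to_ternary(layer.weight))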

If it had been published by academics, I'd believe the cost argument.

This was published by MS. They can run this experiment trivially. I have friends at MS with access to enough compute to do it in days.

Either the authors ran it and saw that it doesn't work, or they're playing with us. Not a good look. The reviewers shouldn't have accepted the paper in this state without an explanation of why the authors can't do this.

This is the question that determines whether this work matters or is useless. Publishing before knowing the answer isn't responsible on anyone's part.


After Llama 3, does this paper’s result seem so far-fetched? That 8B-parameter model showed that most of what the frontier models “know” can be represented much more compactly. So why couldn’t it also be represented at low precision?





