
BiLLM is about post-training quantization, while BitNet trains models from scratch. You do realize that one of those is going to give significantly worse results and the other is going to be significantly more expensive, well into the millions of dollars?



There's no mathematical reason why quantizing after training would be worse than training quantized from scratch. That's nonsense.

There's also no practical reason. Training quantized networks is often harder! This is why people quantize after the fact or do distillation.
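
To make that concrete, here's a rough sketch in PyTorch (illustrative only, not code from either paper) of the two regimes: post-training quantization is a single pass over a finished checkpoint, while quantization-aware training has to fake a gradient for round(), whose true derivative is zero almost everywhere, typically with a straight-through estimator.

    import torch

    def uniform_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
        # One-shot post-training quantization of an already-trained weight tensor.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    class RoundSTE(torch.autograd.Function):
        # Round in the forward pass; pass the gradient straight through in backward.
        @staticmethod
        def forward(ctx, x):
            return torch.round(x)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    def qat_weight(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
        # Weight as seen during quantization-aware training: quantized forward pass,
        # full-precision gradient via the straight-through estimator.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return RoundSTE.apply(w / scale).clamp(-qmax, qmax) * scale

Getting the second version to converge is the hard part, which is the practical reason people reach for post-hoc quantization or distillation.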

Nor is there any reason to think we won't find some projection of pretrained weights onto the BitNet manifold.
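
For what that projection might look like: BitNet b1.58 constrains weights to {-1, 0, +1} times a per-tensor scale, so the naive projection of pretrained weights is just an absmean ternary quantizer applied once. The sketch below is illustrative, and whether the projected model keeps its quality is exactly the experiment I'm asking for.

    import torch

    def project_to_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        # Nearest point in {-scale, 0, +scale} per element, with an absmean scale.
        scale = w.abs().mean().clamp(min=eps)
        return torch.round(w / scale).clamp(-1, 1) * scale

    # Example: project one pretrained linear layer in place.
    layer = torch.nn.Linear(4096, 4096)
    with torch.no_grad():
        layer.weight.copy_(project_to_ternary(layer.weight))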

If it had been published by academics, I'd believe the cost argument.

This was published by MS. They can run this experiment trivially. I have friends at MS with access to enough compute to do it in days.

Either the authors ran it and saw that it doesn't work, or they're playing with us. Not a good look. The reviewers shouldn't have accepted the paper in this state without an explanation of why the authors can't do this.

This is the question that determines whether this work matters or is useless. Publishing before knowing the answer isn't responsible on anyone's part.


After Llama 3, does this paper’s result seem so far-fetched? That 8B-parameter model showed that most of what the frontier models “know” can be represented much more compactly. So why couldn’t it also be represented at low precision?





