I don't believe that quantization comes for free. Someone made the observation that llama3 models quantize "worse" than llama2 models do -- they suffer in quality from quantization far more.
My intuition is that a model which is undertrained suffers less from quantization, because the training process has not utilized each weight to its full potential. One of the key findings with llama, and why it punches above its weight for its size, is that they trained it for longer on a much larger dataset than was "optimal" according to the literature up to that point.
Putting two and two together, it seems that:
small model, lots of data, long training > large model + quantization
That basically, quantization is a lossy shortcut to the tail of training long. Amount and quality of data is, as always, the most important part about all of this.
While llama3-8b might be slightly more brittle under quantization, llama3-70b really surprised me and others[1] in how well it performs even in the 2..3 bits per parameter regime. It requires one of the most advanced quantization methods (IQ2_XS specifically) to work, but the reward is a SoTA LLM that fits on one 4090 GPU with 8K context (KV-cache uncompressed btw) and allows for advanced use cases such as powering the agent engine I'm working on: https://github.com/kir-gadjello/picoagent-rnd
For me it completely replaced strong models such as Mixtral-8x7B and DeepSeek-Coder-Instruct-33B.
How does it compare against unquantised Llama 3 8B at fp16? I’ve been using that locally and it’s almost replaced GPT4 for me. Runs in about 14GB of VRAM.
What, specifically, are you asking of these LLMs? "creative tasks" can be anything from programming to cooking recipes, so a tiny bit more specificity would be appreciated :)
I've used pretty much every major LLM out there for a specific type of creative writing, and none of them are as good at it as GPT4 with the exception of maybe Claude (Opus is actually probably even better regarding the sterility). Llama 3, even 70b, is definitely not better by any measure of actual quality - it's more random, at best.
Artificial neural networks work the following way: you have a bunch of “neurons” which have inputs and an output. A neuron’s inputs have weights associated with them; the larger the weight, the more influence the input has on the neuron. These weights need to be represented in our computers somehow, and usually people use IEEE 754 floating point numbers. But these numbers take a lot of space (32 or 16 bits).
So one approach people have invented is to use a more compact representation of these weights (10, 8, down to 2 bits). This process is called quantisation. Having a smaller representation makes running the model faster, because models are currently limited by memory bandwidth (how long it takes to read weights from memory); going from 32 bits to 2 bits potentially leads to a 16x speed-up. The surprising part is that the models still produce decent results, even when a lot of information from the weights was “thrown away”.
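To make that concrete, here's a minimal sketch of the simplest flavour (symmetric round-to-nearest to int8, one scale per row); real quantisation schemes are fancier, and the names and shapes here are just illustrative:

```python
import numpy as np

# A made-up weight matrix in float32 (4 bytes per weight).
w = np.random.randn(4096, 4096).astype(np.float32)

def quantize_int8(w):
    # One scale per row: maps the largest magnitude in each row to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scale).astype(np.int8)  # 1 byte per weight instead of 4
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction used when the weight is needed for a matmul.
    return q.astype(np.float32) * scale

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("fp32 bytes:", w.nbytes, "int8 bytes:", q.nbytes)    # ~4x less memory traffic
print("mean abs error:", float(np.abs(w - w_hat).mean()))  # small, but not zero
```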
Has anyone attempted to compress network size further by extracting the symmetry invariants of the network—e.g., those that correspond to the permutation invariance of node shufflings that leave the DAG unchanged?
I did a rough calculation, and as the precision of the scalar weights decreases, the information content of the specific network permutation becomes a much higher percentage of its overall size.
Depending on the particular neural network architecture, there may be other symmetries beside the symmetric group that also represent compressible redundancies.
For a 4096x4096 matrix, the permutation symmetry group has size 4096!. Stirling's approximation gives ln(4096!) ≈ 4096 ln(4096) - 4096 ≈ 30,000 nats ≈ 43,000 bits in total, which works out to about 10-11 bits per row of 4096 numbers. That is less than 0.003 bits per parameter saved.
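A quick sanity check of that arithmetic in plain Python (using the exact log-factorial rather than Stirling):

```python
import math

n = 4096
params = n * n                                  # weights in a 4096x4096 matrix
bits_saved = math.lgamma(n + 1) / math.log(2)   # log2(4096!) ~ 43,000 bits total

print(bits_saved / params)       # ~0.0026 bits per parameter
print(bits_saved / params * n)   # ~10.5 bits per row of 4096 numbers
```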
Depends on how far. Q8s are pretty much on par with f16/f32 weights. Q5/Q6s, if they drop off at all, barely drop 2%. Folks are running Q2, Q3, etc. because they're GPU poor; of course it's going to be terrible when you want to run a 70B model that's originally 140GB and all you have is 16GB of VRAM.
I've been thinking about how far we've come with large language models (LLMs) and the challenge of making them almost perfect. It feels a lot like trying to get a spaceship to travel at the speed of light.
We’ve made impressive progress, getting these models to be quite accurate. But pushing from 90% to 99.9999999% accuracy? That takes an insane amount of data and computing power. It's like needing exponentially more energy as you get closer to light speed.
And just like we can’t actually reach the speed of light, there might be a practical limit to how accurate LLMs can get. Language is incredibly complex and full of ambiguities. The closer we aim for perfection, the harder it becomes. Each tiny improvement requires significantly more resources, and the gains become marginal.
To get LLMs to near-perfect accuracy, we'd need an infinite amount of data and computing power, which isn't feasible. So while LLMs are amazing and have come a long way, getting them to be nearly perfect is probably impossible—like reaching the speed of light.
Regardless, I want to appreciate the progress we've made while also being realistic about the challenges ahead. What do you think? Is this a fair analogy?
"Data" isn't an inexhaustible resource, and also isn't fungible in the way energy is. Of the thousands of languages in the world, a fair chunk don't even have writing systems, and some have very few speakers left. Many are lost forever. Now ask the best llm trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages. You can't improve on that task by adding more sentences in English or by combining with learning on other modalities.
> Now ask the best LLM trained on "all the data" to translate some fragment of some isolate language not in its training set and not very related to existing languages.
If you give them the dictionary and grammar book as in-context instructions, it can do pretty well.
“Gemini v1.5 learns to translate from English to Kalamang purely in context, following a full linguistic manual at inference time. Kalamang is a language spoken by fewer than 200 speakers in western New Guinea. Gemini has never seen this language during training and is only provided with 500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences in context. It basically acquires a sophisticated new skill in the neural activations, instead of gradient finetuning.”
Synthetic data might be the answer if you're fine with any data, but I haven't come across many synthetic datasets that are of high quality, and if you want high quality output from an LLM, I'm not sure Tiny Stories et al. can provide that.
> Once, there was a girl who wanted to write a story. She thought and thought about what she could write about. She felt it was too boring to just write about trees and flowers. Suddenly, an idea came to her. She decided to write about her waist. She started to write about how her waist was round, and how it jiggled when she danced. Her story was so fun and exciting! She wrote about how she liked to put a belt around her waist and how it made her feel smarter. She even wrote a rhyme about her waist: "My waist is round and jiggly, And when I dance, it's so wiggly." The girl was so proud of the story she wrote. She was no longer bored - writing about her waist was much more fun!
Hardly high quality "story", and an LLM training on data like that won't have high quality output no matter how much you train it.
Edit: Another example from Tiny Stories, just because how fun they end up being:
> One day, a little boy named Jack was playing in his room. He decided to go and sit on his favourite chest. When he sat down, he noticed something unusual. The chest smelled smelly! Jack had never noticed a smelly smell before and he couldn't work out what it was. Jack's Mum heard him say 'That chest smells smelly', so she came into his room to see what was happening. When she saw the chest, she knew what was wrong. Jack's little puppy had been using the chest as a bed! His Mum scooped the naughty puppy up in her arms and took him outside. When the puppy was outside, the smelly smell went away. Jack was so relieved! He sat back down on the chest, and said 'Ahhh, much better!'
Do people really expect to be able to train on this and get high quality output? "Garbage in, garbage out", or however that goes...
>This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention).
>In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.
The point of TinyStories isn't to serve as an example of a sophisticated model, but rather to show that the emergent ability of producing coherent language can happen at smaller scales, and from a synthetic data set, no less. TinyStories is essentially the language model equivalent of a young child, and it's producing coherent language -- it's not producing grammatically correct nonsense like the famous "colorless green ideas sleep furiously" phrase from Chomsky.
>but I haven't come across many synthetic datasets that are of high quality
I'm not really sure what your personal experience has to do with the viability of synthetic data; it's already been proven to be a useful resource. For example, Meta directly stated this upon the release of their Llama 3 model:
>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3. We also leveraged synthetic data to train in areas such as coding, reasoning, and long context. For example, we used synthetic data to create longer documents to train on.
It's grammatically correct. The point is correct grammar despite the content being semantically nonsense; it's still not established how small a model can get while managing that. GPT-2's grammar was atrocious.
But it still might be worth it. A 90% accurate model will only successfully complete a task consisting of 10 subtasks 0.9^10 ≈ 35% of the time, while a 99% accurate model will do so 90% of the time, making the former useless but the latter quite useful.
Yes, but a 90% accurate model that's 10x faster than a 99% can be run 3x to achieve higher accuracy while still outperforming the 99% model, for most things. In order for the math to be in the big model's favor there would need to be problems that it could solve >90% of the time where the smaller model was <50%.
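Back-of-the-envelope for both claims, assuming independent errors (which is generous) and a simple best-of-3 majority vote for the reruns:

```python
# Chance of finishing a 10-subtask chain, assuming independent subtasks.
print(0.90 ** 10)  # ~0.35
print(0.99 ** 10)  # ~0.90

# Best-of-3 majority vote with a 90%-accurate model, per single answer:
p = 0.90
print(p**3 + 3 * p**2 * (1 - p))  # ~0.97, if the three errors are independent
```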
So far experiments say yes, with an asterisk. Taking ensembles of weak models and combining them has been shown to be able to produce arbitrarily strong predictors/generators, but there are still a lot of challenges in learning how to scale the techniques to large language models. Current results have shown that an ensemble of GPT3.5 level models can reach near state of the art by combining ~6-10 shots of the prompt, but the ensemble technique used was very rudimentary and I expect that much better results could be had with tuning.
Yes and no. We don't need an insane amount of data to make these models accurate, if you have a small set of data that includes the benchmark questions they'll be "quite accurate" under examination.
The problem is not the amount of data, it's the quality of the data, full stop. Beyond that, there's something called the "No Free Lunch Theorem" that says that a fixed parameter model can't be good at everything, so trying to make a model smarter at one thing is going to make it dumber at another thing.
We'd be much better off training smaller models for specific domains and training an agent that can use tools, DeepMind style.
> The problem is not the amount of data, it's the quality of the data, full stop. Beyond that, there's something called the "No Free Lunch Theorem" that says that a fixed parameter model can't be good at everything, so trying to make a model smarter at one thing is going to make it dumber at another thing.
My understanding is NFL only applies if the target function is chosen from a uniform distribution of all possible functions — i.e. the "everything" that NFL says you can't predict is more like "given this sequence from a PRNG (but we're not telling you which PRNG), infer the seed and the function" and less like "learn all the things a human could learn if only they had the time".
I think you are probably right, but if humans are at 99.9% (which seems very unlikely) I don't think it will be long before you can trust a model more than a human expert.
Really though, I think this line of thinking is better to revisit in 5 or so years. LLMs are still very new, and seemingly every day new optimizations and strategies are being found. Let's at least hit a plateau before assessing limitations.
>I don't think it will be long before you can trust a model more than a human expert.
You will never be able to trust an LLM more than a human expert. Because a human expert will use the best available tools (for example, LLMs), will understand "what the client wants" and will put the data in the right context. At best the human expert and the LLM will be indistinguishable, but I really doubt it. And I think it will take a long time.
> Because a human expert will use the best available tools (for example, LLMs), will understand "what the client wants" and will put the data in the right context.
You're not wrong, but when that happens, does it still count as "a human expert" doing it? A chess grandmaster is capable of using Stockfish, but it's not their victory when they do.
There’s also a rumor that models these days employ a large “safety” parachute behind their engines all the time. Some of these get so big that models become dumber right before your eyes.
I've said this before but (as a noob) I don't think cramming all human knowledge into a model is the correct approach. It should be trained enough to understand language so that it can then go search the web or query a database for answers.
The more certain the domain, the more that is possible. If you have a document database that you trust, great. For example a support desk's knowledge base. And especially if you have an escape valve: "Did this solve your problem? If not, let's escalate this to a human."
But if you are searching the Internet, you'll find multiple answers — probably contradictory — and the next step is to ask the model to judge among them. Now you want all the intelligence you can muster. Unless you really trust the search engine, in which case yeah a small model seems great.
Do we know that reasoning ability and inbuilt knowledge are coupled? It seems to me that having the reasoning ability sufficient to judge between search engine results might want a significantly different type of training than collecting facts.
Past 99%, what does "more accurate" mean? I think it will vary from person to person and use case to use case, which is why I personally don't foresee a world where an LLM or any form of AI/ML is ever perfectly accurate.
I'm struggling to think of any medium that has ever reached 100% accuracy, so to target that for an ML algorithm seems foolhardy
I agree with this. Because it does seem that if it's based on NOT 100% accurate information in terms of training, it can never return 100% accurate results. Which I guess, as humans, we don't either, but as a committee, one MAY argue we could. I'm torn lol.
Yeah, LLMs are just a nontrivial stepping stone. Humans don't need to consume the entire set of the world's knowledge, repeated from thousands of different mouths coming from different angles, to be able to learn to output human-like thought processes.
At some point we'll discover a new algorithm/architecture that can actually continuously learn from its environment with limited information and still produce amazing results like us.
Well, let's not forget that the large amount of information they ingest also leads to a superhuman level of knowledge, though I guess for certain kinds of agents that is not really needed anyway.
Post-training quantization doesn't come for free, but in the ternary BitNet paper, pretraining with constrained-precision weights counterintuitively results in better performance per parameter as the number of parameters grows.
Even if there were zero efficiency gain from ternary weights, large models should probably be trained on networks of precision-limited weights from here on out, given the research so far.
I suspect it comes down to each weight contributing to multiple 'features' in the network. The greater the precision, the more room it gives for competing features to compromise on node values that aren't best for either feature, instead of reorganizing the feature mapping to avoid conflicts.
The number of bits used per weight during training, could be included in the regularization, perhaps?
For instance, one could extend dropout regularization to several levels, where each weight could have random chances to include the most significant 2-16 bits part of the time (and still 0 part of the time), and where the impact on the gradient of having fewer bits could be used to tune the ideal number of bits for each weight.
Then one could add L1 regularization on the total number of bits used, to squeeze the total down to whatever size one aims for.
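I'm not aware of anyone doing exactly this; here's just a toy sketch of the idea as I read it (forward pass only, no gradient or straight-through-estimator machinery, every name hypothetical): each weight gets a per-weight bit budget, the forward pass rounds it to that budget (or drops it), and an L1-style term charges for the total bits spent.

```python
import numpy as np

rng = np.random.default_rng(0)

def round_to_bits(w, bits):
    # Keep roughly `bits` of signed precision, relative to the layer's max magnitude.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def bit_dropout_forward(w, bit_budget, p_drop=0.1):
    # bit_budget: integer bits (2..16) assigned to each weight.
    out = np.empty_like(w)
    for b in range(2, 17):
        mask = bit_budget == b
        out[mask] = round_to_bits(w, b)[mask]
    out[rng.random(w.shape) < p_drop] = 0.0   # ordinary dropout on top
    return out

def bit_budget_penalty(bit_budget, lam=1e-6):
    # L1-style charge on total bits, pushing the budget toward the target size.
    return lam * bit_budget.sum()

w = rng.standard_normal((256, 256)).astype(np.float32)
bit_budget = rng.integers(2, 17, size=w.shape)   # stand-in for a tuned per-weight budget
w_used = bit_dropout_forward(w, bit_budget)
extra_loss = bit_budget_penalty(bit_budget)
```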
> I'm surprised the comments here are so negative / cynical.
And by "the comments" I'm assuming you mean the top comment [0] (and maybe its replies)? The rest don't really come off as negative at all, and you quoted directly from that top comment.
FWIW, I don't find that comment either negative or cynical. It starts out with that sentence you quoted, but it goes on to make a very interesting point about quantization most likely working best for models which are undertrained—models which store less information than their architecture size would suggest. That's a very valid point that I found insightful and interesting, not cynical.
I totally agree with your statement. I consider LLMs imprecise in any case, it's not a perfect/exact science, just statistics. I only use LLMs for tasks where an error margin can be allowed.
Several years of ML research for CNNs indicate at least that one can do very well with 8 bit integers. Such quantization is basically standard now for any deployment outside of GPU (where 8bit isn't any faster anyway due to the hardware).
I feel like we are getting closer and closer to finding the Goldilocks LLM, one that with some smarter training and the right set of parameters will get us close to GPT-3.5 Turbo performance but at a size, cost, and time effort that is significantly lower and that is runnable locally.
Combine that with what seems like every chip adding a neural engine and it feels like we are in the early days of high performance graphics again. Right now we are in the Unreal/Voodoo era, where graphics cards/neural engines are expensive/rare. But give it a few generations and soon we can assume that even standard computers will have pretty decent NPUs, and developers will be able to rely on that for models.
Is this the level of performance people are relying upon? While I've always been impressed with the technology itself, it's only starting with GPT 4 that I think it approaches adequate performance.
For the work that I do (which is mostly RAG with a little bit of content generation), GPT-3.5-turbo-0125 with a 16k context window is the sweet spot for me. I started using the API when it was only a 4k context window, so the extra breathing room provided by the 16k context window feels cavernous. Plus the fact that it's $0.50 per 1 million tokens means that I can augment my software with LLM capabilities at a cost that is attractive to me as a small-time developer.
The way I rationalize it is that using 3.5-turbo is like programming on an 8-bit computer with kilobytes of RAM, and gpt-4o is like programming on a 64-bit computer with a 4080 Ti and 32GB of RAM. If I can make things work on the 8-bit system, they will work nicely on the more powerful system.
3.5-turbo performance wasn't very good though, and according to API statistics analysis it's an Nx7B model, so it's already rather small. Ultimately Llama-3-8B is already better in all measurable metrics except multilingual translation, but that's not saying much.
It's not called anything until the lmsys leaderboard ranks it. Microsoft's blatant benchmark overfitting on Phi-2 makes for very little trust in what they say about performance. As a man once said, fool me once, shame on you, fool me twice-can't get fooled again.
Lots of folks have the intuition that 'slightly worse' LLM models mean unacceptable rates of nonsense answers.
What's an acceptable rate? Is it 99%? 99.9%?
The closer it gets to 99.999% good answers, the more damaging the wrong ones become. Because people have been trained, too. Trained to trust the answers, which makes them lazy and vulnerable to lies.
They keep calling them 1-bit LLMs, but they're really 1-trit LLMs. If you can have 3 states, it's not a BInary uniT, it's a TRInary uniT.
I don't think that this is just a nit, it implies a real mismatch between how these models work, and how modern computing hardware works. People have built ternary computers in the past, but to the best of my knowledge nobody's made a serious attempt at it in half a century.
You can always use two bits to store a ternary value, of course. But then it wouldn't be a 1-bit LLM, now, would it? And that doubling in compute resources required would make a tangible difference in how one has to think about potential efficiency improvements. Also, ternary tritwise logic implemented on binary hardware is unlikely to be anywhere near as efficient as binary bitwise logic on hardware that was specifically engineered for the purpose. This leaves me thinking that the research teams' continued referring to these as 1-bit LLMs must be interpreted as knowingly engaging in dishonest marketing tactics.
I don't think it's quite so clear cut, and the papers and names are often a bit more precise anyway.
BitNet is (I think) actually one bit per parameter. BitNet 1.58b isn't, but then to be fair isn't describing itself as a 1 bit llm. I'm less sure but it seems OneBit is 1 bit. One of the processes mentioned in the article is a mix of one and two bits for the weights.
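For reference, log2(3) ≈ 1.58, which is where the "1.58" figure comes from; any real packed encoding lands a little above it. A toy base-3 packing (not BitNet's actual storage layout) fits five ternary weights into one byte, i.e. 1.6 bits per weight:

```python
import math
print(math.log2(3))   # ~1.585 bits of information per ternary weight

def pack5(trits):
    # Five values from {-1, 0, +1}; 3**5 = 243 states fit in one byte.
    b = 0
    for t in trits:
        b = b * 3 + (t + 1)
    return b

def unpack5(b):
    out = []
    for _ in range(5):
        out.append(b % 3 - 1)
        b //= 3
    return out[::-1]

w = [-1, 0, 1, 1, -1]
assert unpack5(pack5(w)) == w   # round-trips: 8 bits / 5 weights = 1.6 bits per weight
```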
If you have a trillion parameter 8-bit fp network or a trillion parameter 1.5-bit ternary network, based on the scaling in Microsoft's paper the latter will actually perform better.
A lot of the current thinking is that the nodes themselves act as superpositions for a virtualized network in a multidimensional vector space, so precision is fairly arbitrary for the base nodes and it may be that constraining the individual node values actually allows for a less fuzzy virtualized network by the end of the training.
You could still have a very precise 'calculator' feature in the virtual space no matter the underlying parameter precision, and because each parameter is being informed by overlapping virtual features, lower-precision nodes may even produce fewer unexpected errors and issues.
I believe you four are talking about different things; the models are executed on very good "calculators" (if you want to call the GPUs that), but themselves are not very good at being used as calculators.
LLMs are sufficiently good hammers that people see everything as a nail, then talk about how bad they are at driving screws.
People are trying to make these monolithic god models right now because everyone's chasing OpenAI's example with ChatGPT, but that approach is going to run out of steam for a lot of reasons. We're eventually going to end up with an agent model that does just what you say, recognizes specialized problem types such as math and calls the appropriate tool, such as a calculator or symbolic computation engine.
Hmm. An agent model for prompts we will ask K-12 math questions of? Questions that are ultimately math questions often take lots of context to trudge through. It takes a solid core of natural language as well. I think there is always the need for a god model, because we want to speak in natural language, as that's easiest for us.
I'm not a theoretician nor a scientist, but I train models on math and logical reasoning. Some very simple word problems contain tons of context, like family trees, implicit information, etc., and the process of step-by-step reasoning, not just the final answer, requires even more natural language processing.
You don't need a god model to do that. You need a model that understands natural language really well and is good at identifying/extracting subtasks within a prompt, and then it needs to be able to dispatch those subproblems to the right place.
Many models support function calling/RAG already, these are very similar features from a structural POV. But of course it's harder to train a model for such tasks, compared to just fitting some existing training set.
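As a rough sketch of that dispatch idea (everything here is hypothetical glue code, not any particular model's function-calling API): the model only has to emit a tool name plus arguments, and the surrounding program routes the call.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ToolCall:
    name: str
    arguments: dict

def calculator(args: dict) -> str:
    # Toy evaluator for arithmetic subtasks the model extracts from the prompt.
    return str(eval(args["expression"], {"__builtins__": {}}))

def search(args: dict) -> str:
    return f"(top results for: {args['query']})"   # stand-in for a real search backend

TOOLS: Dict[str, Callable[[dict], str]] = {"calculator": calculator, "search": search}

def dispatch(call: ToolCall) -> str:
    handler = TOOLS.get(call.name)
    return handler(call.arguments) if handler else "no matching tool; answer directly"

# e.g. the model turned "what's 1234 * 5678?" into a calculator call:
print(dispatch(ToolCall("calculator", {"expression": "1234 * 5678"})))
```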
I'm not sure having the LLM as the top-level piece is the right approach though. Async is the direction we want to go, and LLMs are inherently synchronous. Additionally, LLMs are cumbersome to train and tune; having an agent that calls smaller models would unlock the power of the community to build customized models for specific needs.
We would like to have something like us. My daughter is just 3.5 years old; she is inherently hard to train too, with decades left. However, we find ourselves quite good.
You're assuming you're a monolith, but in fact you have many subnetworks in your brain. The neocortex acts to process information from other parts of the brain, and is itself composed of a network of interacting modules.
It should just be able to program. A programming language is the ultimate verbal-to-math translation layer.
Tangent but... programming languages aren't designed for computers. Computers are perfectly happy with assembly or even binary. Programming languages are designed for humans, not just so we can see what others have done, but so that we can understand what we ourselves have done. We give the variable a name because we can't remember 0x0010101; but the computer remembers them both just fine.
Quantization is never free, and you can rest assured that even the "good" quants of the best models are highly crippled compared to their unquantized versions.
The "nearly as accurate" is only on their contrived benchmarks. I've never met a quantized model that actually behaved "98%" as good as the unquantized model, and I do LLM work daily and have since well before the ChatGPT era.
I've never met an LLM that consistently behaved well at all. Quantized or unquantized.
Honestly it feels like the bulk of the industry is acting out one big LARP where they're always right around the corner from developing AGI and doing something big and amazing and....it just never materializes. Obnoxious AI agents, worthless AI-generated websites crowding out useful resources, unreliable AI search results.
The AI industry has done very well for itself selling hype. Now it needs actual good products.
This makes 0 sense to me. I get tons of real world value and productivity benefits from AI, most specifically ChatGPT and cursor.sh.
I don't disagree that there is a ton of hype, a lot of it unwarranted, and I cringe when I see tons of companies trying to "throw AI against the wall and see if it sticks" (I personally nominate LinkedIn's AI blurbs on their feed as "most useless and annoying use of AI"). But still, I'm blown away with how much value I get from AI. It makes me a bit sad that so many of us have become so jaded that they see it as "one big LARP".
It's dismissive to call it jaded. I don't think you, me, or the people you're talking about are intellectually different. We just don't like what AI makes even if we think it's impressive.
Fair point, my apologies. I guess what I'm saying is that I agree that AI is an imperfect tool, but saying it's all hype feels like throwing out the baby with the bathwater.
If I am searching for something I don't know the answer to and don't have the luxury of trial-and-error for the information I'm given, I can't rely on an unreliable agent like ChatGPT (or literally any LLM for that matter).
ChatGPT could be giving me a correct answer. Or it could be blowing smoke up my ass.
I don't know which it is when I'm seeing an answer from ChatGPT!
What things do you work on that have meaningful consequences for failure? Woodworking? LLMs indeed might not be for you! I write code and there's zero consequences for failure as long as I don't push it in version control...
Except that's a million times slower. I'll often ask it to generate some complicated SQL queries for me (usually when I need to get info from system tables, less so my own data), and it's pretty trivially easy to verify the output. It would take me much, much longer if I had to write these queries from scratch.
I’m blown away when people say stuff like this. GPT-4 makes me substantially more productive almost every day. It’s like we live in a different reality.
>Honestly it feels like the bulk of the industry is acting out one big LARP where they're always right around the corner from developing AGI and doing something big and amazing and....it just never materializes.
It's been less than two years since ChatGPT released.
This is false. I have tested Q8s over f16 and the raw weights and pretty much seen no difference. I strictly run Q8s, I only keep the raw weights if I plan to do some fine tunes.
The performance loss from BiLLM is disastrous. It's basically useless. No one would ever want to use the resulting models. They hide their main results in the appendix: Table 8, page 15. https://arxiv.org/pdf/2402.04291
I won't go over the entire table in detail, but PIQA, BoolQ, HellaSwag, and WinoGrande should be in the mid-to-high 70s for LLaMa2-7B. They drop that to 58, 62, 32, and 51. There are 700M parameter models that perform much better.
What they should have reported is the effective number of parameters. Does LLaMa2-7B with their quantization method outperform a model that uses the same amount of compute but with, say, 16-bit quantization? If the answer is no, and it seems like it very clearly is, then the method is wholly worthless. Just use a smaller model to begin with.
The BitNet paper is better. But for some reason they only consider very small models. It's the obvious question and in their FAQ they don't provide a solid answer to it. Despite having all of the compute resources of MS. They could have easily run this experiment in the past year; I'm suspicious.
BiLLM is about post training quantization and BitNet trains models from scratch. You do realize that one of those is going to give significantly worse results and the other is going to be significantly more expensive, well into the millions of dollars?
There's no mathematical reason why doing quantization after would be worse than training from scratch. That's nonsense.
There's also no practical reason. Training quantized networks is often harder! This is why people quantize after the fact or do distillation.
Nor is there any reason why we won't find some projection of weights onto the BitNet manifold.
If it was published by academics I'd believe the cost argument.
This was published by MS. They can run this experiment trivially. I have friends at MS with access to enough compute to do it in days.
Either the authors ran it and saw it doesn't work or they're playing with us. Not a good look. The reviewers shouldn't have accepted the paper in this state without an explanation of why the authors can't do this.
This is the question that determines if this work matters or is useless. Publishing before knowing that isn't responsible on anyone's part.
After Llama 3, does this paper’s result seem so far-fetched? That 8B parameter model showed that most of what the frontier models “know” can be represented much more compactly. So why couldn’t it be represented at low precision?
This seems to be saying that the 1-bit model is a bit better than GPTQ Q2. However, I find there are few situations where you would want to use GPTQ Q2 in the first place. You would want to run the F16 version if you want quality, and if you want a sweet spot, you usually find something like Q5_K_M of the biggest model you can run.
Are you sure about that?
What are the benchmarks it fails on when set up like-for-like with GPU drivers?
Even still, it can do constrained grammar and is really easy to set up with wrappers like Ollama and the Python server too. I struggled to find support for that with other inference engines - though things change fast!
I know there are 1-bit experiments out there but from what I can tell it's BitNet 1.58 that's really exciting.
So why do people keep focusing on "1-bit" when the whole reason 1-trit models are so successful in the first place just might have everything to do with the ternary weights and the symmetry they're able to encode?
I'm not sold on the suggestion that "imprecise language models" in general are "nearly as accurate" when it might actually be that ternary weights are more precise in a totally different sense: they just might be capturing a minimal representation of the very symmetries responsible for making all those parameters effective in the first place.
Is there some work on how small a model with some specific epsilon perplexity could theoretically be? Given a fixed architecture and a fixed dataset, I presume there is a minimal number of parameters required for optimal representation.
If you are referring to what is theoretically possible with arbitrary computation in the model, it's called Kolmogorov complexity and it's not computable.
You can estimate it empirically. However, large changes in model parameters/capacity tend to interact with hyperparameters, so one would want to do runs with multiple values of the hyperparameters. And training processes give noisy results, so one might want to do multiple repetitions.
And each run may take several GPU days. So even a small experiment of 10 repetitions X 10 hyperparameters X 10 model sizes takes several thousand GPU days. But there are many papers from the large labs that do such.
And the whole result is also conditional on the optimization/training process used. Which is an area where we have no reason to think that we are optimal... So we can do studies with practical results (given sufficient money), but we are far from being able to identify the actual maximums available.
The closest research would be the Chinchilla scaling laws, which estimate the final loss as a function of the number of parameters and tokens. Setting the number of tokens to infinity would give a good estimate of the minimum achievable loss.
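Concretely, the Chinchilla fit has the form L(N, D) = E + A/N^alpha + B/D^beta; with the constants reported in that paper (roughly E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28, so treat the outputs below as rough estimates), sending D to infinity leaves E + A/N^alpha as the estimated floor for a given parameter count:

```python
# Chinchilla-style parametric loss fit; constants are the (approximate) values
# reported in the Chinchilla paper.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def loss_floor(n_params):
    # n_tokens -> infinity: the data term vanishes.
    return E + A / n_params**alpha

for n in (8e9, 70e9, 400e9):
    print(f"{n:.0e} params: estimated floor ~ {loss_floor(n):.2f} nats/token")
```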
Make a tiny language model and then use it for autocorrect & autocomplete. Current autocorrect & autocomplete on smartphones is so bad that it spawns countless jokes. This use case doesn't require a large language model able to write whole paragraphs -- it only needs a model that's just barely big enough to make contextually appropriate suggestions of one or a few words.
I study LLM quantization and I have surveyed GPTQ and QuIP# and lots of quantization algorithms (specifically PTQ, post-training quantization) to develop my own, and my experience has led me to become extremely skeptical of many of the papers.
I've seen lots of headlines like "1-bit quantization" (including this one and https://arxiv.org/abs/2310.16795). What I've found in this space is that the headlines can often be intentionally misleading about what is actually achieved. If you read the abstract of this paper closely, it claims 8.41 perplexity on LLaMA2-70B at 1 bit, which is a HUGE degradation from the 3.120 perplexity of FP16, and they will never mention that in the headline. Even LLaMA2-7B at INT8 achieves 5.677 perplexity with half the storage (better with LESS space and LESS training).

Some claim 1.58-bit quantization (each weight is either -1, 0 or 1), but in practice they require very small group sizes, which means one or two extra FP16 numbers for every 64 weights, and that adds another 0.5 bit, so it's actually 2-bit quantization. And every quantization algorithm can claim it makes language models smaller, speedier, and more energy efficient, so there's nothing special about these.
Here are the key metrics that I suggest checking when comparing quantization schemes:
* Perplexity. Note that it also depends on the dataset (either WikiText2 or C4; WikiText2 numbers are usually lower than C4) and the context size (1024, 2048 or 4096; higher context sizes usually mean lower perplexity). Dataset and context size must match to make a meaningful comparison.
* Quantization bits. Many algorithms claiming 2-bit or 1-bit quantization have lots of extra parameters elsewhere, such as grouping. Download the quantized version of the model, check its file size, multiply by 8 and divide by the number of parameters. That gets you the ACTUAL quantization bits (see the small calculator sketched after this list).
* Performance. Weights may need to be dequantized during inference which could introduce overhead. Some libraries have custom matmul kernel for dequantization that achieves performance close to FP16, others can be slower. Check its generation speed and inference speed.
Newer architectures such as Ampere contain INT8 cores, which may make the quantized version even faster than FP16; I haven't tried that out yet.
There is also a lot of misleading comparison in this space. Some methods like GPTQ only provide vector-matrix multiplication kernels, which means only a single token can be generated at a time, and batched inference (which is needed for generating the initial KV cache, or for serving multiple users) can be much slower. If an algorithm claims a 3x speedup for something, check whether it refers to single-stream latency or multi-stream throughput. Some of that speedup comes from running a model on 2 cards instead of 5 cards, without specifying whether the cards have NVLink configured (you shouldn't run inference on multiple cards without NVLink, or you should expect a huge slowdown simply because of using 5 cards).
* Base model. Pick a STRONG base model like Llama-2-7b or Llama-3-8b etc. Not an undertrained model like SwitchTransformer etc which may have lots of redundant parameters in itself.
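The "actual bits" check and the group-size overhead from above, in one small helper (the 26 GB / 70B numbers are made up, just to show the arithmetic):

```python
def actual_bits_per_weight(file_size_bytes, n_params):
    # What the quantized checkpoint really costs, regardless of the headline claim.
    return file_size_bytes * 8 / n_params

def effective_bits(weight_bits, group_size=64, fp16_scales_per_group=2):
    # Nominal weight bits plus per-group fp16 scale/zero-point overhead.
    return weight_bits + fp16_scales_per_group * 16 / group_size

print(effective_bits(1.58, 64, 2))         # ~2.08: a "1.58-bit" scheme is really ~2-bit
print(actual_bits_per_weight(26e9, 70e9))  # hypothetical 26 GB file of a 70B model ~ 3 bits/weight
```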
My personal favourite remains QuIP# (https://github.com/Cornell-RelaxML/quip-sharp). It lacks in the "performance" part as its matrix multiplication performance isn't on par yet but there is room for improvement, and it wins every other metric. And sad news: it's very likely we won't have practical 1-bit LLMs, never ever. We are reaching the end game between 2.5~4 bits. By "practical" I mean it should beat 3-bit LLMs with 3x less parameters or 2-bit LLMs with half as many parameters. There is a Shannon limit to quantization whatever methods you use.
Completely agree on PTQ, but curious on your thoughts for QAT, specifically BitNet 1.58 - in that paper it looks like parameter to parameter the constrained precision weights had improved perplexity vs floating point weights, particularly as the model size increased.
While I'd love to see it scaled up to at least ~50B models, it looks like limited weight precision might actually offer improved network optimization over unconstrained weights for pretraining.
Do you think that work is misrepresenting the gains, or that QAT is a different beast where quantization isn't as much a tradeoff as a potential net gain across the board?
Can't speak for QAT as I haven't yet dived into that area. I've quickly skimmed the BitNet and BitNet 1.58 paper. I think achieving comparable performance with a Llama model with the same number of parameters is impressive but unfortunately it seems they didn't release the training code so I can only tell from their paper. Fortunately they did talk about training details in the BitNet paper (not in BitNet 1.58 so I assume they remain the same):
> Mixed precision training. While the weights and the activations are quantized to low precision, the gradients and the optimizer states are stored in high precision to ensure training stability and accuracy. Following the previous work [LSL+21], we maintain a latent weight in a high-precision format for the learnable parameters to accumulate the parameter updates. The latent weights are binarized on the fly during the forward pass and never used for the inference process.
In this case there are two areas to optimize for: training efficiency and inference efficiency.
If I understand correctly, it stores the weights, gradients and second-moment estimates in FP32 like every other mixed-precision training (the Gopher paper has details on why storing them in FP32 is important), and quantized weights are used in the forward pass. What I'm not sure about is whether latent weights are used in the backward pass, and my instinct is that the "straight-through estimator" requires high-precision latent weights, so they may still be needed. Training FLOPS can be roughly estimated as 6 FLOPs per parameter per token, where 2 is the forward pass, 2 is gradient computation and 2 is gradient accumulation (see https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-la...). If only the forward pass is quantized, this means only 1/3 of all FLOPS are optimized (and even then the result has to be accumulated in FP32). So I'm skeptical of the gains in training efficiency here, and I can't find the numbers (how much energy or how much time is used for training, compared to regular FP16 mixed-precision training? The papers boast inference energy savings, which makes me even more skeptical of training energy savings).
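The one-third figure drops straight out of that 6-FLOPs-per-parameter-per-token rule of thumb; a tiny sanity check:

```python
def training_flops(n_params, n_tokens):
    forward = 2 * n_params * n_tokens    # forward pass
    backward = 4 * n_params * n_tokens   # gradient computation + gradient accumulation
    return forward, backward

fwd, bwd = training_flops(7e9, 2e12)     # e.g. a 7B model on 2T tokens
total = fwd + bwd
print(f"total ~ {total:.2e} FLOPs, forward pass = {fwd / total:.0%} of them")
```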
For quantization efficiency, while QAT can certainly avoid the quantization step, PTQ methods are very cheap (usually <24 hours on RTX 4090 for Llama-2-70b) so I consider the cost of the quantization step negligible. There is not much difference in inference efficiency gains as PTQ and QAT can quantize to the same format. For final accuracy, unfortunately there is a lack of comparison between QAT and PTQ of fp16 models, and PTQ has the advantage of not requiring access to the original dataset, so I think it's very hard to make a fair comparison here but it's also likely the only area where QAT has actual gains compared to best PTQ methods.
Just on your very last point, I think you've nailed why a 1-bit quant of a bigger LM can't beat 3-bit quants of an LM a third the size, if what you mean is that more extreme compression of an LM is more likely to introduce harmful artefacts, so you need a better quantisation method at 1 bit than you do at 3 bits to end up with a model with the same information content retained.
What I don't think that tells us anything about is directly trained 1-bit LMs versus 3-bit LMs, because in that case there's no compression step to introduce quantisation artefacts. There might be an analogous training data size argument but it's not clear to me that there needs to be: a 3X parameter 1-bit LLM and a 1X parameter 3-bit LLM ought to be equivalent in terms of their information capacity.
You all run on 2-bit code (DNA) and you all seem pretty stable to me. You run your brain on a couple of watts and add big numbers without much hassle.
Right - it obviously isn't a binary option; you can pursue both in unison.
I would add that we are still in the early innings of renewable energy -- and let's keep deploying it rapidly to manage increased compute demand.
Any time we find more efficiency, we can trade it for more quality by doing more compute. We'll always use as much compute as we can afford, until we stop getting quality gains that are worth the added cost.
Isn't this article just about an optimization though, sans the title?
I don't much care for all the "oh but the energy usage" claims in most tech things: it's all electricity, and it's all fungible. It usually seems to roll out as a proxy for "I don't like this thing".
Like even with cryptocurrency, there were a lot of people mistaking the issue of scalability - namely that "as a store of value" crypto would consume incredible amounts of other resources (and a lot of people got stuck trying to figure out how somehow "a hash" could be reclaimed for useful resources) to do less than alternatives, with "the energy usage itself is the problem".
Finding optimizations for LLMs is good because it means we can build cheaper LLMs, which means we can build larger LLMs than we otherwise could for some given constraint, which means we can miniaturize (or in this case specialize) more capable hardware. The thing which really matters is, can the energy usage be meaningfully limited to a sensible scaling factor given the capability that makes them useful?
Because environmentally, I can install solar panels to do zero-carbon training (and if LLMs are as valuable as they're currently being priced, this is a no-brainer - if people aren't lying about solar being "cheaper than fossil fuels").
> Like even with cryptocurrency, there were a lot of people mistaking the issue of scalability - namely that "as a store of value" crypto would consume incredible amounts of other resources (and a lot of people got stuck trying to figure out how somehow "a hash" could be reclaimed for useful resources) to do less than alternatives, with "the energy usage itself is the problem".
To be fair, technology-wise they mostly solved this problem via proof-of-stake.
From an individual point of view you still expend enormous resources as a miner / validator in a proof-of-stake system. It's just that now the resources come in the form of lost opportunity costs for your staked tokens (eg staked Ethereum).
But from the aggregated perspective of society, staked Ethereum is essentially free.
That has some parallels to how acquiring regular money, like USD, is something individuals spend a lot of effort on. But for the whole of society, printing USD is essentially free.
> Because environmentally, I can install solar panels to do zero-carbon training (and if LLMs are as valuable as they're currently being priced, this is a no-brainer - if people aren't lying about solar being "cheaper than fossil fuels").
There's still opportunity costs for that energy. Unless you have truly stranded electricity that couldn't be used for anything else.
> Finding optimizations for LLMs is good because it means we can build cheaper LLMs, which means we can build larger LLMs than we otherwise could for some given constraint, which means we can miniaturize (or in this case specialize) more capable hardware. The thing which really matters is, can the energy usage be meaningfully limited to a sensible scaling factor given the capability that makes them useful?
I agree with that paragraph. It's all about trade-offs. If we can shift the efficiency frontier, that's good. Then people can decide whether they want cheaper models at the same performance, or pay the same energy price for better models, or a combination thereof. Or pay more energy for even better models.
It’s good that we can build cheaper LLMs, but the problem is that companies guzzling energy for LLM training won’t use less energy; they’ll just have better models.
That energy is still on the order of a household’s yearly electricity, not that of Argentina as cryptos were, and that’s just for training.
Inference is much cheaper and arguably provides quite a lot of value (even though I also think it is overhyped) for very little energy consumption; probably more is lost to inefficiency for any physical product.
Llama 1, 2, and 3 all have different architectures and needed to be trained from scratch.
Llama 1 was released February 2023.
Same training story for OpenAI’s Sora, DALL-E, and 4o; all of Mistral’s models; Mamba; KAN; and each version of RWKV (they’re on 6 now).
Note that this list is a result of survivorship bias.
It’s only looking at their published models too. Not the probably 1000s of individual training experiments that go into producing each model.
Which is still absolutely nothing compared to something like YouTube’s servers, which is absolutely nothing compared to something like the food industry.
Like, if a couple of millions of people can use chatgpt in the manner they do today, would it matter if a house’s yearly energy budget was used up for that? Or 10?
If you train on a cluster that costs >$1M/day to operate, the wait time is likely to be a smaller concern than the financial cost, unless you're REALLY in a hurry to beat some competitor.
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
Are you going to spam this same link in every single thread about LLMs on HN? People have provided good arguments refuting whatever you're trying to say here, but you just keep posting the same thing while not engaging with anyone.
No, the answers aren't just "plausible", they are correct the vast majority of the time. You can try this for yourself or look at any benchmark, leaderboard or even just listen to the millions of people using them every day. I fact check constantly when I use any LLM, and I can attest to you that I don't just believe that the answers I'm getting are correct, but that they actually are just that.
But they apparently actually don't get better even though every metric tells us they do, because they can't? How about making an actual argument? Why is correctness "not a property of LLMs"? Do you have a point here that I'm missing? Whether or not Kahneman thinks that there are two different systems of thinking in the human mind has absolutely no relevance here. Factualness isn't some magical circuit in the brain.
> No such thing can exist.
In the same way there can exist no piece of clothing, piece of tech, piece of furniture, book, toothpick or paperclip that is environmentally friendly; yes. In any common usage, "environmentally friendly" simply means reduced impact, which is absolutely possible with LLMs, as is demonstrated by bigger models being distilled into smaller more efficient ones.
Discussing the environmental impact of LLMs has always been silly, given that we regularly blow more CO2 into the atmosphere to produce and render the newest Avengers movie or to spend one week in some marginally more comfortable climate.
No, they are not correct -- the answer it gives might accidentally be correct, but it cannot be trusted; you still need to do research to verify everything it says, and so the only usable standpoint is to use it as a bullshit generator, which it is very good at.
What's your definition of "correct" then? If a system is "accidentally correct" the majority of the time, when does it stop becoming "accidental"? You cannot trust any system in the way you want to define trust. No human, no computer, no thing in the universe is always correct. There is always a threshold.
I do research with LLMs all the time and I trust them, to a degree. Just like I trust any source and any human, to a degree. Just like I trust the output of any computer, to a degree. I don't need to verify everything they say, at all, in any way.
Genuine question, how do you think an LLM can generate "bullshit", exactly? How can it be that the system, when it doesn't know something, can output something that seems plausible? Can you explain to me how any system could do such a thing without a conception of reality and truth? Why wouldn't it just make something up that's completely removed from reality, and very obviously so, if it didn't have that?
Never. As long as it is a probabilistic token generator, it can not be correct, it's that simple.
And it creates plausible text because it is trained on what humans have produced so it looks plausible. As someone put it, they found a zero day in the OS of the human brain.
At this point, I strongly urge you to think about what could possibly change your mind. Because if you can't think of anything, then that means that this opinion is not founded on reasoning.
The text LLMs produce is not just plausible in a "looks like human text" sense, as you'd very well know if you actually thought about it. When ChatGPT generates a fake library that looks correct, the library must seem sensible enough to fool people. This can't be just a language trick anymore; it must have a similarity to the underlying structure of the problem space to look reasonable.
The fact that you refuse to engage with my points tells me otherwise.
You're drawing meaningless distinctions; anyone who has ever used Cyc will tell you that it makes massive mistakes and spits out incorrect information all the time.
But that is even true of humans, and every other system you can imagine. Facts aren't these magical things living in your brain, they're information with a high probability of accurately modeling reality.
When someone tells you x happened in y at time z, that only becomes a fact to you if the probability of the source being correct is high enough; that's it. 99% of all of your knowledge is only a fact to you because you extracted it from a source that your heuristics told you is trustworthy enough. There is never absolute certainty, it's all just probability.
> Facts aren't these magical things living in your brain, they're information with a high probability of accurately modeling reality.
Truly people have completely lost it because of the AI hype.
There are facts. They are not probabilistic, they are just that: facts. Despite Mencken's 1917 essay "A Neglected Anniversary", which became really popular, the bathtub didn't arrive in the United States in 1842 and it didn't become popular because President Fillmore installed one. A Kia ad in 2008 still referred to this without realizing it's a made-up story written to distract from World War I. https://chatgpt.com/c/6b1869a7-c0d7-46e9-bcb5-7a7c78dc3d53 https://sniggle.net/bathtub.php
Notably in 1829 the Tremont Hotel in Boston had indoor plumbing and baths (copper and tin bathtubs) and in 1833 President Andrew Jackson had installed iron pipes in the Ground Floor Corridor and a bathing room in the East Wing. Well before 1842.