Hi, that's my blog post. I'm pretty sure about what I wrote here, but may need the authors to chime in in case I am missing something. I just submitted an issue on GitHub: https://github.com/bazingagin/npc_gzip/issues/3
You may want to consider adding a note to the top. Seems like a lot of people are lazily skimming/reading the headline and see it as "gzip paper full of beans, gzip approach sucks" when really I see this as "gzip approach not better than dnn models but mostly competes and much cheaper to run". The paper is still solid.
>gzip approach not better than dnn models but mostly competes and much cheaper to run
Does it? It appears to do worse than FastText in all benchmarks, and kNN is not a cheap algorithm to run, so it might actually be slower than FastText.
edit: It looks like FastText takes 5 seconds to train on the Yahoo Answers data set while the gzip approach took them 6 days. So definitely not faster.
I'm not familiar with most of these models in detail, but training time is generally less interesting than inference time to me. I don't care if it takes a month to train on $10k of gpu rentals if it can be deployed and run on a raspberry pi. I should definitely look into fasttext though.
As described in the paper, it didn't look like the gzip classifier trained at all. Inference involved reading the entire training set.
One could surely speed this up by preprocessing the training set and snapshotting the resulting gzip state, but that wouldn't affect the asymptotic complexity. In effect, the number of parameters equals the size of the entire training set. (Of course, lots of fancy models scale roughly like this, too, so this isn't necessarily a loss.)
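For anyone who hasn't read the paper, here is a minimal sketch of the approach as it is described there (my own illustration, not the authors' repo code): compute the normalized compression distance between the query and every training sample, then take a kNN vote.

```python
import gzip
from collections import Counter

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = len(gzip.compress(x))
    cy = len(gzip.compress(y))
    cxy = len(gzip.compress(x + b" " + y))  # the paper joins the texts with a space
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(query: str, train: list[tuple[str, str]], k: int = 2) -> str:
    """Label a query by a k-nearest-neighbour vote under NCD.

    Note: every prediction compresses the query together with the *entire*
    training set, which is why there is no training step and why inference
    cost grows with the number of training examples.
    """
    q = query.encode()
    ranked = sorted(train, key=lambda pair: ncd(q, pair[0].encode()))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```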
The gzip approach is much slower at inference time because you need to compute the gzip compression of the concatenated strings (query + target). Intuitively, this should be significantly more work than a dot product of two embedding vectors.
The latter depends very strongly on how much computation is needed to compute those embedding vectors.
If you run a GPT-3.5-sized model to compute that embedding (which would be a bit absurd, but if you really want GPT-3.5-quality classification, you may well be doing something like this), you're looking through quite a few tens of billions of parameters and doing a correspondingly large number of FLOPs, which could be just as expensive as running gzip over your whole (small, private) training set.
No, because the compute scales with the size of the training set: if you have n training examples, you need to do n gzip compressions at inference time. In the embedding world, you only call the embedding model once on insert, and only need a dot product at inference time.
The same logic extends to using a self-hosted embedding model, which tends to be as good as Ada on most benchmarks and, yes, can be fine-tuned over your private data.
>The latter depends very strongly on how much computation is needed to compute those embedding vectors.
Sure, but the gzip metrics are worse than FastText, which computes the embeddings in essentially no time: tokenize, look up embeddings by token id, and then do some averaging. So compared to that, the gzip approach is very slow.
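Roughly, that flow looks like this (a toy sketch with a made-up vocabulary and random vectors standing in for trained parameters; real FastText also hashes subword n-grams and trains the linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "game": 1, "ended": 2, "in": 3, "a": 4, "draw": 5}
dim, n_classes = 16, 4
token_emb = rng.normal(size=(len(vocab), dim))  # learned embeddings in real FastText
class_w = rng.normal(size=(n_classes, dim))     # learned linear classifier

def embed(text: str) -> np.ndarray:
    """Tokenize, look up token embeddings, average."""
    ids = [vocab[t] for t in text.lower().split() if t in vocab]
    return token_emb[ids].mean(axis=0)

def classify(text: str) -> int:
    scores = class_w @ embed(text)  # a handful of dot products per query
    return int(np.argmax(scores))

print(classify("the game ended in a draw"))
```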
Sure, but its existence means the statement is really "gzip approach not better than dnn models, and neither competes with nor is cheaper to run than earlier models like FastText." That's not a very meaningful value statement for the approach (although why gzip is even half-decent might be a very interesting research question).
I honestly don't know why anyone would use this gzip approach in production. If you want to do text classification, really the two options you should consider are a best-in-class linear model, like confidence-weighted linear classification by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml_variance.p...), or a much more expensive LLM.
Do you happen to be familiar with audio classification? There's been a ton of research on text classification and prediction, but not many good papers I've seen for general audio classification. I'm talking more feature extraction, not speech recognition. There are a lot of speech recognition papers. So far I've been stuck on an FFT + image-processing pipeline, but I haven't gotten great results in real-world tests, only on nice test datasets.
Personally I don't have much experience working beyond mlp/rnn/lstm/cnn models.
Don't look at it as a suggestion to use gzip in production, but an invitation to reconsider the unassailable superiority of BERT over simpler, tailored solutions.
I don't think anyone actually doing NLP research has thought that BERT is always better than simpler methods. Linear classifiers with n-grams or, even better, large-margin linear classifiers are well known to be competitive with things like BERT on a variety of tasks, with orders of magnitude better runtime.
In contrast, this gzip technique is considered a cute application of information theory, but even in academia is rarely included in studies because there are simpler and better techniques for NLP.
Yes, if you are chasing the ultimate accuracy, then using an LLM (not necessarily BERT either) is going to be the best. But for a practical system, trading some accuracy for vastly improved runtime is usually a very good trade-off. And again, it depends on your domain. Topic classification? Stick with a linear model. Sentiment analysis? OK, here an LLM actually gives substantially better results, so it's worth the extra cost if sentiment is crucial to your application.
I personally like the CW algorithm I mentioned because it's relatively easy to implement and has excellent qualities. But if I were a dev looking for a ready-to-go, already-implemented production system, I'd go for Vowpal Wabbit and move up to an LLM if I'm not getting the accuracy I need for my application.
Is it really an invitation? The paper shows that the current models are worse for some marginalized languages that are used as OOD datasets. I am not really that surprised that the models don't speak those, and I don't know anybody who would use BERT like that.
But FastText (2015) already exists and beats this gzip approach on all criteria. So the invitation has already existed before BERT (2018) and continues to exist.
The claims in the abstract and claim 3 in the paper, as well as much of the publicity around the paper, are just wrong.
It takes gzip from being great out of domain to being middling at best. It goes from something really interesting to a "meh" model. The main part that was intellectually interesting is how robust gzip is out of domain; if that's gone, there isn't much here.
If I was the reviewer for this paper, this would take the paper from an accept to a "submit to a workshop".
Hi, I'm the first author of the paper, and I read your blog post. The reason I chose k=2 is that it's recommended to use n^{1/2}, and I wanted to pick a k that's compatible with the 5-shot setting. But you are right, this is a bit odd. As I mentioned in both the paper and on Twitter, different values of k will affect the result, and I reported the _max_ result we can get, so it does mean an ideal situation when the guess is always right. I also use this strategy for W2V and SentBERT. But it doesn't make the result top-2 accuracy. As far as I know, top-2 accuracy means that if either of the top-2 predicted classes is correct, we score it as correct. However, as you mentioned, in kNN with k=2 there is a situation where the 2 nearest neighbours point to the same class, so we would be missing another class candidate if we reported top-2 accuracy. But I would love to add results for different strategies and different k values when I upload the newest version to arXiv when I have time. The strategy of "decrement" you mentioned in the blog is really nice, and I would love to add it to the repo if you want.
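To make the two scoring rules being discussed concrete for k=2, here is a small illustrative sketch (not the paper's or the repo's code), taking the labels of the two nearest neighbours in order of distance:

```python
def correct_under_max(nearest: list[str], true_label: str) -> bool:
    """'Maximum possible accuracy' reporting: when the two nearest
    neighbours disagree (a 1-1 tie), count the prediction as correct
    if either of them matches the true label."""
    a, b = nearest
    return a == b == true_label or (a != b and true_label in (a, b))

def correct_under_decrement(nearest: list[str], true_label: str) -> bool:
    """'Decrement' tie-breaking: on a tie, fall back to k=1, so the
    single nearest neighbour decides. For k=2 this always reduces to
    trusting the nearest neighbour."""
    return nearest[0] == true_label

# Example: true label "sports"; the two nearest neighbours disagree.
print(correct_under_max(["tech", "sports"], "sports"))        # True
print(correct_under_decrement(["tech", "sports"], "sports"))  # False
```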
I apologize for the short and late reply; I haven't checked the repo yet. I'm preparing for tomorrow's thesis defence and will reply & resolve the issue when I finish.
One question, did you try to replicate the other result table (Table 3)?
If I understand correctly, top-2 accuracy would be 1 if you have only 2 classes, but it will differ from "normal" accuracy less and less as the number of classes increases (on average). So this shouldn't change the results for table 3 thaaat much as the datasets have large amounts of classes (see table 1).
In any case, top-2 accuracy of 0.685 for the 20-newsgroups dataset is pretty neat for a method that doesn't even consider characters as characters[1], let alone tokens, n-grams, embeddings, and all the nice stuff that those of us working on NLP have been devoting years to.
[1] In my understanding of gzip, it operates only on raw byte sequences, which are not necessarily aligned with characters or words.
I haven't yet replicated Table 3 because most of those datasets are much larger, and it will take a while to run (they said the Yahoo Answers dataset took them 6 days).
Also, I have only tried the "gzip" row because that is all that is in the github repo they referenced.
Yeah, you're right, the more classes there are, the smaller the effect this will probably have.
We're adult enough to have discussions like this in public. They are healthy to have. People make mistakes. Kudos to the original authors for releasing the source code so people could inspect and replicate their results.
I agree, and just want to add: nowadays it's super common for researchers to widely publicize their new work on social media. The blog post here even mentions "Table 5 from the paper was often included in tweets".
In this context of sharing your results very publicly, it seems only fair that rebuttals would be very public, too. Otherwise researchers would be highly incentivized to very publicly publish weak results because they would get a lot of positive publicity when they share the results, but not much negative publicity when the weaknesses are found and shared.
I did not, but I see why that could be a better approach. I'm mainly trying to be more open with the little side projects I do, so I want to start blogging about what I'm working on. Also, this paper was being widely discussed, so I thought this would be one more entry in that discussion.
>k=2 is a bit of an odd choice for kNN classification
That's an understatement. Choosing an even number for any sort of voting algorithm doesn't make much sense, and choosing 2 specifically probably makes the least sense of all.
Yeah that is a big red flag - as the OP mentions, there is basically no way of making k=2 statistically different from k=1, that's why nobody uses it.
I suppose the authors just tried many different k values and selected k=2 because it performed surprisingly well (likely due to the bug the OP found). But if the results were significantly better than for k=1 or k=3, it's a bit weird that the authors never double-checked why that was the case. I guess it can be one of those things you settle on early in the overall process with a few experiments and just take for granted afterwards, never checking it again, but still, it sounds like something that should pop out at some point while writing the paper...?
Yeah, I think this is one part where a reviewer or advisor could have focused questions.
There is a sentence in Appendix C: "We set k = 2 for all the methods on all the datasets and we report the maximum possible accuracy getting from the experiments for each method."
I'm not sure what the second part of that means exactly.
Yes, agreed. One small point: in the multi-class case (more than just two classes), which covers all the datasets here, you can still get ties for odd k, e.g. with k=3 you can get 1 vote each for 3 different classes, etc.
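A quick illustration of that kind of multi-class tie with odd k (toy labels, k=3):

```python
from collections import Counter

# Labels of the 3 nearest neighbours for some query: one vote each.
nearest = ["sports", "tech", "politics"]
votes = Counter(nearest)
print(votes.most_common())
# [('sports', 1), ('tech', 1), ('politics', 1)]  -> a three-way tie,
# so even with odd k you still need a tie-breaking rule.
```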