AI is currently just glorified compression (twitter.com/chombabupe)
77 points by kklisura on Nov 24, 2023 | 70 comments


This is covered in "Information Theory, Inference, and Learning Algorithms" by David MacKay ( https://www.inference.org.uk/itprnn/book.pdf ):

> Why unify information theory and machine learning? Because they are two sides of the same coin. In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

* In compression, gzip is predicting the next character. The model's prior is "recently seen character sequences will likely recur". This prior holds well for English text, but not for h264 data. (See the sketch below these bullets.)

* In ML, learning a model is compressing the training data into a model + parameters.
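A minimal sketch of that first bullet, assuming Python and its standard gzip module (the same DEFLATE family as the gzip tool): repetitive English-like text compresses far better than uniform random bytes, because gzip's implicit prior matches the former and not the latter.

    import gzip, random

    english = b"the quick brown fox jumps over the lazy dog. " * 200
    noise = bytes(random.getrandbits(8) for _ in range(len(english)))

    # gzip "predicts" repeats well, so the English-like text shrinks dramatically,
    # while the incompressible noise barely shrinks at all.
    print(len(english), "original bytes")
    print(len(gzip.compress(english)), "bytes after gzip (repetitive text)")
    print(len(gzip.compress(noise)), "bytes after gzip (random bytes)")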

It's not a damning indictment that current AI is just compression. What's damning is our belief that compression is a simpler/weaker problem.


Completely agree. Every model/dimensionality-reduction is in some sense compression, isn't it?

We are taking a problem space with more parameters and detail and reducing it to a solution space with fewer parameters and possibly less detail (depending on whether the solution is exact or not which maps onto the compression being lossy or not).

If I take the set of all points that are equidistant from a given point and line, that is an infinite set. But I can compress it down to three real numbers if I know that set can be represented as y = ax^2 + bx + c, and no one goes "quadratic equations are just compression".

A lot of people try to generate mileage out of the idea of the compression being _lossy_ also. That's an intellectual dead end in the same way. Lots of useful models are a lossy compression.


I don’t think the statement is annoying to some because they see it as a damning indictment of AI.

It’s because it’s not literally true. If it were literally true, a zip file would be ChatGPT, but it is not. The paper was named in an intentionally provocative way.

What they really mean is, at the heart of these transformer models, compression is the fundamental principle or goal.

This trend in naming papers has evolved over time. It wasn't always this way.

Some would say a provocative title is thought-provoking and drives a wider interest.

Others would say it’s a research paper and things in it should be literally true.

edit to add direct link to paper:

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? https://arxiv.org/abs/2311.13110


Compression and prediction are equivalent, and both measure some function of intelligence and knowledge. Current LLMs are heavy on the knowledge side, being trained on ~all the text.


I haven't read MacKay's work, so maybe this is naive, but I think the belief is that compression is a deterministic endeavor, whereas AGI may not be. The ability to recall information is incredibly useful; the ability to do so in convenient and flexible ways doubly so. The level of complexity being proportional to the GPT's ability to interpret means that compression is not necessarily a simple problem, like you say.

However, if intelligence involves nondeterministic traits, i.e. something in the vein of creation of the present, AGI could be a significantly different problem to solve than compression. I think there's at least an intuition that this is the case, which explains the belief that compression is a simpler/weaker problem.

As an aside, I'm currently unsure of my position on this.


I'm convinced that deterministic AI will never fully replicate human intelligence. I will attempt to explain my reasoning.

I can say with absolute certainty that I have subjective experience. If my behaviour is deterministic, I can imagine a philosophical zombie version of me that behaves the same but doesn't have subjective experience. That zombie, using the exact same reasoning as me, down to individual particle interactions, will (incorrectly) determine with absolute certainty that it has subjective experience and is not a philosophical zombie. Its reasoning, which is my reasoning, is thus flawed. Therefore, in order to believe my behaviour is deterministic, I must doubt my ability to reason.

(Some claim that philosophical zombies aren't possible to imagine. But I can imagine them, so if that's impossible then my reasoning is still flawed just for a different reason.)

I believe the same argument applies even if I assume my behaviour is non-deterministic but follows a probability distribution which is a function of the past -- the philosophical zombie has the same probability of reasoning that it has subjective experience as I do, and the rest of the argument is similar. Thus, I believe my behaviour is not only non-deterministic, but mathematically ineffable, and thus outside the realm of what computers can do.

If you have subjective experience which works similar to mine, maybe you can follow my reasoning. If you don't, the argument may appear to be nonsense due to the ineffability of subjective experience.


Your “experience” of subjectivity is really just an illusion.

Your subconscious mind makes very deterministic decisions milliseconds before your conscious mind is even made aware of them; it feels like you're deciding consciously, but according to many studies you're absolutely not.


And critically, those deterministic, predictable, mechanical responses still give rise to conscious experience of self (Which for me brings into doubt the possibility of p zombies).


I don’t quite understand your zombie argument but imagine it would be quite easy to inject randomness into an AI agent.


We are already doing it; LLMs themselves are fully deterministic, so we inject randomness through a sampling seed and control the temperature.


Actually, due to parallel execution they do not run deterministically. Even at temperature zero some randomness is left in, because the order of floating-point operations can vary between runs. You can run a model deterministically, but it is incredibly slow.


Pseudo-randomness.


Your argument assumes that p-zombies can exist, which is unproven and I consider highly specious.


It’s also unproven anyone is conscious except me.


> that compression is a deterministic endeavor, whereas AGI may not be.

Fabrice Bellard famously made a compression tool using a Transformer [1]. So if transformers can be deterministic... why not AGI?

[1] https://bellard.org/nncp/


LLMs are deterministic in principle. It’s like JPEG where it’s a lossy compression plus some deliberate injection of noise to add variety.
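A toy sketch of that "deterministic model plus injected noise" picture, in Python/NumPy with made-up logits (not any real model's API): at temperature 0 the pick is a deterministic argmax; at temperature > 0 a seeded RNG supplies the variety.

    import numpy as np

    logits = np.array([2.0, 1.0, 0.5, -1.0])  # pretend next-token scores from a fixed, deterministic model

    def sample(logits, temperature, seed):
        rng = np.random.default_rng(seed)
        if temperature == 0:
            return int(np.argmax(logits))          # greedy: fully deterministic
        p = np.exp(logits / temperature)
        p /= p.sum()                               # softmax with temperature
        return int(rng.choice(len(logits), p=p))   # seeded, reproducible randomness

    print(sample(logits, temperature=0, seed=0))      # always the same token
    print(sample(logits, temperature=1.0, seed=123))  # varies with seed and temperature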


I’m unsure human intelligence is nondeterministic.


I love David MacKay's brilliant work on the Dasher text input system, which draws deeply from his work on information theory -- imagine Dasher integrated with an IDE and code search and Copilot and language model!

"Writing is navigating in the library of all possible books." -David MacKay

We just allocate more shelf space to the more probable letters.

Why isn't Dasher built into every operating system and mobile phone?

https://en.wikipedia.org/wiki/Dasher_(software)

https://dasher.acecentre.net/about/

https://news.ycombinator.com/item?id=17105728

DonHopkins on May 18, 2018 | on: Pie Menus: A 30-Year Retrospective: Take a Look an...

Dasher is fantastic, because it's based on rock solid information theory, designed by the late David MacKay. Here is the seminal Google Tech Talk about it:

https://www.youtube.com/watch?v=wpOxbesRNBc

Here is a demo of using Dasher by an engineer at Google, Ada Majorek, who has ALS and uses Dasher and a Headmouse to program:

https://www.youtube.com/watch?v=LvHQ83pMLQQ

Another one of her demonstrating Dasher:

Ada Majorek Introduction - CSUN Dasher

https://www.youtube.com/watch?v=SvsSrClBwPM

Here’s a more recent presentation about it, that tells all about the latest open source release of Dasher 5:

Dasher - CSUN 2016 - Ada Majorek and Raquel Romano

https://www.youtube.com/watch?v=qFlkM_e-sDg

Here's the github repo:

Dasher Version 4.11

https://github.com/GNOME/dasher

>Dasher is a zooming predictive text entry system, designed for situations where keyboard input is impractical (for instance, accessibility or PDAs). It is usable with highly limited amounts of physical input while still allowing high rates of text entry.

Ada referred me to this mind bending prototype:

D@sher Prototype - An adaptive, hierarchical radial menu.

https://www.youtube.com/watch?v=5oSfEM8XpH4

>( http://www.inference.org.uk/dasher ) - a really neat way to "dive" through a menu hierarchy, or through recursively nested options (to build words, letter by letter, swiftly). D@sher takes Dasher, and gives it a twist, making slightly better use of screen real estate.

>It also "learns" your typical usage, making more frequently selected options larger than sibling options. This makes it faster to use, each time you use it.

>More information here: http://beznesstime.blogspot.com and here: https://forums.tigsource.com/index.php?topic=960

Dasher is even a viable way to input text in VR, just by pointing your head, without a special input device!

Text Input with Oculus Rift:

https://www.youtube.com/watch?v=FFQgluUwV2U

>As part of VR development environment I'm currently writing ( https://github.com/xanxys/construct ), I've implemented dasher ( http://www.inference.org.uk/dasher ) to input text.

One important property of Dasher is that you can pre-train it on a corpus of typical text, and dynamically train it while you use it. It learns the patterns of letters and words you use often, and those become bigger and bigger targets that string together so you can select them even more quickly!

Ada Majorek has it configured to toggle between English and her native language so she can switch between writing email to her family abroad and co-workers at google.

Now think of what you could do with a version of dasher integrated with a programmer's IDE, that knew the syntax of the programming language you're using, as well as the names of all the variables and functions in scope, plus how often they're used!

I have a long term pie in the sky “grand plan” about developing a JavaScript based programmable accessibility system I call “aQuery”, like “jQuery” for accessibility. It would be a great way to deeply integrate Dasher with different input devices and applications across platforms, and make them accessible to people with limited motion, as well as users of VR and AR and mobile devices.

https://web.archive.org/web/20180826132551/http://donhopkins...

Here’s some discussion on hacker news, to which I contributed some comments about Dasher:

A History of Palm, Part 1: Before the PalmPilot (lowendmac.com)

https://news.ycombinator.com/item?id=12306377


How should we unify these things? Most of it seems basically the same - we maximize log-likelihoods instead of calling it minimizing surprisal - but it's clearly the same thing. Are there any other ways to integrate these things together?


Yes, this is a well known thing to anyone paying attention to information theory.


Compression is abstraction.


"Jet engines are just glorified oil lamps."

Not that I think the current AI is as life-changing as is purported, but this comparison is terrible. Almost all complex software is made up of a bunch of other previous simpler technologies.


I think the point was that the current transformer model paradigm will never be able to reach AGI, no matter how far you take it. It needs something fundamentally more to be able to do that. But maybe as you say, that something will be built on top of transformer technology.


I love how I've seen Ilya talk about it. If we could find the shortest program to reproduce the training set, that would be optimal compression. But it's an intractable problem; we can't even come close, there's just no way to come at it.

But with deep learning we can instead find a circuit that approaches reproducing the training set.

This is lossy compression. There's nothing "just glorified" about it though; the result is astounding.

A more appropriate takeaway might be that sufficient compression is mind-bendingly more powerful than intuition might otherwise guess.

Does calling it "just glorified" guide any intuition that in order to compress amazon reviews a neural net is gonna have weights that correspond to sentiment? Does it tell you that such compression also ends up producing an artifact that can be put to work in a generative framework? And that it'll be a very useful generative framework because such compression required weights that correspond to all sorts of useful ideas, a compression of something more meaningful than just text?

Calling it "just glorified X" is clickbait. It's compression alright, and it's either 1) also whole lot more, or 2) compression is a whole lot more wild and impressive than you thought, or both.


The "sentiment" weights are just the supervised classification part on top of the "compression" piece. An autoencoder is closer to pure compression. SVD is compression as well. SVD is also useful to solve some equations.


but currently it doesn't seem very useful.

it hallucinates too much, there's no reflexivity, no critical reasoning. sure, nice prompts can give it some structure, and with a lot of pre- and post-processing (i.e. RAG and other ways of forcing it to pick from a list) it's good for information retrieval.


As usual, when compression is brought up in the context of AI, it seems relevant to mention the Hutter Prize:

https://en.wikipedia.org/wiki/Hutter_Prize


We also used to say that compression is the science of prediction.

Scientific models are also tools used to predict the results of experiments.

So if AI is able - through « compression » - to build models that help us predict experiments and understand the world better, it fully deserves the « intelligence » label.


intelligence is adaptive problem-solving ability. one way of doing that is to observe the problem, build a model of it, come up with possible strategies, test them through simulation, and pick the best one.

current GPT with good prompts is a very nice puzzle piece of this, but it's not there... yet.


This wasn't obvious to me a while ago, but I've come around to it. To me the important thing is that it's not only compression, it's lossy compression.

Models (any kind) are basically an imperfect approximation of some system, and that definition is precisely what's identified here.

You can demonstrate it by assuming some sequential process that gets an "AI" like an LLM to generate as much content as it can[1], then train a new generation model on this content. Then use the new model to generate as much content as it can, train a third generation model, and so on.

LLM->generate->LLM'->generate'->LLM''->generate''->...->LLM'...'

Since each generation may not produce content with every possible token it has been trained on, it is probable that eventually some tokens, especially low-probability tokens, will simply be left out of the generated content. The next model will therefore lose that token, and it will never be part of the content it generates, and so on.

After enough generations of this process, you eventually end up with a model with a single, or no, tokens it can use to generate text.
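A toy simulation of that token-loss loop, with made-up vocabulary and corpus sizes (a Python/NumPy sketch, not a claim about real training dynamics): sample a "corpus" from a distribution, re-fit the distribution to the sample, repeat, and watch the rare tokens disappear.

    import numpy as np

    rng = np.random.default_rng(42)
    vocab = 1000
    probs = rng.dirichlet(np.full(vocab, 0.1))  # skewed initial token distribution

    for gen in range(20):
        sample = rng.choice(vocab, size=5000, p=probs)   # "generate a corpus"
        counts = np.bincount(sample, minlength=vocab)
        probs = counts / counts.sum()                    # "train the next model" on it
        print(f"generation {gen + 1}: {np.count_nonzero(probs)} tokens survive")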

The problem is that the models are trained on so much information that they're effectively models of the textual content of our civilization. If we hand off the reins to these systems, they may work well for a while, and may even seem to produce novel ideas (but which are secretly just old ideas in combinations we haven't seen before), but they'll also produce content which will inevitably be used to train later models. As the content these models produce starts to dominate the ratio of human-text:model-text, there will be less of a reason to preserve the human text and we'll end up with the scenario above.

Things may work well for a while, perhaps a long time, but even if we don't end up with models trained on one token, we will end up with fewer and fewer ideas-as-text represented within them. Civilization will stagnate, the singularity will not commence, and things will slowly regress.

Of course this all presumes research stops, and somehow LLMs become religion and we turn society over to them. Which of course isn't what's going to happen. Is it?


I agree, however in that scenario, are people fully stopping their content generation? Because if not, the pool will grow larger and the problem you described would still be there, but to a lesser extent.



think about video compression for a "zoom" call.

You could just send a lot of bitmap files. Or, you could save bandwidth and zip them before sending.

Or you could integrate some image-specific compression into the file format, as in motion-jpeg. And in mpeg2/h264/h265 you supercharge this with temporal compression, not just adjacent pixels, not just blocks within the frame, but also pixels and blocks from adjacent frames are used to predict each pixel.

And now think about not sending video, just a single initial jpeg, some face shape data, and now a continuous stream of position data of a dozen or so points on your face and the facial movements are reconstructed. (Nvidia research project from last year)

And now think about no longer sending images at all, just coordinates for points in latent space (aka very abstract image descriptions) and a super-fast diffusion model "generates" the right frames on the fly.

Where does "compression" end, where does "AI" start? Ultimately, the marketing people decide.

Don't get hung up on the term "AI". "AI" is a bullshit hype buzzword. Calling it such serves no scientific-practical purpose, it is solely meant to impress venture capital.


Once again, so does the human brain. You aren't remembering a high bandwidth stream of raw sensory input but a distilled essence of that.


It's fun to see people endlessly make this point while providing no mechanisms for how it works. We might say "it surely must be this way" but we haven't shown it affirmatively in any sense. Until a mechanism is shown, I think it's really irresponsible to make statements like this. You really have no clue what it's doing.


We know memories are essentially recreated when remembered, we know that eye-witness testimony is spectacularly prone to hallucinatory-details-in-the-llm-sense, we know that focusing attention on one aspect of a video will cause participants to simply not see remarkably obvious and noteworthy details of that video. No, I think it's safe to say that it's been shown pretty affirmatively.

Wings were inspired by birds, and although they mechanically function in a manner entirely unlike a bird's wing, the aerodynamics of aircraft wings fully explain why birds can fly.


Bird wings are actually quite a bit more sophisticated than aircraft wings. Don't let that distract you from your point about belittling human cognition though.


Did he say otherwise? The point is... who cares? Planes fly, and they do so very, very well. They even fly faster than any bird will ever be able to. So who cares that bird wings are "quite a bit more sophisticated"? Sophistication is not and never was the goal.


We should care because we're all meat. Do you want to live in a society where people are devalued because a computer program approximates memory (badly)? Humans ought to be treated exceptionally regardless of how well a computer is able to do things. Computers are already ridiculously better at arithmetic than us; that doesn't devalue human rights. If you think you've got a system close to cognition, you had better be god damn sure, because the alternative is an enormous amount of human suffering.


Yes, that was exactly the point. Technology developed in imitation of a much more sophisticated and incompletely understood animal example, for which a more complete theory was developed. And that more complete theory eventually became sophisticated enough to comprehensively explain the biological artifact.

(And in spite of that, we still keep birds around, because not everything of value about birds is captured by an aerodynamic analysis of a bird's wing. Obviously.)


There's no requirement that a mechanism be known in order to show that something is true.

There are lots of other ways of deriving and inferring facts. There's nothing irresponsible about it whatsoever.


It does feel like if the "fact" being shown is that our basic cognition is "like" a computing system, the burden of proof is a lot higher. When we make statements about how people are "just" doing compression in their brains, that's incredibly dehumanizing. Be careful with this line of thinking, it doesn't lead to nice places.


I mean, surely you can just apply your very own experience to the matter, right? Try this: remember what you were doing at this time last week, down to the minute, as though you were currently experiencing it. Every sight, every sound, every movement and every sensation.

Can you easily do that? What were your eyes focusing on as they moved 2-3 times per second on average? Can you remember each movement? But okay, you say, you weren't actually learning or "focused" or whatever at that point, so that doesn't count. Alright, try this: recall the times table you learned in grade school. Go ahead and say them out loud in whichever order you memorized them in.

When you were doing that, what was happening in your mind? Did you picture that crusty old laminated times table that was up on the wall across from you, hearing the sounds of the other children writing and moving in their seats? Did you feel the hard plastic seat under you as you desperately tried to instill that 7x12 was 84?

Or did you just remember the abstract concept?


Distilled and reconstructed with hallucinations.


So good memory just means you have a good compression algorithm?

I guess if you remember "Sam said a rude thing today about my work" instead of exactly what was said it is very hard to clear up afterwards, and you will hallucinate a lot more rudeness than was actually there.

If the compression algorithm for how such memories are stored is different in every person, it explains a lot of misunderstandings.

Edit: However not many would say that good memory means you are smart, some people remember a lot of things very well but are still dumb in practice. So I don't think that compression is enough to define intelligence.


On this, typically people have terrible memories, especially of one-off events. This is why we tend to learn by repetition and why eyewitness testimony is filled with faults.


This is anecdotal, but the smartest people I know also seem to have the best memories of the people I know; maybe those things are related, or it's just a coincidence in my circles.

But it wouldn't surprise me that a good memory can make you a lot more capable at other tasks.


So is our brain. “Model” by definition is an approximation of something else (or the thing we’re modeling… it’s a very overloaded term all right).

But if you think you can use this “realization” to dismiss AI or claim what it can or cannot do… you’re missing the forest for the trees.


I chatted briefly to overtone.ai a few weeks ago, heading back to the hotel from a conference. What they do is train an existing LLM to detect things about the text (overtones, I suppose you might say). What's interesting in this context is that they train the AI using an English corpus, but once trained, the AI is able to detect the same traits in other languages.

This sounds quite different from compression.


Nobody is arguing that the use cases are the same. In the end you can't even chat with gzip (although you could with its predictor).

The thing is that building the predictor is almost the same task for compression and for an LLM. Of course the goals and the tradeoffs taken are different. The paper shows this analogy.

ChatGPT et al use structured prediction to simulate intelligence. Building the predictor is fancy lossy compression.

Questions arise over whether lossy compression of works you don't hold the rights to is legal or not. If I MP3-encode a lossless recording, we currently think that is not legal. With LLMs this is not entirely clear yet.


As it happens, I know zip. I ported it to Linux back in the 0.95 days; I think the code was called Info-ZIP in those days. Chatting with its predictor is a fanciful description, to say the least.

I've also read the transformers paper mentioned in the tweet.

Of course an LLM is on some level similar to a compression system, and on some level it's also just high and low voltages on some integrated circuits. Saying "just" glorified compression isn't something I'll believe without good arguments, though.


I once read that intelligence is compression (or similar). An abstract way of thinking about it, but if true, then AI is on the right track.


If language models and CNNs are already creating better compression algorithms than humans, aren't they then smarter than us?


In this analogy, the language model IS a man-made compression algo, so definitely not


Ok. Let me know when WinRAR learns to speak.


Yes. AI is just compression. So is understanding, fundamentally. That's literally what it's about.


If that were the case, then LLMs would know how to do math, but they don't.


Popular chat-bot LLMs weren't trained to do math, they were trained to predict the next token in a conversational setting. The fact that they can do any math at all is a miracle considering that they have absolutely no thought process; they're just predicting the next token, one after the other, until some threshold is met and they stop producing output.

Framing it another way, it's sort of like if you asked your calculator to have a conversation with you and it actually had a fairly decent go of it. Sure, it wasn't grammatically correct a lot of the time and it struggled quite a bit. But it wasn't at all designed to speak conversationally, so the fact that it could respond at all should be rather impressive.


They can write proofs, so they can do math. If you mean they're bad at arithmetic, a big part of that is improper tokenization. With digit-level tokenization, even small transformers can learn arithmetic and generalize to longer inputs.
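To illustrate the tokenization point (made-up token boundaries, not any real tokenizer's output): a BPE-style tokenizer may split numbers into irregular multi-digit chunks, while digit-level tokenization gives every number the same uniform structure, which makes column-wise arithmetic much easier for a small model to learn.

    number = "123456789"

    bpe_like = ["123", "45", "6789"]  # hypothetical multi-digit chunks; boundaries differ from number to number
    digit_level = list(number)        # ['1', '2', ..., '9']: uniform, one digit per token

    print(bpe_like)
    print(digit_level)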


They can do math - they can write code to be executed. It's not much different than me or you doing any maths more complex than reciting the times table.


Liked this view always (look up the Hutter Prize rationale), but I think it needs to be accommodated in a general perception/action loop that optimizes a lower-level fitness/utility/reward (for instance an inner sense of pleasure/pain).


Isn’t childbirth a glorified decompression of parents’ DNA?

This is a tweet level of discourse



"White-Box Transformers via Sparse Rate Reduction" (2023) ; https://arxiv.org/abs/2311.13110 https://scholar.google.com/scholar?cites=1536453281127121652... :

> Abstract: In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is at https://github.com/Ma-Lab-Berkeley/CRATE

"Bad numbers in the “gzip beats BERT” paper?" (2023) https://news.ycombinator.com/context?id=36766633

"78% MNIST accuracy using GZIP in under 10 lines of code" (2023) https://news.ycombinator.com/item?id=37583593


It seems this X post is from someone trying to claim LLMs are copyright violations.

The argument that LLMs are only a new compression algorithm is nonsense.


"Humans are just glorified monkeys"...


tl;dr Tweet removes any potentially boring nuance from the research in favor of sensationalism and upvotes.


Yep. Reductionism and perfectionism to attack something. It's just a {blank}, or it can only detect cancer 98% of the time. I'd like to coin the Hot Tub Time Machine argument, but that's probably already a thing.


Is generalization also a form of compression?


compression is just glorified flow of electrons. Still useful nonetheless.



