> Why unify information theory and machine learning? Because they are two sides of the same coin. In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.
* In compression, gzip is predicting the next character. The model's prior is "recently seen runs of characters will likely recur". This prior holds well for English text, but not for H.264 data. (A minimal sketch follows this list.)
* In ML, learning a model is compressing the training data into a model + parameters.
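To make the prediction-is-compression point concrete, here is a minimal Python sketch (an illustration only, not gzip's actual LZ77+Huffman machinery): an adaptive order-2 character model predicts each next character, and the ideal code length is just the summed surprisal. The better the prediction, the fewer the bits.

```python
# Toy predictive compressor: total ideal code length = sum of -log2 p(next char).
# This is not how gzip works internally; it only illustrates prediction = compression.
import math
import random
from collections import defaultdict, Counter

def ideal_bits(text, order=2):
    counts = defaultdict(Counter)
    bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]
        seen = counts[context]
        total = sum(seen.values())
        p = (seen[ch] + 1) / (total + 256)   # Laplace smoothing over a 256-symbol alphabet
        bits += -math.log2(p)
        seen[ch] += 1                        # adapt the model as we "encode"
    return bits

random.seed(0)
english = "the cat sat on the mat and the dog sat on the rug. " * 200
noise = "".join(chr(random.randrange(33, 123)) for _ in range(len(english)))
print(ideal_bits(english) / len(english))   # well under 8 bits/char: the prior fits
print(ideal_bits(noise) / len(noise))       # close to 8 bits/char: the prior does not
```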
It's not a damning indictment that current AI is just compression. What's damning is our belief that compression is a simpler/weaker problem.
Completely agree. Every model/dimensionality-reduction is in some sense compression, isn't it?
We are taking a problem space with more parameters and detail and reducing it to a solution space with fewer parameters and possibly less detail (depending on whether the solution is exact or not, which maps onto whether the compression is lossy or not).
If I take the set of all points that are equidistant from a given point and line, that is an infinite set. But I can compress it down to three real numbers if I know that set can be represented as y = ax^2 + bx + c, and no one says "quadratic equations are just compression".
A lot of people also try to get mileage out of the idea that the compression is _lossy_. That's an intellectual dead end in the same way: lots of useful models are a lossy compression.
I don’t think the statement is annoying to some because they see it as a damning indictment of AI.
It’s because it’s not literally true. If it were literally true, a zip file would be ChatGPT, but it is not. The paper was named in an intentionally provocative way.
What they really mean is, at the heart of these transformer models, compression is the fundamental principle or goal.
This trend in naming papers has evolved over time. It wasn't always this way.
Some would say a provocative title is thought-provoking and drives a wider interest.
Others would say it’s a research paper and things in it should be literally true.
Compression and prediction are equivalent, and both measure some function of intelligence and knowledge. Current LLMs are heavy on the knowledge side, being trained on ~all the text.
I haven't read MacKay's work, so maybe this is naive, but I think the belief is that compression is a deterministic endeavor, whereas AGI may not be. The ability to recall information is incredibly useful; the ability to do so in convenient and flexible ways doubly so. The fact that the complexity scales with the GPT's ability to interpret means that compression is not necessarily a simple problem, like you say.
However, if intelligence involves nondeterministic traits, i.e. something in the vein of creation of the present, AGI could be a significantly different problem to solve than compression. I think there's at least an intuition that this is the case, which explains the belief that compression is a simpler/weaker problem.
As an aside, I'm currently unsure of my position on this.
I'm convinced that deterministic AI will never fully replicate human intelligence. I will attempt to explain my reasoning.
I can say with absolute certainty that I have subjective experience. If my behaviour is deterministic, I can imagine a philosophical zombie version of me that behaves the same but doesn't have subjective experience. That zombie, using the exact same reasoning as me, down to individual particle interactions, will (incorrectly) determine with absolute certainty that it has subjective experience and is not a philosophical zombie. Its reasoning, which is my reasoning, is thus flawed. Therefore, in order to believe my behaviour is deterministic, I must doubt my ability to reason.
(Some claim that philosophical zombies aren't possible to imagine. But I can imagine them, so if that's impossible then my reasoning is still flawed just for a different reason.)
I believe the same argument applies even if I assume my behaviour is non-deterministic but follows a probability distribution which is a function of the past -- the philosophical zombie has the same probability of reasoning that it has subjective experience as I do, and the rest of the argument is similar. Thus, I believe my behaviour is not only non-deterministic, but mathematically ineffable, and thus outside the realm of what computers can do.
If you have subjective experience which works similar to mine, maybe you can follow my reasoning. If you don't, the argument may appear to be nonsense due to the ineffability of subjective experience.
Your “experience” of subjectivity is really just an illusion.
Your subconscious mind makes very deterministic decisions milliseconds before your conscious mind is even made aware of them; it feels like you're deciding consciously, but based on many studies you're actually, absolutely not.
And critically, those deterministic, predictable, mechanical responses still give rise to conscious experience of self (Which for me brings into doubt the possibility of p zombies).
Actually, due to parallel execution they are not running deterministically. Even if the temperature is zero some randomness is left in. You can run a model deterministically, but this is incredibly slow.
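For anyone wondering where that leftover randomness comes from: floating-point addition is not associative, so the order in which a parallel reduction happens to sum things changes the result. A minimal single-machine illustration (plain Python, not an actual GPU kernel):

```python
# Summing the same numbers in different orders gives different answers,
# which is why parallel (order-varying) execution is nondeterministic.
import math

vals = [1e16, 1.0, -1e16, 1.0] * 1000   # exact sum is 2000.0

print(sum(vals))           # left-to-right order
print(sum(sorted(vals)))   # a different order, a different answer
print(math.fsum(vals))     # exact summation for reference: 2000.0
```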
I love David MacKay's brilliant work on the Dasher text input system, which draws deeply from his work on information theory -- imagine Dasher integrated with an IDE and code search and Copilot and language model!
"Writing is navigating in the library of all possible books." -David MacKay
We just allocate more shelf space to the more probable letters.
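A toy sketch of that shelf-space idea (nothing like the real Dasher code, just the principle): divide the available interval among candidate next letters in proportion to their probability under a simple bigram model, so likely continuations get the big boxes.

```python
# Allocate "shelf space" to each possible next letter in proportion to its
# probability under a bigram model trained on a tiny corpus.
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog " * 50
bigrams = Counter(zip(corpus, corpus[1:]))

def shelf(prev_char, height=1.0):
    options = {nxt: n for (cur, nxt), n in bigrams.items() if cur == prev_char}
    total = sum(options.values())
    return sorted(((nxt, height * n / total) for nxt, n in options.items()),
                  key=lambda pair: -pair[1])

for letter, size in shelf(" "):          # what can follow a space?
    print(repr(letter), round(size, 2))  # 't' (for "the") gets the biggest box
```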
Why isn't Dasher built into every operating system and mobile phone?
DonHopkins on May 18, 2018, on: Pie Menus: A 30-Year Retrospective: Take a Look an...
Dasher is fantastic, because it's based on rock solid information theory, designed by the late David MacKay.
Here is the seminal Google Tech Talk about it:
>Dasher is a zooming predictive text entry system, designed for situations where keyboard input is impractical (for instance, accessibility or PDAs). It is usable with highly limited amounts of physical input while still allowing high rates of text entry.
Ada referred me to this mind bending prototype:
D@sher Prototype - An adaptive, hierarchical radial menu.
>( http://www.inference.org.uk/dasher ) - a really neat way to "dive" through a menu hierarchy, or through recursively nested options (to build words, letter by letter, swiftly). D@sher takes Dasher and gives it a twist, making slightly better use of screen real estate.
>It also "learns" your typical usage, making more frequently selected options larger than sibling options. This makes it faster to use, each time you use it.
One important property of Dasher is that you can pre-train it on a corpus of typical text, and dynamically train it while you use it. It learns the patterns of letters and words you use often, and those become bigger and bigger targets that string together so you can select them even more quickly!
Ada Majorek has it configured to toggle between English and her native language so she can switch between writing email to her family abroad and to co-workers at Google.
Now think of what you could do with a version of dasher integrated with a programmer's IDE, that knew the syntax of the programming language you're using, as well as the names of all the variables and functions in scope, plus how often they're used!
I have a long term pie in the sky “grand plan” about developing a JavaScript based programmable accessibility system I call “aQuery”, like “jQuery” for accessibility. It would be a great way to deeply integrate Dasher with different input devices and applications across platforms, and make them accessible to people with limited motion, as well as users of VR and AR and mobile devices.
How should we unify these things? Most of it seems basically the same - we maximize log-likelihoods instead of calling it minimizing surprisal - but it's clearly the same thing. Are there any other ways to integrate these things together?
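One concrete way to see the "same thing, different vocabulary" point: the negative log-likelihood a model is trained to minimize is exactly the surprisal, i.e. the ideal code length, just measured in nats instead of bits.

```python
# Cross-entropy training loss and ideal compressed size are the same quantity
# in different units: nats (natural log) vs bits (log base 2).
import math

p_model = {"a": 0.5, "b": 0.25, "c": 0.25}   # some predictive model
data = "aabacbaa"

nll_nats  = sum(-math.log(p_model[x]) for x in data)    # what ML minimizes
code_bits = sum(-math.log2(p_model[x]) for x in data)   # what a compressor pays

print(nll_nats / math.log(2), code_bits)   # identical values
```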
Not that I think the current AI is as life-changing as is purported, but this comparison is terrible. Almost all complex software is made up of a bunch of other previous simpler technologies.
I think the point was that the current transformer model paradigm will never be able to reach AGI, no matter how far you take it. It needs something fundamentally more to be able to do that. But maybe as you say, that something will be built on top of transformer technology.
I love how I've seen Ilya talk about it. If we could find the shortest program to reproduce the training set, that would be optimal compression. But it's an intractable problem, we can't even come close, there's just no way to come at it.
But with deep learning we can instead find a circuit that approaches reproducing the training set.
This is lossy compression. There's nothing "just glorified" about it though; the result is astounding.
A more appropriate takeaway might be that sufficient compression is mind-bendingly more powerful than intuition might otherwise guess.
Does calling it "just glorified" guide any intuition that in order to compress amazon reviews a neural net is gonna have weights that correspond to sentiment? Does it tell you that such compression also ends up producing an artifact that can be put to work in a generative framework? And that it'll be a very useful generative framework because such compression required weights that correspond to all sorts of useful ideas, a compression of something more meaningful than just text?
Calling it "just glorified X" is clickbait. It's compression alright, and it's either 1) also whole lot more, or 2) compression is a whole lot more wild and impressive than you thought, or both.
The "sentiment" weights are just the supervised classification part on top of the "compression" piece. An autoencoder is closer to pure compression. SVD is compression as well. SVD is also useful to solve some equations.
It hallucinates too much; there's no reflexivity, no critical reasoning. Sure, nice prompts can give it some structure, and with a lot of pre- and post-processing (i.e. RAG and other ways of forcing it to pick from a list) it's good for information retrieval.
We also used to say that compression is the science of prediction.
Scientific models are also tools used to predict the results of experiments.
So if AI is able, through « compression », to build models that help us predict experiments and understand the world better, it fully deserves the « intelligence » label.
Intelligence is adaptive problem-solving ability. One way to get there is to observe the problem, build a model of it, come up with possible strategies, test them through simulation, and pick the best one.
Current GPT with good prompts is a very nice puzzle piece of this, but not there... yet.
This wasn't obvious to me a while ago, but I've come around to it. To me the important thing is that it's not only compression, it's lossy compression.
Models (any kind) are basically an imperfect approximation of some system, and that definition is precisely what's identified here.
You can demonstrate it by assuming some sequential process that gets an "AI" like an LLM to generate as much content as it can[1], then train a new generation model on this content. Then use the new model to generate as much content as it can, train a third generation model, and so on.
Since each generation may not produce content containing every token it has been trained on, it stands to reason that eventually some tokens, especially low-probability tokens, will simply be left out of the generated content. The next model will therefore lose those tokens, and they will never be part of the content it generates, and so on.
After enough generations of this process, you eventually end up with a model with a single, or no, tokens it can use to generate text.
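A toy simulation of that collapse (a hedged sketch only; real training dynamics are far richer than refitting an empirical token distribution):

```python
# Each "generation" is trained only on samples from the previous one.
# Tokens that happen not to be sampled get probability zero and never return.
import numpy as np

rng = np.random.default_rng(0)
vocab = 50
probs = np.full(vocab, 1.0 / vocab)   # generation 0: all 50 tokens alive

for gen in range(30):
    samples = rng.choice(vocab, size=200, p=probs)   # generate "content"
    counts = np.bincount(samples, minlength=vocab)
    probs = counts / counts.sum()                    # train the next model on it
    print(gen, "surviving tokens:", int(np.count_nonzero(probs)))
```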
The problem is that the models are trained on so much information that they're effectively models of the textual content of our civilization. If we hand off the reins to these systems, they may work well for a while, and may even seem to produce novel ideas (which are secretly just old ideas in combinations we haven't seen before), but they'll also produce content which will inevitably be used to train later models. As the content these models produce starts to dominate the ratio of human text to model text, there will be less of a reason to preserve the human text and we'll end up with the scenario above.
Things may work well for a while, perhaps a long time, but even if we don't end up with models trained on one token, we will end up with fewer and fewer ideas-as-text represented within them. Civilization will stagnate, the singularity will not commence, and things will slowly regress.
Of course this all presumes research stops, and somehow LLMs become religion and we turn society over to them. Which of course isn't what's going to happen. Is it?
I agree. However, in that scenario, are people fully stopping their content generation? Because if not, the pool will grow larger and the problem you described would still be there, but to a lesser extent.
You could just send a lot of bitmap files. Or you could save bandwidth and zip them before sending.
Or you could integrate some image-specific compression into the file format, as in Motion JPEG. And in MPEG-2/H.264/H.265 you supercharge this with temporal compression: not just adjacent pixels, not just blocks within the frame, but also pixels and blocks from adjacent frames are used to predict each pixel.
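Here is a minimal sketch of just that temporal-prediction step (nothing like real H.264, which uses motion-compensated block prediction): predict each frame as "same as the previous one" and store only the residual, which compresses far better than the raw frames.

```python
# Temporal prediction in miniature: raw frames vs frame-to-frame residuals.
import zlib
import numpy as np

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)   # a noisy first frame
frames = [frame]
for _ in range(30):                      # a slowly changing "video"
    frame = frame.copy()
    frame[rng.integers(0, 240), :] ^= 1  # tweak one row per frame
    frames.append(frame)

raw = b"".join(f.tobytes() for f in frames)
residuals = frames[0].tobytes() + b"".join(
    (frames[i] - frames[i - 1]).tobytes() for i in range(1, len(frames)))

print(len(zlib.compress(raw)), len(zlib.compress(residuals)))   # residuals are far smaller
```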
And now think about not sending video, just a single initial jpeg, some face shape data, and now a continuous stream of position data of a dozen or so points on your face and the facial movements are reconstructed. (Nvidia research project from last year)
And now think about no longer sending images at all, just coordinates for points in latent space (aka very abstract image descriptions) and a super-fast diffusion model "generates" the right frames on the fly.
Where does "compression" end, where does "AI" start?
Ultimately, the marketing people decide.
Don't get hung up on the term "AI". "AI" is a bullshit hype buzzword. Calling it such serves no scientific or practical purpose; it is solely meant to impress venture capital.
It's fun to see people endlessly make this point while providing no mechanisms for how it works. We might say "it surely must be this way" but we haven't shown it affirmatively in any sense. Until a mechanism is shown, I think it's really irresponsible to make statements like this. You really have no clue what it's doing.
We know memories are essentially recreated when remembered, we know that eye-witness testimony is spectacularly prone to hallucinatory-details-in-the-llm-sense, we know that focusing attention on one aspect of a video will cause participants to simply not see remarkably obvious and noteworthy details of that video. No, I think it's safe to say that it's been shown pretty affirmatively.
Wings were inspired by birds, and although they mechanically function in a manner entirely unlike a bird's wing, the aerodynamics of aircraft wings fully explain why birds can fly.
Bird wings are actually quite a bit more sophisticated than aircraft wings. Don't let that distract you from your point about belittling human cognition though.
Did he say otherwise? The point is... who cares? Planes fly, and they do so very, very well. They even fly faster than any bird will ever be able to. So who cares that bird wings are "quite a bit more sophisticated"? Sophistication is not and never was the goal.
We should care because we're all meat. Do you want to live in a society where people are devalued because a computer program approximates memory (badly)? Humans ought to be treated exceptionally regardless of how well a computer can do things. Computers are already ridiculously better at arithmetic than us; that doesn't devalue human rights. If you think you've got a system close to cognition, you'd better be god damn sure, because the alternative is an enormous amount of human suffering.
Yes, that was exactly the point. Technology developed in imitation of a much more sophisticated and incompletely understood animal example, for which a more complete theory was developed. And that more complete theory eventually became sophisticated enough to comprehensively explain the biological artifact.
(And in spite of that, we still keep birds around, because not everything of value about birds is captured by an aerodynamic analysis of a bird's wing. Obviously.)
It does feel like if the "fact" being shown is that our basic cognition is "like" a computing system, the burden of proof is a lot higher. When we make statements about how people are "just" doing compression in their brains, that's incredibly dehumanizing. Be careful with this line of thinking, it doesn't lead to nice places.
I mean, surely you can just apply your very own experience to the matter, right? Try this: remember what you were doing at this time last week, down to the minute, as though you were currently experiencing it. Every sight, every sound, every movement and every sensation.
Can you easily do that? What were your eyes focusing on as they moved 2-3 times per second on average? Can you remember each movement? But okay, you say, you weren't actually learning or "focused" or whatever at that point, so that doesn't count. Alright, try this: recall the times table you learned in grade school. Go ahead and say them out loud in whichever order you memorized them in.
When you were doing that, what was happening in your mind? Did you picture that crusty old laminated times table that was up on the wall across from you, hearing the sounds of the other children writing and moving in their seats? Did you feel the hard plastic seat under you as you desperately tried to instill that 7x12 was 84?
So good memory just means you have a good compression algorithm?
I guess if you remember "Sam said a rude thing today about my work" instead of exactly what was said it is very hard to clear up afterwards, and you will hallucinate a lot more rudeness than was actually there.
If the compression algorithm for how such memories are stored is different in every person, that explains a lot of misunderstandings.
Edit: However, not many would say that a good memory means you are smart; some people remember a lot of things very well but are still dumb in practice. So I don't think that compression is enough to define intelligence.
On this, typically people have terrible memories, especially of one off events. This is why we tend to learn by repetition and why eyewitness testimony is filled with faults.
This is anecdotal, but the smartest people I know also seem to have the best memories; maybe those things are related, or it's just a coincidence in my circles.
But it wouldn't surprise me that a good memory can make you a lot more capable at other tasks.
I chatted briefly with overtone.ai a few weeks ago, heading back to the hotel from a conference. What they do is train an existing LLM to detect things about the text (overtones, I suppose you might say). What's interesting in this context is that they train the AI on an English corpus, but once trained, the AI is able to detect the same traits in other languages.
Nobody is arguing that the use cases are the same. In the end you can't even chat with gzip (although you could with its predictor).
The thing is that building the predictor is almost the same task for compression and for an LLM. Of course the goals and the tradeoffs taken are different. The paper shows this analogy.
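To make the "chat with its predictor" aside concrete: the same next-character model an entropy coder would be driven by can also be sampled from. A toy bigram version (obviously not gzip's actual LZ77 machinery):

```python
# The predictor does double duty: for compression you would feed its
# probabilities to an arithmetic coder; for "chat" you just sample from them.
import random
from collections import Counter, defaultdict

corpus = ("it is the same model either way: compress by coding the prediction, "
          "or generate by sampling from it. ")
model = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    model[cur][nxt] += 1

random.seed(0)
ch, out = "i", ["i"]
for _ in range(80):
    nexts, weights = zip(*model[ch].items())
    ch = random.choices(nexts, weights)[0]
    out.append(ch)
print("".join(out))   # gibberish, but with the corpus's local statistics
```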
ChatGPT et al use structured prediction to simulate intelligence. Building the predictor is fancy lossy compression.
Questions arise over whether lossy compression of copyrighted material without permission is legal or not. If I MP3 a lossless recording, we currently think that is not legal. With LLMs this is not entirely clear yet.
As it happens, I know zip: I ported it to Linux back in the 0.95 days; I think the code was called Info-ZIP in those days. Chatting with its predictor is a fanciful description, to say the least.
I've also read the transformers paper mentioned in the tweet.
Of course an LLM is on some level similar to a compression system, and on some level it's also just high and low voltages on some integrated circuits. Saying "just" glorified compression isn't something I'll believe without good arguments, though.
Popular chat-bot LLMs weren't trained to do math, they were trained to predict the next token in a conversational setting. The fact that they can do any math at all is a miracle considering that they have absolutely no thought process; they're just predicting the next token, one after the other, until some threshold is met and they stop producing output.
Framing it another way, it's sort of like if you asked your calculator to have a conversation with you and it actually had a fairly decent go of it. Sure, it wasn't grammatically correct a lot of the time and it struggled quite a bit. But it wasn't at all designed to speak conversationally, so the fact that it could respond at all should be rather impressive.
They can write proofs, so they can do math. If you mean they're bad at arithmetic, a big part of that is improper tokenization. If you do digit level tokenization, even small transformers can learn arithmetic and generalize to longer inputs.
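A toy illustration of the tokenization point (the chunking below is made up, not any real tokenizer's output): if a number arrives as arbitrary multi-digit chunks, column-wise arithmetic is hard to learn; one digit per token lines the columns up.

```python
# Hypothetical BPE-style chunking of a sum vs digit-level tokenization.
def chunky_tokenize(s, merges=("12", "345", "67", "89")):   # made-up "learned" merges
    out, i = [], 0
    while i < len(s):
        for m in merges:
            if s.startswith(m, i):
                out.append(m)
                i += len(m)
                break
        else:
            out.append(s[i])
            i += 1
    return out

print(chunky_tokenize("12345 + 6789"))   # ['12', '345', ' ', '+', ' ', '67', '89']
print(list("12345 + 6789"))              # one digit per token, columns align
```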
They can do math - they can write code to be executed. It's not much different than me or you doing any maths more complex than reciting the times table.
I've always liked this view (look up the Hutter Prize rationale), but I think it needs to be accommodated within a general perception/action loop that optimizes a lower-level fitness/utility/reward (for instance, an inner sense of pleasure/pain).
> Abstract: In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT. Code is at https://github.com/Ma-Lab-Berkeley/CRATE.
Welp. Reductionism and perfectionism to attack something: it's just a {blank}, or it can only detect cancer 98% of the time. I'd like to coin the Hot Tub Time Machine argument, but that's probably already a thing.