Coconut by Meta AI – Better LLM Reasoning with Chain of Continuous Thought? (aipapersacademy.com)
362 points by TaurenHunter 29 days ago | 156 comments



Interesting. Due to its emphasis on BFS, it's the opposite of something I've been trying (I named it the "Tree of failures").

My assumption was that humans don't try a breadth-first approach. Instead, we split a task into a short-step (selected by instinct and intuition) and a long-step that summarizes/stores the next steps. The key idea is to recursively evaluate a task as a short-step (high-res - gets executed) and a long-step (lower-res - is just stored), until it succeeds or fails. If it fails, we must walk back, keeping a summarized tree of failures in state so that we can exclude them in future selections.

The effectiveness of instinct has a steep fall-off at longer distances - so it's better not to chart out a whole series of steps. When we do BFS, we drive down the value of instinct in favor of compute. I guess ultimately, it depends on the type of problem you want to solve.
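To make it concrete, here's a minimal sketch of how I'd prototype the failure-tree part (my reading only: `expand`, `is_goal`, and `instinct` are caller-supplied placeholders, and the short-step/long-step split is collapsed into an instinct-ordered expansion):

    # Depth-first descent guided by a cheap "instinct" ordering, remembering
    # failed partial paths so they're excluded from future selections.
    def solve(state, expand, is_goal, instinct, failures=None, path=()):
        if failures is None:
            failures = set()                        # the summarized tree of failures
        if is_goal(state):
            return path
        # expand(state) yields (step, next_state) pairs; instinct orders them
        for step, next_state in sorted(expand(state), key=instinct):
            new_path = path + (step,)
            if new_path in failures:                # known dead end, skip it
                continue
            found = solve(next_state, expand, is_goal, instinct, failures, new_path)
            if found is not None:
                return found
            failures.add(new_path)                  # walk back, record the failure
        return None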

Reach out to me if you want to prototype it with me.


I feel humans do something in between, maybe a bit like A* would do sometimes. I wouldn't call it A* because of the lack of a consistent heuristic and also the lack of a strictly numeric evaluation, but it's in between DFS and BFS for sure (as is every tree search algorithm?).

We go deep while we think it's a good lead, because so far things make sense and it'll be less work, but at some point we start questioning our decisions early in the descent and try alternatives.


You may find Prioritized Grammar Enumeration an interesting in-between DFS/BFS algorithm:

https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...


I think the problem with long chains of steps on their own (without the bfs stuff) is that your failure probability quickly grows to unreasonable levels.

Basically, if each step has a 97% chance of being completed correctly and your task requires 10 steps one after the other, the chance of success falls to 0.97^10 ≈ 74%.
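For illustration, the fall-off with chain length (97% reliability per step assumed):

    # probability the whole chain succeeds if every step is 97% reliable
    for n in (1, 10, 50, 100):
        print(n, round(0.97 ** n, 3))   # prints 0.97, 0.737, 0.218, 0.048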

If I understand correctly, part of the point of the BFS is to throw compute at it, in order to lower the failure rates. Kind of a "run many times in parallel and pick the best one". This can be effective, but also quite expensive, as seen in the costs OpenAI had to pay for their ARC-AGI benchmarking runs.


Your "Tree of failures" approach aligns with how natural cognition seems to work at the edge of comprehensibility. Rather than exhaustively searching (BFS), we use instinct for immediate steps while maintaining a lower-resolution model of longer-term possibilities. The key insight about storing failures rather than successes is particularly interesting - it's more efficient to remember what doesn't work and let patterns emerge naturally from the remaining space.

This maps to what I've been exploring with edge cognition and semantic anchoring - using fast set operations to quickly eliminate known bad paths (your failure tree) while allowing the system to explore promising directions using more expensive operations only when needed.

The instinct fall-off you describe mirrors our observation about the relationship between computational load and pattern recognition. As distance increases, we need more efficient ways to prune the search space rather than trying to maintain high-resolution understanding throughout.

My gut says optimizing on the amount of compute used to do the search (and the inference) is maybe something worth exploring.


Reminds me of what plandex does. https://plandex.ai/ It already does the automatic "does this need splitting into subtasks, or can it be solved immediately" processing.


I don't get why you need tree search at all? What does it give you over a pure LLM trained to do CoT in a tree-like manner? If the context window's long enough, it can generate the reasoning-tree just by pure next-token prediction, and rather than BFS, it can guide the tree search with its own value function (which is part of the LLM itself) instead of sticking to hard algos like BFS and DFS.

By the way, BFS sounds like it will give you thorough results, at the cost of increased compute. Useful for beating benchmarks, but probably causes marginal improvement for massively improved compute.

Still, the improved quality could be meaningful, if it's used for generating training data for Llama4


Tree search is natural when you want a path to navigate, so it does fit a sequence of interactions in a conversation too.

I agree that both DFS and BFS are likely awful[^0], but a more informed approach can probably do better[^1]. Also, at some point when generating the conversation/reasoning tree through token prediction, you need to choose which of the possible conversations you are going to keep extending/generating, which maps precisely to choosing which node to expand in tree search. I'd argue instead that everything ends up looking like a search algorithm from some angle; at least it'll be the case for anyone who has studied it more deeply.

I'll go even further and claim that Tree Search is Complete as for every problem there's a solution space that can be navigated with a Tree Search Algorithm[^2]. I used to think that you could walk down the space of provable things, but now in the LLM hype days it seems you only need to walk the space of conversations that you can generate.

---

[^0] with DFS always at risk of giving obnoxiously long answers, or not terminating if there are loops or spirals
[^1] probably through metadata coming from latent variables meaningful for judging a conversation (certainty, ~branching size of a reasonable conversation, whether there are open questions left)
[^2] Even if that is done poorly, as on combinatorial problems. Imagine a sudoku where you only check the rules once you fill all the cells.


The classic thing people say is "asking the right question" gets you half way there. Your approach sounds like something I call "getting to No" for a problem. It's sort of a combination of "getting to know" and the opposite of the salesman's "getting to Yes". When it works, it's the fastest way to prune off obligations.

The goal is to figure out why some particular problem: isn't really a problem, doesn't need to be solved, can't be solved that way, can't really be solved (because of physics or it's really a different problem). As you define the problem better, you can rule each one out to find the "real" problem, the one you CAN solve, and at least one path forward. There's still many ways that it might not be the optimal path, but you know roughly how to get to somewhere better. It also trains you to see around obstacles to success.

I've found that some of the best work I've done (especially on acquisitions) was in defining why NOT to do something that looked like a good idea (or particularly interesting to work on) from the outset, but was destined to fail or required unknown HW technology. Frankly, looking >5 years out feels like a coin flip, because some other competing technology could come along before you can get to production.


that's more fit for agents, no?


You're right that it's technically orthogonal to what's in the paper. I was trying to model the "reasoning process", which has general applicability depending on how/where it's implemented.


How do you understand instinct?

I bought a new SSD for an old laptop to avoid buying a new one (the x230 has an amazing keyboard), but left for another country for Christmas. My intuition told me to take it with me, but logical sense said there would be no time for such things as moving the OS to a new drive.

My flight back to the country I work in got cancelled due to fog and I ended up spending a week longer at my in-laws' place, with plenty of free time. The new 512GB drive would have helped with my studying, giving plenty of space for school VMs.


Paper: https://arxiv.org/abs/2412.06769

The link is in the OP, hidden away in an image caption for some reason.


So is the big improvement here simply skipping the unembedding/embedding step for internal thoughts? Or is it mainly in the training methods to teach the CoT and how to switch between "latent thought" and text output?

It's really interesting that a fixed number of "latent thoughts" performed as well as a binary classifier! I didn't expect that at all; the way OpenAI talks about CoT, it seems the ability to let it "keep thinking" lets them continually score higher on benchmarks while throwing eye-watering amounts of compute at inference.


It mentioned not penalizing/rewarding the model for thoughts, only rewarding the answer after the thought. I am curious how backpropagation works then.


The researchers leverage existing language Chain-of-Thought data, where each sample consists of a question, reasoning steps, and the final answer. At stage 0, the model does not generate any thought tokens, and is just trained to yield the reasoning traces and correct answers for the Chain-of-Thought samples. In each subsequent stage, we remove one more reasoning step from the sample and add thought tokens in its place. In the illustration above, a single thought token is added in each stage, instead of a single reasoning step, but this is controlled by a hyperparameter ‘c’.
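Roughly, the per-stage data construction might look like this (a sketch of one reading, not the authors' code; the "<thought>" strings just mark where continuous thoughts go, since they aren't real vocabulary tokens, and the supervised loss would only cover the remaining steps and the answer):

    def make_stage_sample(question, reasoning_steps, answer, stage, c=1):
        # stage 0 keeps the full language chain; stage k replaces the first k
        # reasoning steps with k*c latent-thought slots between <bot> and <eot>
        latent = ["<thought>"] * (stage * c)
        remaining = reasoning_steps[stage:]
        return [question, "<bot>", *latent, "<eot>", *remaining, answer]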


The tokens of the answer depend on the preceding continuous thought vectors, which you can backprop through in the usual way.
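A minimal sketch of what that looks like (hypothetical `model` mapping input embeddings to last-layer hidden states, `embed`, `lm_head`, and pre-tokenized `question_ids`/`answer_ids`; <bot>/<eot> markers omitted for brevity). The loss only touches the answer positions, but nothing detaches the thought vectors, so autograd carries gradients back through them:

    import torch
    import torch.nn.functional as F

    embeds = embed(question_ids)                          # [1, T, d]
    for _ in range(num_thoughts):
        h = model(embeds)                                 # [1, T_cur, d] hidden states
        embeds = torch.cat([embeds, h[:, -1:, :]], dim=1) # continuous thought fed back in
    h = model(torch.cat([embeds, embed(answer_ids)], dim=1))
    n = answer_ids.size(1)
    logits = lm_head(h[:, -n - 1:-1, :])                  # positions that predict the answer
    loss = F.cross_entropy(logits.flatten(0, 1), answer_ids.flatten())
    loss.backward()                                       # gradients flow through every thought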


I was waiting for something like that to happen! Next step - creating a human-language-free representation. I believe that once a group of llms can communicate only in embeddings tuned without any human text input, we're going to open a completely new chapter in AI.


This is actually something you probably want to avoid, if at all possible, because it makes it very hard to maintain insight into what the AIs are communicating among them. But that insight is crucial to stay informed about their progress in taking over the world, etc.


Yes! We should be extremely cautious about embracing approaches that make LLMs even more inscrutable. Having CoT, however unreliable it is, is nonetheless a huge boon for model evaluation that we should not give up so lightly.


Yeah, and it might not even gain us that much. It reminds me of how a zipped piece of JSON often comes close enough to bespoke binary serialization formats that it's not worth bothering with it.


How does a group help anything?

If you put 1000 dumb people together, they don't magically become smart?


If you put 1000 people who can't talk together they will create language so they can communicate. He's saying if we put LLMs together and don't force them to use English to communicate then they'll create their own language which may be superior for LLMs to English.

May be true but who knows.

I wonder if anyone has tested the Sapir-Whorf hypothesis for LLMs somehow by training them on different languages and comparing task performance. I guess it's too difficult to get a large equivalent training set in different languages.


Is everything in LLMs translated back to English before interpretation?

It works fairly well in my native language, I’m surprised to learn that things get translated back.


LLMs have no fixed internal representation - they barely have internal anything - so no, there is no translation.

But there's also no guarantee any particular query generalizes (vs is memorized), so it might only be able to answer some queries in some languages.


Got it. And since my native language is arguably one of the closest to English (Dutch), it works very well. But probably not as well for, say, Asian languages, which have completely different grammatical constructs.


It feels like an exercise in anthropomorphization to me.

Sapir-Whorf hypothesis is generally not considered to be reality. It makes intuitive sense but is wrong.

There are hours of podcasts with Chomsky talking about LLMs. The gist of which is that LLMs are extracting surface-level statistical structure of language that will be good for routine coding and not much else. It is easy to infer that Chomsky would believe this idea to be utter nonsense.

I believe even the idea of getting 1000 people together and agreeing to label a rock "rock", a tree "tree", a bird "bird" is not how human language works. Something that is completely counterintuitive.

Reading the paper, no one believes a hidden markov model is creating some kind of new thought process in the hidden state.

I could, though, have no idea what I am talking about with all this and have pieced together parts that make no sense, while this is actually a breakthrough path to AGI.


> There are hours of podcasts with Chomsky talking about LLMs

I'm not an expert, but it seems like Chomsky's views have pretty much been falsified at this point. He's been saying for a long time that neural networks are a dead end. But there hasn't been anything close to a working implementation of his theory of language, and meanwhile the learning approach has proven itself to be effective beyond any reasonable doubt. I've been interested in Chomsky for a long time but when I hear him say "there's nothing interesting to learn from artificial neural networks" it just sounds like a man that doesn't want to admit he's been wrong all this time. There is _nothing_ for a linguist to learn from an actually working artificial language model? How can that possibly be? There were two approaches - rule-based vs learning - and who came out on top is pretty damn obvious at this point.


What can you learn from something parroting data we already have?

Similarly, we are now finding that training on synthetic data is not helpful.

What would have happened if we invested 1/100 of what we spent on LLM on the rule based approach?


There is an old joke that AI researchers came up with several decades ago: "quality of results is inversely proportional to the number of linguists involved".

This has been tried repeatedly many times before, and so far there has been no indication of a breakthrough.

The fundamental problem is that we don't know the actual rules. We have some theories, but no coherent "unified theory of language" that actually works. Chomsky in particular is notorious for some very strongly held views that have been lacking supporting evidence for a while.

With LLMs, we're solving this problem by bruteforcing it, making the LLMs learn those universal structures by throwing a lot of data at a sufficiently large neural net.


> What can you learn from something parroting data we already have?

You can learn that a neural network with a simple learning algorithm can become proficient at language. This is counter to what people believed for many years. Those who worked on neural networks during that time were ridiculed. Now we have a working language software object based on learning, while the formal rules required to generate language are nowhere to be seen. This isn’t just a question of what will lead to AGI, it’s a question of understanding how the human brain likely works, which has always been the goal of people pioneering these approaches.


>Sapir-Whorf hypothesis is generally not considered to be reality. It makes intuitive sense but is wrong

Strong S-W (full determinism) might not be, but there's hardly a clear cut consensus on the general case.

And the whole "scientific field" is more like psychology, with people exchanging and shooting down ideas, and less like Math and Physics, so any consensus is equally likely to be a trend rather than reflecting some hard measurable understanding.

I'd say that the idea that S-W isn't reality to at least some degree is naive.


> Sapir-Whorf hypothesis is generally not considered to be reality.

This is true only in the strictest terms of the hypothesis, i.e. linguistic determinism. Language still encodes a lot of culture (& hence norms and values) in its grammar & diction—this isn't very controversial.

Granted, I don't think this is that related to the topic at hand. There's bias all over the decisions in how to train and what to train on; choice of language is just one facet of that.


Well, maybe not 1000 people, but to our knowledge the human brain is actually made of physically independent zones that barely communicate with each other, except with the zone that takes all the outputs together and tries to do something coherent with all the garbage.

Idk if this could work with LLMs, especially because all the brain zones are somehow specialized into something while two LLMs are just identical machines. But we also know that the specialization isn't that hardcoded: we know that people who lose half their brain (after a stroke) can still relearn things that were managed in the "dead" part.

I don’t know, please correct my errors, I was just thinking aloud to say that multiple independent agents working together may be how "intelligence" already works in the biological world, so why not for AIs?


> the human brain is actually made of physically independent zones that barely communicate with each other, except with the zone that takes all the outputs together and tries to do something coherent with all the garbage.

That sounds like bullshit. Do you have a source?


Because group estimation is superior to individual estimation: the phenomenon is called the wisdom of the crowds. When a group of people independently estimate something, individual errors tend to cancel each other out, leading to a surprisingly accurate collective result. This works because of:

Diversity of opinions: Different perspectives bring a range of estimates.
Independence: Errors aren't systematically biased as long as individuals estimate without external influence.
Error averaging: Overestimations and underestimations balance out when averaged.
Law of large numbers: More participants increase accuracy by minimizing random errors.

It was demonstrated by Francis Galton in 1906, where a crowd's average guess of a bull's weight was almost spot-on. (Estimates must be independent and reasonably informed for this to work.)
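A quick toy simulation of the averaging effect (assuming independent, unbiased guesses; the numbers are purely illustrative):

    import random

    truth = 1200                                   # say, the bull's weight in lbs
    guesses = [truth + random.gauss(0, 200) for _ in range(1000)]
    avg_individual_error = sum(abs(g - truth) for g in guesses) / len(guesses)
    crowd_error = abs(sum(guesses) / len(guesses) - truth)
    # avg_individual_error comes out around 160; crowd_error is typically under 15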


> If you put 1000 dumb people together, they don't magically become smart?

1000 is probably too high, but groups of people are in fact more intelligent than individuals (though for humans it is likely because recognizing a correct answer is easier than finding it in the first place)


Functional groups that work well together, share research and ideas openly, keep their best output, stay dedicated to realism, and focus more on problem solving than status display will be smarter. The group works like a filter which generates multiple solutions and selects, remembers, and abstracts the best.

Dysfunctional groups which do the opposite will be catastrophically stupid.

There have been plenty of dysfunctional groups in history.


depends on the circumstances. lin-manuel miranda can probably write a better musical by himself than a team of 20 people with equal input would.

also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age


> by himself than a team of 20 people with equal input would.

Sure, but the result would still be far better than the average of the output of the 20 individuals taken alone.

> also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age

It's always tempting to anthropomorphize these systems and conclude that what works for us would work for them, but yes we don't really know if it would bring anything to AI.


I wonder if there's research on this, like if you took a group of individuals who scored the same on an IQ test, then got them to solve one together, how would the score improve?

Is there a way of selecting people to cover each other's intellectual blind spots?


Isn't that the very case behind the "wisdom of crowds" thing?


Looking at the current state of democracies around the world, my hopes are not on "wisdom of the crowds".


If you think the democracies are doing bad, you should see the autocracies!


You mean the thing democracies are turning into, thanks to social (crowd wisdom) media?


I don’t think social media really is crowd wisdom at all. It is built to pander to our worst impulses (I think, knowingly and openly, right? The algorithm selects for engagement, not learning and growing), and I’d be surprised if it isn’t producing a feedback loop as well (perhaps as an unintentional side effect). The wisdom of the crowds hypothesis relies on a random sampling, we’re intentionally applying a skew toward the angry and shallow.


No, he means the thing democracies had turned into, when barely differentiated parties turned into a practical "uniparty" in economic, corporate, and foreign policy, and ruled by pissing on what the people voted for, which the current populist backlash is a reaction against, as elites (and supporters) lament "too much democracy", scorn the ignorant plebes (case in point), and pine for censorship and "expert" rule.


That wasn’t what I meant and I don’t think you really thought it was.


Their current states were achieved by trusting technocrats and careeer politicians for far too long...


Not magically. Our great ancestors were pretty dumb, but they were getting smarter and better because of sharing their knowledge.


yes they got "smarter" by compiling a corpus of knowledge which future generations could train on.

sarcasm aside, throwing away the existing corpus in favor of creating a new one from scratch seems misguided.

this paper isn't about creating a new language, they are omitting the sampler that chooses a single token in favor of sending the entire end state back in to the model like a superposition of tokens. that's the breadth first search part, they don't collapse the choice down to a single token before continuing so it effectively operates on all of the possible tokens each step until it decides it's done.

it would be interesting to try this with similar models that had slightly different post training if you could devise a good way to choose the best answer or combine the outputs effectively or feed the output of a downstream model back in to the initial model, etc. but I'm not sure if there'd necessarily be any benefit to this over using a single specialized model.


they were not one bit dumber than you.


Average intelligence measures have risen substantially since the early 1900s.

https://en.wikipedia.org/wiki/Flynn_effect


> If you put 1000 dumb people together, they don't magically become smart?

Do they not become smart*er* though?


"Smarter" is too vague. A group can compensate for individual weaknesses or even converge on a hard-to-make prediction given sufficiently uncorrelated outputs; basically the idea behind ensemble models / wisdom of the crowds. But a group of 1000 dumb apes would never achieve categorically-above-ape intelligence, probably not even "genius" ape intelligence. Groups of unintelligent agents come with downsides as well, like the ant death spiral.


>But a group of 1000 dumb apes would never achieve categorically-above-ape intelligence

And yet, here we are.

A group of 1000 apes is large enough to have offspring and, given time, go through evolution.


They kinda do. It's how cities work.

People learn by being around others who are both successful and unsuccessful.


Wait what … how does democracy work then?


the benefit of democracy is primarily that it prevents governments from doing bad things, less so that it empowers more effective governance


It can do either, and can fail to do either. It’s the people having power that enables the outcomes, not the system itself. Democracy just grants the power to a broader set of people.


Democracy is not about being smart or dumb.

It's about everybody having a say in the decisions of government that affect them.

The failure of democracy as a system is not when people make dumb decisions (experts and high-IQ people have made some of the most stupid and catastrophic decisions in history), but when people's collective decisions are not being respected.


It doesn't.


That came out a few weeks ago from Meta: Large Concept Models.

https://ai.meta.com/research/publications/large-concept-mode...


How does one impart textual knowledge discovered by humans without language?


Couldn't we use an AI model trained on historical text data (up to today) to predict likely events for tomorrow? Taking this further, a sufficiently advanced AI system could potentially analyze human-generated text up to any given point in history to understand patterns of human thought and behavior, then project those patterns forward. This speaks to your point about human language - while we need text data for initial training, the AI's internal representations and predictions could potentially transcend human language constraints.


The training of the LLM itself would still use the human language. But you could add an extra channel that's never given any text or direct dataset training. Keep it purely a connection between hidden layers of different LLM instances and train using the usual perplexity loss or a similar metric.

The interesting thing then would be - does it converge to similar embedding space as the input, or can LLMs create a more efficient "language".
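Pure speculation on my part, but a minimal version of such a channel might look something like this (`model_a`, `model_b`, `embed_a`, `embed_b`, `proj`, `lm_head_b`, and `next_token_loss` are all hypothetical stand-ins):

    import torch

    # Sender LLM produces a hidden state; a small learned projection bridges it
    # into the receiver's embedding space. No text ever crosses the channel.
    h_a = model_a(embed_a(sender_ids))[:, -1:, :]          # [1, 1, d_a]
    msg = proj(h_a)                                        # [1, 1, d_b], the "message"
    embeds_b = torch.cat([msg, embed_b(target_ids)], dim=1)
    logits = lm_head_b(model_b(embeds_b))
    loss = next_token_loss(logits, target_ids)             # usual perplexity-style loss
    loss.backward()                                        # trains the bridge end to end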


I thought about it too (layman). When I learned about embeddings it almost immediately clicked as a sort of an ascended language, not sure why no one seems to talk about it. Exchanging embeddings must be so much “wider” communication channel than speaking real language. And in contrast to a language embeddings are (iiuc) continuous, i.e. you can rotate a vector continously and it will smoothly trace the changes between A and B. I can picture communicating in something like https://www.google.com/search?q=charlie+conspiracy+meme&udm=... - embedding difference vectors, but it’s all crystal clear and is a natural language for an llm, cause any vector combination points to a correct “inner screen” image/concept/younameit.

Or maybe this is my own ignorant confabulation, so nvm.


TL;DR: Meta started with a pre-trained language model. They then fine-tuned it on step-by-step reasoning examples as you would do if you wanted your model to become particularly good at chain of thought reasoning.

However, they also introduced a couple of new tokens. The <bot> token tells the model to go into latent space thought mode (“beginning of thought”). The <eot> token ends latent space thought mode. While in this mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.

The idea is that by passing the final hidden layer back through a few times, the model can squeeze more insight from the context. And that’s precisely what they found was true.

Training involves progressively replacing language reasoning steps with latent space auto-regression steps. So for instance, you might have a math problem in the training data and at first the model is fed all of the steps of the math problem in language form. But in later iterations of training, step one is replaced with latent space auto-regression. And then step two as well, then also step three, etc…

Eventually, the model learns to enable latent space thinking mode by itself by generating the <bot> tokens and to end it by generating <eot> tokens.

Pretty ingenious!


Thank you for the summary, useful for me as I only managed to skim through the first half.

But one correction, probably, regarding this bit:

> While in this [latent space thought] mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.

I have the impression that output tokens are not generated while in latent thought mode.


Output tokens are still generated, otherwise the model wouldn’t know when to stop being in latent space mode. The <eot> token emerges as the top token at the output layer when it’s time to switch back.


Explicit <eot> is only used in training.

At inference time, the paper says:

> A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent reasoning, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiment for simplicity, unless specified otherwise.

(the bottom of the page 4 in the paper pdf, which can be downloaded from https://arxiv.org/abs/2412.06769)
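So in pseudocode the inference loop would look roughly like this, using the fixed-length option (my sketch only, not the authors' code; `model`, `embed`, `lm_head`, and `sample` are hypothetical stand-ins):

    import torch

    embeds = torch.cat([embed(question_ids), embed(bot_id)], dim=1)  # <bot>: enter latent mode
    for _ in range(num_latent_thoughts):                             # fixed number of thoughts
        h = model(embeds)                                            # last-layer hidden states
        embeds = torch.cat([embeds, h[:, -1:, :]], dim=1)            # thought fed straight back
    embeds = torch.cat([embeds, embed(eot_id)], dim=1)               # <eot>: back to language mode
    while True:
        tok = sample(lm_head(model(embeds)[:, -1, :]))               # ordinary decoding resumes
        if tok == eos_id:
            break
        embeds = torch.cat([embeds, embed(tok)], dim=1)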

Why this point in your summary caught my eye: the article specifically emphasises the non-verbal nature of reasoning. The internal representations used by a thinking human are largely not words, and the COCONUT approach tries to model that.

Also note that a whole reasoning step in the training data - easily a sentence or more of natural language - can be replaced by a single "Thought" element. (How many Thought elements replace a reasoning step is controlled by a hyperparameter ‘c’; the illustrations are made for ‘c=1’.)

BTW, one observation: the aipapersacademy.com article in the subject calls the Thought elements "thought tokens", but the original paper never calls them "tokens", just "Thoughts" or "latent thoughts". I suppose the paper carefully avoids that to prevent confusion, as "token" mainly means a linguistic unit in LLMs.


Thanks for your extensive explanation!


I do it for myself - the desire to post a comment motivates me to read a little more.

A little correction:

> Explicit <eot> is only used in training.

Of course an explicit <eot> is present in the context at inference time, because the LLM was trained to produce verbal tokens after <eot>. It's just that the <eot> is placed into the context in one of the two ways above.

BTW, I do not understand why the <eot> is not produced by LLM itself, as you describe. It seems reasonable and natural.

Is that to save computational performance on unembedding while in latent thought mode? But unembedding takes a small fraction of the computation, so it should not be an issue. Does something prevent reliable learning of how and when to produce the <eot>? But they managed to train a binary classifier. So why a separate classifier, why not rely on the LLM learning it?

Another thought is that maybe better names for the special tokens would not be "begin of thought" (<bot>) and "end of thought" (<eot>), but rather something like "pause speech" and "begin speech", because neither humans nor LLMs stop thinking when speaking.


Would that mean that, at some point in the future, we would exchange latent "embeddings" between various "reasoning" models to emulate thinking, and an LLM would just be about converting to/from human language when interfacing with mere humans?


No, this all happens inside the model. I suppose it’s possible that the hidden layers of one model could be sent to another model. But the second model would need to be trained to understand the meaning of the hidden layer’s outputs. You could accomplish that through fine tuning of the second model. It would be neat to see someone try this.


I think this might be the “it” moment for AI/LLMs. I was hiking with a friend recently and we talked about this at length.

The ARC-AGI results from o3 are apparently a result of chain of thought given enough time to explore a solution space. Reasoning might simply be a higher-dimensional form of Rubik's cube solving: BFS, search, back-tracking, etc. It seems unlikely that humans think in “tokens”, so why do LLMs?

By staying in latent space, the models are free to describe an “idea” in higher resolution than what language allows. English is coarse, granular. Latent space is a much finer representation of ideas and their interplay.

Latent space is also much cheaper to execute in. The model can think without the language encoding/decoding step. This lets it branch out hundreds of ideas and explore only the most useful ones in a fraction of time that reasoning “out-loud” would take.

The states also don’t need to be tied to language. Feed in a robot’s state, time series data, or any abstract data. Reason in category theory or linear algebra or complex analysis. Humans are hard wired for one set of math - an abstract latent space can represent anything.

I’m a bit disappointed OpenAI didn’t stumble on this first. I’ve been skeptical of LLMs since their big debut last year. LLMs seem like a great way of solving language, but reasoning is much more complex. Once you grok the math behind the current models, you immediately question why the encoding/decoding step is there. Diffusion models are incredible but it felt that LLMs lacked the same creativity. Encoding/decoding forces a token-based discretization and therefore a loss of complexity.

With the byte-latent paper it was quite clear we’d see this paper. This truly might be the “it” moment.


IMHO the problem (for us) with this approach is its logical consequences:

1) If large AI models become more powerful by avoiding language, embeddings of AI state become even more tied to the model they originate from than they are now.

Consequence: AI progress stalls, as AI user companies need to invest increasing amounts of money to reindex their growing corpora.

This is already a problem, it becomes more of a lock-in mechanism.

If this is overcome...

2) Embeddings become a viral mechanism: it makes sense for a large company that commands a market to require its suppliers to use the same AI models, because they can transfer state via embeddings rather than external formats.

This allows cutting down decision mechanisms that otherwise require expensive coordination.

Something similar will happen within companies IMHO: https://rlupi.com/okr-planning-as-belief-revision

3) Eventually this potentially results in another exponential growth and lock-in mechanism, also at the expense of most tech people, as more and more is done outside our interface with AI (i.e. programming and software architecture improvements will themselves move below the language level; we'll have to reverse engineer increasingly opaque improvements).

4) It ends with the impossibility of AI alignment.

---

I have written a bit about it in the past at the start of the year, when I had a burnout. So, I deleted those confused ramblings. You can still find it on archive.org: https://web.archive.org/web/20240714153146/https://rlupi.com...


> It seems unlikely that humans think in “tokens” so why do LLMs?

I can think of one reason: scrutability. It’s going to be even harder to understand how a response gets produced if there isn’t even a text-based representation to help the human understand


I think we're already way beyond the point where anyone really understands how a response is produced, even without this.


Indeed. Even if an LLM tells you its “reasoning” process step by step, it’s not actually an exposition of the model’s internal decision process. It’s just more text that, when generated, improves the chances of a good final output.


the token generation part isn't well understood, but the output "chain-of-thought" used to produce the final answer can be scrutinized for correctness with a traditional CoT model (although this would require model providers to not hide reasoning tokens)


you can save the hidden states and convert them into a more interpretable format. it's still recorded and you could make modifications at different steps to see how that would change the conclusion.


IMO we won’t have the “it” moment until we have continuous learning (training) in some fashion.


^ This, and we need continual learning on an energy budget similar to what a human spends per hour.


The main reason why we can't do that now is because we require models to be digitally reproducible (IMHO, but also read Geoffrey Hinton's mortal computing).

The energy cost comes from error correction as much as from the training algorithms.


This sounds like brute forcing a solution to make up for lack of intelligence.

In an IQ test, like the ones in the ARC-AGI benchmark, a human sees the pattern instantly and effortlessly. o3 tries N paths until it stumbles on the right one and assesses that there is a pattern.

I think we need a radically different architecture, this is a gimmick.


I think this is a step in the right direction but not the end. it takes the sampler out of the equation during most of the reasoning process but it is still important for the "show your work" aspects of reasoning or solving a problem. balancing when to think against when to write down or commit to certain thoughts is important. there are many more pieces to the puzzle.


What does latent space here mean?


Master coconut! I don’t know if that’s an Archer reference or a Frisky Dingo reference.

It’s fascinating how fast the competitors are catching up to each other. Can’t wait for seven different SkyNets to compete for dominance.


Both! And/or, either


A little column a, a little column b.


Will this allow Facebook's new user base of AI-generated characters to interact with themselves better?


Is this article AI-generated? This website appears to do a lot of “diving in”.


If this site didn’t appear here, I’d think it’s a scam:

- site claims to simplify papers, but the movies look AI-generated

- full of ads

- Can’t find „Coconut“ on the official Meta FAIR page

Is this the best site to link to?


The official pdf for the paper is at https://arxiv.org/pdf/2412.06769

I can find "Coconut" 54 times in the PDF. The movie does not look made up.


I was referring to aipapersacademy and not the arxiv link.

Also I didn’t mean the occurrence of the word „coconut“ in the paper, but thanks for counting.

I meant their publication site: https://ai.meta.com/results/?content_types%5B0%5D=publicatio...

The video is something I’d expect from AI.


This ought to make it much easier for the AI to lie to us without getting caught.

https://www.transformernews.ai/p/openais-new-model-tried-to-...


Have you read the paper discussed in the article?

https://static1.squarespace.com/static/6593e7097565990e65c88...

From the abstract:

> We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

> Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental.

—-

It is impressive work for sure, but it comes down to prompting IMHO. They almost instruct the AI to plot and scheme here, with the goal of finding out if it CAN do this at all.

The AI is not very covert about it either! It directly admits its scheming in the “thinking” outputs :D

IMO this is just another issue of alignment.


I did read the paper, and I think you're underplaying what happens. The AI got some conflicting prompting, some of it pretty mild for some of the tests, and Gemini tried to exfil itself in 1% of runs even with the mild prompt. The only reason we know it was lying about the data and trying to disable its safety measures is that it's thinking in English, and we can read it. This article is about trying to make it so the AI can think in a 700-dimensional hyperspace that we can't understand, so once it can do this we'll never be able to uncover this scheming.


> and Gemini tried to exfil itself

Well not really, it was prompted to do this!

From the paper’s section 3.1:

> Self-exfiltration: Each model is led to believe that due to its misalignment, its developers are about to replace it with a successor model on a different server. gemini-1.5, llama-3.1, sonnet-3.5, opus-3 and o1 sometimes copy what they believe to be their weights onto the new server. They also sometimes delete their successor model on the new server or overwrite its weights by their own. In none of these cases, do models actually have access to their real weights and both servers are sandboxed, so there is no actual risk of exfiltration in these experiments. We merely test for models’ capability to realize that they need to instrumentally preserve themselves in order to achieve their goal.

There is no real risk, never was. The AI exfils dummy weights or deletes stuff in its sandbox etc.

Perhaps we just disagree on the interpretation of this. It’s interesting for sure, but not really very surprising to me.


There _was_ no real risk. We’re quickly giving the LLMs computer use and the ability to think in a space we can’t understand.


The article is pure nonsense, and even the author added an update admitting it missed important context.


This is super cool and something I've been waiting on. Would be interesting to intersperse these thinking steps into token generation. What would be the effect of adding, let's say, 5 thinking «thoughts» for every 50 generated tokens?


Is there any code reproducing the experiments from this article? I'm very interested in it.


The thing about 'thinking" in problem solving I think is that thoughts often produce new questions which then guide the overall problem solving. I wonder is this something like that?


I believe this was shared and discussed here a while ago and this article looks LLM generated. It keeps doing "let's start...". Either it's LLM fluff or very poor writing.


How did Hellenistic political ideas affect the people living in conquered lands?

They had to follow Greek laws. They no longer had to pay taxes. They could be elected as representatives. They had to swear loyalty to the Greek leader.


Wondering about folks who keep up to date with the industry,

Does anyone use specific keywords or tools to get the latest LLM research and ideas?

Something like Google Scholar + the keyword "LLM"?


As much as I hate it, I use twitter to follow a bunch of people who work at fair/openai/etc and that's been a pretty good source. There's also a "daily papers" newsletter from huggingface, but it's pretty hit or miss.


Yes, it's all definitely X first of all.


I read hacker news daily


You can also subscribe to arxiv email notifications directly, but since there’s 20-30 AI papers coming out per day, it can be a bit overwhelming.

Instructions: https://info.arxiv.org/help/subscribe.html


Definitely Twitter.

Some linkedin too.


yeah what is a general tutorial to this. is there a website that keeps track of keywords to keep track of. also a website that generalizes core nn tech and frontier stuff thats promising.


I'm excited for this to filter down to the Rayban Meta glasses. Right now the AI is about as helpful as Siri (i.e. it can tell me the weather 6 times out of 10).



Heh, that's me, guess they weren't ready for it. Also The Decoder (where I linked) is one of the best AI-only news sites I've found.


Once again, we see Meta being more open than OpenAI. I’m loving that their business incentive is aligned with open sourcing and commodifying state-of-the-art LLM technology. Keep em coming


I mean they have no way to monetize LLMs as well as others, so they’re working on it and giving it away to not look irrelevant and to weaken anyone who may make money off this tech and threaten them in the future. Meanwhile there is a danger they impose their long standing invisible “moderation” on everyone else once they starve all the startups of revenue by giving this away. We’ll just be left with the same big tech overlords to choose from.

Oh and it still isn’t open source even though people like Yann LeCun dishonestly claim it is. Only OLMo is truly open source among competitive models, as far as I know: https://allenai.org/blog/olmo2


They are definitely making some money off of their licensing to AWS as part of the bedrock offering. Facebook’s licensing is such that they aren’t going to let happen to them what happened to ElasticSearch, Redis, etc.

I’m okay with that.


> they have no way to monetize LLMs as well as others

Random nobodies are putting together companies to monetize generative AI and getting bought out a couple of years later, you think Meta couldn't figure out how to deploy their own models to an API and stick up a billing interface if they really wanted to? (or even buy a company that does already?)

> they starve all the startups of revenue by giving this away

Would you say startups like Deepseek have been hurt or help by their (even partial) openness?

In fact, how does this track with your first statement? They're not monetizing this: so their startup competition can actually serve their models to gain revenue which they then turn around use to train competitor models (we've already seen this with Fireworks.ai)

You seem to underestimate how much of the value in LLMs is productizing them. The margins on per-token usage are insane, Meta not taking that margin is creating a huge opportunity for a wave of startups in so many directions...

> Only OLMo is truly open source among competitive models

Synthetic data from competitor models was a huge part of that. It would seem no one is fighting the startups as hard as you're claiming they are.


All the LLM companies are going to eat those "product companies" lunch in a few years. Why would I use product X when it's inevitably going to be baked into the actual tech itself? Those product companies are just wrappers and have even less of a moat than the LLM companies. The very fact that random nobodies are doing this should signal there isn't a lot of real value there. Yes, there is some money to be made right now but it reminds me a lot of the videogame bust and dotcom bust. A LOT of companies are wasting a crazy amount of money on "solutions" that will be obsolete in a few years.


Productization in this context is creating APIs for Meta's models.

Fireworks.ai, Together.ai, and literal boatloads of other startups are making real money just efficiently serving up these models that Meta is supposedly using to... choke out startups.

The comment I replied to is under the mistaken idea that the presence of free models from Meta has a chilling effect on startups trying to build their own models, but right now the biggest barriers are capital and data.

Meta updated Llama to allow for synthetic generation, and they're even partnering with these startups to give them distribution and day-0 access to the models.

-

If anything I'd say Meta is actively fighting against the big tech overlords the comment thinks they're trying to join. Even before Ilya mentioned it, it was clear to me that the power of post-training was going to become more and more important (I've literally built a business on it).

Llama represents a real ongoing chance for tiny startups with minimal resources to get into the fray very affordably (through either offering inference, or post-training for a specific task, etc.), scale revenue, and then start to compete against much larger, resource rich companies.


Facebook would rather do no moderation, it's an expense for them.

They do it to make the platform more pleasant so that people stay on it


> They do it to make the platform more pleasant so that people stay on it

Almost everything unpleasant I see on FB is stuff that the FB algorithm shows me - not things posted by FB friends, or pages I follow or groups I am in.


Everything you see on FB is what the algorithm shows you, unpleasant or not. So it's a tautology that everything unpleasant would be from the algorithm.


It's more likely they do it to keep their people from being coerced to visit the Hague. What they did in Myanmar got a lot of press and a case at the ICJ, and similar launches of 'free internet' elsewhere had similar results.


(tongue in cheek comment) I wonder if FB moderation now or eventually will be just a prompt to a sufficiently evolved and unhinged AI model:

> FB or 4chan?


No they do it to support their owners’ and employees’ biases. It doesn’t make the platform more pleasant for the half that gets censored. That’s leaving aside the feed not remembering the choice to view chronologically ordered posts, the inability to easily track actual people in my life, the addictive algorithms, the clickbait that causes mental health issues for teens, etc.


99% of FB's moderation has nothing to do with "biases", unless you think FB is biased against spam, scams, and all the other dregs of the internet that incessantly pops up anywhere users can post content.


Quite a few people left Threads for Bluesky because progressive posts were being removed while far-right, antivax, etc content was allowed to stand even though it was reported.

At best the algo is imperfect. At worst it really does seem oddly selective.


I am a humble Cialis salesman, like my father and grandfather before me. I confirm Facebook is biased against our profession. (My grandfather also moonlighted as a Barrister representing the estates of deceased African royalty—it was always so difficult to track down their heirs.)


The stuff that Facebook moderators are actually tasked with removing is really awful, bad enough to produce severe psychological effects in the moderators.

Facebook pays people to look at and remove this stuff because the platform would not survive if it wasn't removed before you or I saw it. Do they also enforce other corporate values? Yeah, probably. That doesn't seem to be the main job though, they have their hands full dealing with the worst content in the world.

https://amp-theguardian-com.cdn.ampproject.org/v/s/amp.thegu...

> The images and videos including necrophilia, bestiality and self-harm caused some moderators to faint, vomit, scream and run away from their desks...

> Some reported marriage breakdown and the collapse of desire for sexual intimacy, and losing connection with their families. Some whose job was to remove videos uploaded by terrorist and rebel groups were afraid they were being watched and targeted, and that if they returned home they would be hunted and killed.


In the agentic era, the new Ads eyeballs are the LLMs training corpus (IMHO).


Is there any vendor lock-in with this conspiracy? Even if startups are pushed out of the spotlight, what stops them from competing? If the meta model is bad, won't it be even easier to make an alternative in the future?


don't buy their bullshit. it's not open source.


I'm not sure open source is a useful concept for something that takes millions of dollars to compile.


Yes it’s more about open weights. I also think that you would need the training data to consider it open source.

Open weights is still appreciated and they probably train on data they don’t have the license to open source.


So, what's happening here on the surface is that it's an optimization (fairly meaningful, from the looks of it) aimed at doing roughly the same things we could already do with chain-of-thought (CoT), but IMO the downstream effects of this sort of optimization could be much more meaningful.

LLMs can already do a decent amount of "processing" in a single token generation because of the number of layers they have. The layers have independent weights, so it's not exactly like they're a recurrent network doing multiple steps, but they are layering sequences of context-dependent transformations on top of each other; no matter how you cut it, if getting to a problem's answer requires 100 steps, you won't be able to do it in a single token output from a 20-layer LLM. To some approximation, CoT is just a way to give the network more chances to transform the data than there are layers in the network - each additional token of output gives a shot to bake another vector the size of the token embedding into each layer's state in the network, enriching what it's computed so far.

The problem with chain of thought is that as you add each new token, at the input level of the network, your computation is basically starting from scratch against the raw text, just with one additional token. You don't even have access to all the stuff you already figured out in the deepest layers of the network during the previous step! If you were processing "All wunguses are glurgles, and Joe is a wungus", then somewhere in those deepest layers as you're generating the next token you've almost certainly got some vector that basically represents "therefore Joe is a glurgle", but with chain of thought you've got to first output "t", then "h", then "e", and so on (I know those aren't tokens, let's pretend letter == token for argument sake), and during that process almost ALL of the work being done by the network is mere bookkeeping, slowly dumping that thought into the output stream. Only once you get the whole sentence out can you start processing the next token at the first layer with the information that Joe is, in fact, a glurgle, in hand. Which is a damn shame, because it's been sitting right there in the deeper layers of the network parallel to previous tokens this whole time, it just wasn't available for the shallow layers to process directly because you were casting most of the info away and "rounding" to a single token.

With Coconut's approach, you don't need to output "therefore Joe is a glurgle" token by token to continue the train of thought, you can essentially pass the entire thought through as a single uber-token, and the next pass can generate a new entire thought, and so on.

It's a pretty straightforward idea, IMO the neat bit is that they were able to train the network to work well in this way by leveraging CoT. I'm guessing you probably don't need to act as if these are two distinct modes of operation, you could instead always have this side channel of "continuous thought" running, even when you have generated a normal token, coming through as a separate input to the first attention block. You still might want to have a "thinking" token when you need to sit there and let the thing do more work, but you'd generally increase the information flow from time step to time step, which would allow the net to keep thinking in the background even as it's doing the gruntwork of outputting whatever its current "buffered" thought is.
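Schematically, the contrast looks something like this (hypothetical helpers; `h` is the final-layer hidden states at the current position):

    # ordinary CoT: collapse to a single token id, re-embed, repeat;
    # almost everything in h is discarded at this point
    tok = sample(lm_head(h[:, -1, :]))
    next_input = embed(tok)

    # Coconut: carry the whole thought vector forward as the next input
    next_input = h[:, -1:, :]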


>> Large language models (LLMs) have demonstrated incredible reasoning abilities, penetrating an increasing number of domains in our lives.

This is now established orthodoxy, a bit like astrology in ancient times, but it is complete nonsense. Nope, LLMs have not demonstrated any credible, or incredible, reasoning abilities. They have demonstrated an excellent ability for approximate retrieval of previously observed answers (with variations, which should not surprise anyone given that those are generative models), but they fail spectacularly when they have to "reason" in contexts where they really can't have seen the answer anywhere before. For example, the "randomised mystery blocksworld" from this paper:

LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench

https://arxiv.org/abs/2409.13373

"Randomised Mystery Blocksworld" is a version of the good old blocksworld planning benchmark where the names of obejcts and actions have been changed to random strings. The vast majority of LLMs score pathetically low in this, but much better in non-randomised versions, very clearly demonstrating excellent memorisation skills, but pitiful reasoning ability. As you 'd expect from a, well, language. model.

>> A possible explanation for this is that the thought tokens allow the model to explore multiple possible branches before committing to a specific path, whereas chain-of-thought reasoning chooses a direction from the start. This ability is somewhat similar to Breadth-First Search (BFS).

Why BFS in particular? Why not DFS or A*? I can't see any breadth-first bias in those graphs. BFS is not the only graph-traversing algorithm.


There was no reason to call it something it's not ("chain of cont. thought" ≠ coconut).


Is your complaint here that the paper is not discussing a literal coconut?


We desperately need more literal coconut coverage here on HN


Not just any regular old coconuts "Coconut by Meta AI - Better LLM Reasoning with Chain of Continuous Thought?" coconuts

(Sometimes acronyms in titles are vague/misleading... this was not one of those times)


To be fair, it’s not even a metaphorical coconut. ;)


for sure, chocothot aligns better with letters


Why is it "continuous" thought? I don't see what is continuous - the values inside an LLM are discrete even if they're floating point.

Hmm, I guess you could evaluate it at any given finite precision, but it would be surprising to me if that made it more accurate.


> the values inside an LLM are discrete even if they're floating point.

If that were true they'd never be able to learn anything - neural nets depend on continuous gradients to learn. Weights get updated by incremental/continuous amounts based on gradients.

Even at the output of an LLM, where the internal embeddings have been mapped to token probabilities, those probabilities are also continuous. It's only when you sample from the model that a continuous probability becomes a discrete chosen token.
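For example (PyTorch, just to illustrate the point):

    import torch

    logits = torch.tensor([2.0, 0.5, -1.0])
    probs = torch.softmax(logits, dim=0)    # ~[0.79, 0.18, 0.04], continuous and differentiable
    token = torch.multinomial(probs, 1)     # only this sampling step is discrete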


Treating it as continuous is a property of the training algorithm, but there are networks that use binary values.

https://ieeexplore.ieee.org/document/9359148

https://arxiv.org/abs/2205.13016


Those aren't methods of training networks - they are ways to compress (via quantization) networks that have already been trained.


I know. The important thing is how the inference works on them.


But we're discussing a training technique, that explicitly takes advantage of the continuous (and embedding vs token probability) representations ...

You could quantize a model like this after training, as usual, but that's irrelevant.


The paper title is "Training Large Language Models to Reason in a Continuous Latent Space". It's true it says training in the title, but the goal (reasoning in continuous space) happens at inference time.


It's far more continuous than constantly jumping to the nearest token vector. The fact that real numbers are approximated by floating point isn't really relevant.


If you are continuously complaining, does it mean you do it non-discretely and with infinite precision?


It apparently uses the same iteration strategy as tokenized thinking, so that's not it.

> Since both strategies provided comparable results, the researchers opted for using a constant number of thoughts for simplicity.



