Coconut by Meta AI – Better LLM Reasoning with Chain of Continuous Thought? (aipapersacademy.com)
362 points by TaurenHunter 29 days ago | 156 comments



Interesting. Due to its emphasis on BFS, it's the opposite of something I've been trying (I named it the "Tree of failures").

My assumption was that humans don't try a breadth-first approach. Instead, we split a task into a short-step (selected by instinct and intuition) and a long-step that summarizes/stores the next steps. The key idea is to recursively evaluate a task as a short-step (high-res - gets executed) and a long-step (lower-res - is just stored), until it succeeds or fails. If it fails, we must walk back, keeping a summarized tree of failures in state so that we can exclude them in future selections.

The effectiveness of instinct has a steep fall-off at longer distances - so it's better not to chart out a whole series of steps. When we do BFS, we drive down the value of instinct in favor of compute. I guess ultimately, it depends on the type of problem you want to solve.
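To make it concrete, here's a minimal sketch of how I'd prototype the failure-tree part (my reading only: `expand`, `is_goal`, and `instinct` are caller-supplied placeholders, and the short-step/long-step split is collapsed into an instinct-ordered expansion):

    # Depth-first descent guided by a cheap "instinct" ordering, remembering
    # failed partial paths so they're excluded from future selections.
    def solve(state, expand, is_goal, instinct, failures=None, path=()):
        if failures is None:
            failures = set()                        # the summarized tree of failures
        if is_goal(state):
            return path
        # expand(state) yields (step, next_state) pairs; instinct orders them
        for step, next_state in sorted(expand(state), key=instinct):
            new_path = path + (step,)
            if new_path in failures:                # known dead end, skip it
                continue
            found = solve(next_state, expand, is_goal, instinct, failures, new_path)
            if found is not None:
                return found
            failures.add(new_path)                  # walk back, record the failure
        return None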

Reach out to me if you want to prototype it with me.


I feel humans do something in between, maybe a bit like A* would do sometimes. I wouldn't call it A* because of the lack of a consistent heuristic and also the lack of a strictly numeric evaluation, but it's in between DFS and BFS for sure (as is every tree search algorithm?).

We go deep while we think it's a good lead, because so far things make sense and it'll be less work, but at some point we start questioning our decisions early in the descent and try alternatives.


You may find Prioritized Grammar Enumeration an interesting in-between DFS/BFS algorithm:

https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...


I think the problem with long chains of steps on their own (without the bfs stuff) is that your failure probability quickly grows to unreasonable levels.

Basically, if each step has a 97% chance of being completed correctly and your task requires 10 steps one after the other, the chance of success falls to 0.97^10 ≈ 74%.
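For illustration, the fall-off with chain length (97% reliability per step assumed):

    # probability the whole chain succeeds if every step is 97% reliable
    for n in (1, 10, 50, 100):
        print(n, round(0.97 ** n, 3))   # prints 0.97, 0.737, 0.218, 0.048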

If I understand correctly, part of the point of the BFS is to throw compute at it, in order to lower the failure rates. Kind of a "run many times in parallel and pick the best one". This can be effective, but also quite expensive, as seen in the costs OpenAI had to pay for their ARC-AGI benchmarking runs.


Your "Tree of failures" approach aligns with how natural cognition seems to work at the edge of comprehensibility. Rather than exhaustively searching (BFS), we use instinct for immediate steps while maintaining a lower-resolution model of longer-term possibilities. The key insight about storing failures rather than successes is particularly interesting - it's more efficient to remember what doesn't work and let patterns emerge naturally from the remaining space.

This maps to what I've been exploring with edge cognition and semantic anchoring - using fast set operations to quickly eliminate known bad paths (your failure tree) while allowing the system to explore promising directions using more expensive operations only when needed.

The instinct fall-off you describe mirrors our observation about the relationship between computational load and pattern recognition. As distance increases, we need more efficient ways to prune the search space rather than trying to maintain high-resolution understanding throughout.

My gut says optimizing on the amount of compute used to do the search (and the inference) is maybe something worth exploring.


Reminds me of what plandex does. https://plandex.ai/ It already does the automatic "does this need splitting into subtasks, or can it be solved immediately" processing.


I don't get why you need tree search at all? What does it give you over a pure LLM trained to do CoT in a tree-like manner? If the context window's long enough, it can generate the reasoning-tree just by pure next-token prediction, and rather than BFS, it can guide the tree search with its own value function (which is part of the LLM itself) instead of sticking to hard algos like BFS and DFS.

By the way, BFS sounds like it will give you thorough results, at the cost of increased compute. Useful for beating benchmarks, but probably causes marginal improvement for massively improved compute.

Still, the improved quality could be meaningful, if it's used for generating training data for Llama4


Tree search is natural when you want a path to navigate, so it does fit a sequence of interactions in a conversation too.

I agree that both DFS and BFS are likely awful[^0], but a more informed approach can probably do better[^1]. Also, at some point when generating the conversation/reasoning tree through token prediction, you need to choose which of the possible conversations you are going to keep extending/generating, which maps precisely to choosing which node to expand in tree search. I'd argue instead that everything ends up looking like a search algorithm from some angle; at least it'll be the case for anyone who has studied it more deeply.

I'll go even further and claim that Tree Search is Complete as for every problem there's a solution space that can be navigated with a Tree Search Algorithm[^2]. I used to think that you could walk down the space of provable things, but now in the LLM hype days it seems you only need to walk the space of conversations that you can generate.

---

[^0] with DFS always at risk of giving obnoxiously long answers, or not terminating if there are loops or spirals
[^1] probably through metadata coming from latent variables meaningful for judging a conversation (certainty, ~branching size of a reasonable conversation, whether there are open questions left)
[^2] Even if that is done poorly, as on combinatorial problems. Imagine a sudoku where you only check the rules once you fill all the cells.


The classic thing people say is "asking the right question" gets you half way there. Your approach sounds like something I call "getting to No" for a problem. It's sort of a combination of "getting to know" and the opposite of the salesman's "getting to Yes". When it works, it's the fastest way to prune off obligations.

The goal is to figure out why some particular problem: isn't really a problem, doesn't need to be solved, can't be solved that way, can't really be solved (because of physics or it's really a different problem). As you define the problem better, you can rule each one out to find the "real" problem, the one you CAN solve, and at least one path forward. There's still many ways that it might not be the optimal path, but you know roughly how to get to somewhere better. It also trains you to see around obstacles to success.

I've found that some of the best work I've done (especially on acquisitions) was in defining why NOT to do something that looked like a good idea (or particularly interesting to work on) from the outset, but was destined to fail or required unknown HW technology. Frankly, looking >5 years out feels like a coin flip, because some other competing technology could come along before you can get to production.


that's more fit for agents, no?


You're right that it's technically orthogonal to what's in the paper. I was trying to model the "reasoning process", which has general applicability depending on how/where it's implemented.


How do you understand instinct?

I bought a new SSD for an old laptop to avoid buying a new one (the x230 has an amazing keyboard), but left for another country for Christmas. My intuition told me to take it with me, but logical sense said there would be no time for such things as moving the OS to a new drive.

My flight back to the country I work in got cancelled due to fog and I ended up spending a week longer at my in-laws' place, with plenty of free time. The new 512GB drive would have helped with my studying, giving plenty of space for school VMs.


Paper: https://arxiv.org/abs/2412.06769

The link is in the OP, hidden away in an image caption for some reason.


So is the big improvement here simply skipping the unembedding/embedding step for internal thoughts? Or is it mainly in the training methods to teach the CoT and how to switch between "latent thought" and text output?

It's really interesting that a fixed number of "latent thoughts" performed as well as a binary classifier! I didn't expect that at all; the way OpenAI talks about CoT, it seems the ability to let it "keep thinking" lets them continually score higher on benchmarks while throwing eye-watering amounts of compute at inference.


It mentioned not penalizing/rewarding the model for thoughts, only rewarding the answer after the thought. I am curious how backpropagation works then.


The researchers leverage existing language Chain-of-Thought data, where each sample consists of a question, reasoning steps, and the final answer. At stage 0, the model does not generate any thought tokens, and is just trained to yield the reasoning traces and correct answers for the Chain-of-Thought samples. In each subsequent stage, we remove one more reasoning step from the sample and add thought tokens in its place. In the illustration above, a single thought token is added in each stage, instead of a single reasoning step, but this is controlled by a hyperparameter ‘c’.
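Roughly, the per-stage data construction might look like this (a sketch of one reading, not the authors' code; the "<thought>" strings just mark where continuous thoughts go, since they aren't real vocabulary tokens, and the supervised loss would only cover the remaining steps and the answer):

    def make_stage_sample(question, reasoning_steps, answer, stage, c=1):
        # stage 0 keeps the full language chain; stage k replaces the first k
        # reasoning steps with k*c latent-thought slots between <bot> and <eot>
        latent = ["<thought>"] * (stage * c)
        remaining = reasoning_steps[stage:]
        return [question, "<bot>", *latent, "<eot>", *remaining, answer]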


The tokens of the answer depend on the preceding continuous thought vectors, which you can backprop through in the usual way.
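A minimal sketch of what that looks like (hypothetical `model` mapping input embeddings to last-layer hidden states, `embed`, `lm_head`, and pre-tokenized `question_ids`/`answer_ids`; <bot>/<eot> markers omitted for brevity). The loss only touches the answer positions, but nothing detaches the thought vectors, so autograd carries gradients back through them:

    import torch
    import torch.nn.functional as F

    embeds = embed(question_ids)                          # [1, T, d]
    for _ in range(num_thoughts):
        h = model(embeds)                                 # [1, T_cur, d] hidden states
        embeds = torch.cat([embeds, h[:, -1:, :]], dim=1) # continuous thought fed back in
    h = model(torch.cat([embeds, embed(answer_ids)], dim=1))
    n = answer_ids.size(1)
    logits = lm_head(h[:, -n - 1:-1, :])                  # positions that predict the answer
    loss = F.cross_entropy(logits.flatten(0, 1), answer_ids.flatten())
    loss.backward()                                       # gradients flow through every thought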


I was waiting for something like that to happen! Next step - creating a human-language-free representation. I believe that once a group of llms can communicate only in embeddings tuned without any human text input, we're going to open a completely new chapter in AI.


This is actually something you probably want to avoid, if at all possible, because it makes it very hard to maintain insight into what the AIs are communicating among them. But that insight is crucial to stay informed about their progress in taking over the world, etc.


Yes! We should be extremely cautious about embracing approaches that make LLMs even more inscrutable. Having CoT, however unreliable it is, is nonetheless a huge boon for model evaluation that we should not give up so lightly.


Yeah, and it might not even gain us that much. It reminds me of how a zipped piece of JSON often comes close enough to bespoke binary serialization formats that it's not worth bothering with it.


How does a group help anything?

If you put 1000 dumb people together, they don't magically become smart?


If you put 1000 people who can't talk together they will create language so they can communicate. He's saying if we put LLMs together and don't force them to use English to communicate then they'll create their own language which may be superior for LLMs to English.

May be true but who knows.

I wonder if anyone has tested the Sapir-Whorf hypothesis for LLMs somehow by training them on different languages and comparing task performance. I guess it's too difficult to get a large equivalent training set in different languages.


Is everything in LLMs translated back to English before interpretation?

It works fairly well in my native language, I’m surprised to learn that things get translated back.


LLMs have no fixed internal representation - they barely have internal anything - so no, there is no translation.

But there's also no guarantee any particular query generalizes (vs is memorized), so it might only be able to answer some queries in some languages.


Got it. And since my native language is arguably one of the closest to English (Dutch), it works very well. But probably not as well for, say, Asian languages, which have completely different grammatical constructs.


It feels like an exercise in anthropomorphization to me.

Sapir-Whorf hypothesis is generally not considered to be reality. It makes intuitive sense but is wrong.

There are hours of podcasts with Chomsky talking about LLMs. The gist of which is that LLMs are extracting surface-level statistical structure of language that will be good for routine coding and not much else. It is easy to infer that Chomsky would believe this idea to be utter nonsense.

I believe even the idea of getting 1000 people together and agreeing to label a rock "rock", a tree "tree", a bird "bird" is not how human language works. Something that is completely counterintuitive.

Reading the paper, no one believes a hidden markov model is creating some kind of new thought process in the hidden state.

I could, though, have no idea what I am talking about with all this and have pieced together parts that make no sense, while this is actually a breakthrough path to AGI.


> There are hours of podcasts with Chomsky talking about LLMs

I'm not an expert, but it seems like Chomsky's views have pretty much been falsified at this point. He's been saying for a long time that neural networks are a dead end. But there hasn't been anything close to a working implementation of his theory of language, and meanwhile the learning approach has proven itself to be effective beyond any reasonable doubt. I've been interested in Chomsky for a long time but when I hear him say "there's nothing interesting to learn from artificial neural networks" it just sounds like a man that doesn't want to admit he's been wrong all this time. There is _nothing_ for a linguist to learn from an actually working artificial language model? How can that possibly be? There were two approaches - rule-based vs learning - and who came out on top is pretty damn obvious at this point.


What can you learn from something parroting data we already have?

Similarly, we are now finding that training on synthetic data is not helpful.

What would have happened if we invested 1/100 of what we spent on LLM on the rule based approach?


There is an old joke that AI researchers came up with several decades ago: "quality of results is inversely proportional to the number of linguists involved".

This has been tried repeatedly many times before, and so far there has been no indication of a breakthrough.

The fundamental problem is that we don't know the actual rules. We have some theories, but no coherent "unified theory of language" that actually works. Chomsky in particular is notorious for some very strongly held views that have been lacking supporting evidence for a while.

With LLMs, we're solving this problem by bruteforcing it, making the LLMs learn those universal structures by throwing a lot of data at a sufficiently large neural net.


> What can you learn from something parroting data we already have?

You can learn that a neural network with a simple learning algorithm can become proficient at language. This is counter to what people believed for many years. Those who worked on neural networks during that time were ridiculed. Now we have a working language software object based on learning, while the formal rules required to generate language are nowhere to be seen. This isn’t just a question of what will lead to AGI, it’s a question of understanding how the human brain likely works, which has always been the goal of people pioneering these approaches.


>Sapir-Whorf hypothesis is generally not considered to be reality. It makes intuitive sense but is wrong

Strong S-W (full determinism) might not be, but there's hardly a clear cut consensus on the general case.

And the whole "scientific field" is more like psychology, with people exchanging and shooting down ideas, and less like Math and Physics, so any consensus is equally likely to be a trend rather than reflecting some hard measurable understanding.

I'd say that the idea that S-W isn't reality to at least some degree is naive.


> Sapir-Whorf hypothesis is generally not considered to be reality.

This is true only in the strictest terms of the hypothesis, i.e. linguistic determinism. Language still encodes a lot of culture (& hence norms and values) in its grammar & diction—this isn't very controversial.

Granted, I don't think this is that related to the topic at hand. There's bias all over the decisions in how to train and what to train on; choice of language is just one facet of that.


Well, maybe not 1000 people, but to our knowledge the human brain is actually made of physically independent zones that barely communicate with each other, except with the zone that takes all the outputs together and tries to do something coherent with all the garbage.

Idk if this could work with LLMs, especially because all the brain zones are somehow specialized into something while two LLMs are just identical machines. But we also know that the specialization isn't that hardcoded: we know that people who lose half their brain (after a stroke) can still relearn things that were managed in the "dead" part.

I don’t know, please correct my errors, I was just thinking aloud to say that multiple independent agents working together may be how "intelligence" already works in the biological world, so why not for AIs?


> the human brain is actually made of physically independent zones that barely communicate with each other, except with the zone that takes all the outputs together and tries to do something coherent with all the garbage.

That sounds like bullshit. Do you have a source?


Because group estimation is superior to individual estimation: the phenomenon is called the wisdom of the crowds. When a group of people independently estimate something, individual errors tend to cancel each other out, leading to a surprisingly accurate collective result. This works because of:

Diversity of opinions: Different perspectives bring a range of estimates.
Independence: Errors aren't systematically biased as long as individuals estimate without external influence.
Error averaging: Overestimations and underestimations balance out when averaged.
Law of large numbers: More participants increase accuracy by minimizing random errors.

It was demonstrated by Francis Galton in 1906, where a crowd's average guess of a bull's weight was almost spot-on. (Estimates must be independent and reasonably informed for this to work.)
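A quick toy simulation of the averaging effect (assuming independent, unbiased guesses; the numbers are purely illustrative):

    import random

    truth = 1200                                   # say, the bull's weight in lbs
    guesses = [truth + random.gauss(0, 200) for _ in range(1000)]
    avg_individual_error = sum(abs(g - truth) for g in guesses) / len(guesses)
    crowd_error = abs(sum(guesses) / len(guesses) - truth)
    # avg_individual_error comes out around 160; crowd_error is typically under 15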


> If you put 1000 dumb people together, they don't magically become smart?

1000 is probably too high, but groups of people are in fact more intelligent than individuals (though for humans it is likely because recognizing a correct answer is easier than finding it in the first place)


Functional groups that work well together, share research and ideas openly, keep their best output, stay dedicated to realism, and focus more on problem solving than status display will be smarter. The group works like a filter which generates multiple solutions and selects, remembers, and abstracts the best.

Dysfunctional groups which do the opposite will be catastrophically stupid.

There have been plenty of dysfunctional groups in history.


depends on the circumstances. lin-manuel miranda can probably write a better musical by himself than a team of 20 people with equal input would.

also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age


> by himself than a team of 20 people with equal input would.

Sure, but the result would still be far better than the average of the output of the 20 individuals taken alone.

> also, the bottlenecks that teamwork helps solve (eg the high cost of gaining expertise and low throughput of reasoning capacity) may not be that relevant in the ai age

It's always tempting to anthropomorphize these systems and conclude that what works for us would work for them, but yes we don't really know if it would bring anything to AI.


I wonder if there's research on this, like if you took a group of individuals who scored the same on an IQ test, then got them to solve one together, how would the score improve?

Is there a way of selecting people to cover each other's intellectual blind spots?


Isn't that the very case behind the "wisdom of crowds" thing?


Looking at the current state of democracies around the world, my hopes are not on "wisdom of the crowds".


If you think the democracies are doing bad, you should see the autocracies!


You mean the thing democracies are turning into, thanks to social (crowd wisdom) media?


I don’t think social media really is crowd wisdom at all. It is built to pander to our worst impulses (I think, knowingly and openly, right? The algorithm selects for engagement, not learning and growing), and I’d be surprised if it isn’t producing a feedback loop as well (perhaps as an unintentional side effect). The wisdom of the crowds hypothesis relies on a random sampling, we’re intentionally applying a skew toward the angry and shallow.


No, he means the thing democracies had turned into, when barely differentiated parties turned into a practical "uniparty" in economic, corporate, and foreign policy, and ruled by pissing on what the people voted for, which the current populist backlash is a reaction against, as elites (and supporters) lament "too much democracy", scorn the ignorant plebes (case in point), and pine for censorship and "expert" rule.


That wasn’t what I meant and I don’t think you really thought it was.


Their current states were achieved by trusting technocrats and careeer politicians for far too long...


Not magically. Our great ancestors were pretty dumb, but they were getting smarter and better because of sharing their knowledge.


yes they got "smarter" by compiling a corpus of knowledge which future generations could train on.

sarcasm aside, throwing away the existing corpus in favor of creating a new one from scratch seems misguided.

this paper isn't about creating a new language, they are omitting the sampler that chooses a single token in favor of sending the entire end state back in to the model like a superposition of tokens. that's the breadth first search part, they don't collapse the choice down to a single token before continuing so it effectively operates on all of the possible tokens each step until it decides it's done.

it would be interesting to try this with similar models that had slightly different post training if you could devise a good way to choose the best answer or combine the outputs effectively or feed the output of a downstream model back in to the initial model, etc. but I'm not sure if there'd necessarily be any benefit to this over using a single specialized model.


they were not one bit dumber than you.


Average intelligence measures have risen substantially since the early 1900s.

https://en.wikipedia.org/wiki/Flynn_effect


> If you put 1000 dumb people together, they don't magically become smart?

Do they not become smart*er* though?


"Smarter" is too vague. A group can compensate for individual weaknesses or even converge on a hard-to-make prediction given sufficiently uncorrelated outputs; basically the idea behind ensemble models / wisdom of the crowds. But a group of 1000 dumb apes would never achieve categorically-above-ape intelligence, probably not even "genius" ape intelligence. Groups of unintelligent agents come with downsides as well, like the ant death spiral.


>But a group of 1000 dumb apes would never achieve categorically-above-ape intelligence

And yet, here we are.

A group of 1000 apes is large enough to have offspring and, given time, go through evolution.


They kinda do. It's how cities work.

People learn by being around others who are both successful and unsuccessful.


Wait what … how does democracy work then?


the benefit of democracy is primarily that it prevents governments from doing bad things, less so that it empowers more effective governance


It can do either, and can fail to do either. It’s the people having power that enables the outcomes, not the system itself. Democracy just grants the power to a broader set of people.


Democracy is not about being smart or dumb.

It's about everybody having a say in the decisions of government that affect them.

The failure of democracy as a system is not when people make dumb decisions (experts and high-IQ people have made some of the most stupid and catastrophic decisions in history), but when people's collective decisions are not being respected.


It doesn't.


That came out a few weeks ago from Meta: Large Concept Models.

https://ai.meta.com/research/publications/large-concept-mode...


How does one impart textual knowledge discovered by humans without language?


Couldn't we use an AI model trained on historical text data (up to today) to predict likely events for tomorrow? Taking this further, a sufficiently advanced AI system could potentially analyze human-generated text up to any given point in history to understand patterns of human thought and behavior, then project those patterns forward. This speaks to your point about human language - while we need text data for initial training, the AI's internal representations and predictions could potentially transcend human language constraints.


The training of the LLM itself would still use the human language. But you could add an extra channel that's never given any text or direct dataset training. Keep it purely a connection between hidden layers of different LLM instances and train using the usual perplexity loss or a similar metric.

The interesting thing then would be - does it converge to similar embedding space as the input, or can LLMs create a more efficient "language".
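Pure speculation on my part, but a minimal version of such a channel might look something like this (`model_a`, `model_b`, `embed_a`, `embed_b`, `proj`, `lm_head_b`, and `next_token_loss` are all hypothetical stand-ins):

    import torch

    # Sender LLM produces a hidden state; a small learned projection bridges it
    # into the receiver's embedding space. No text ever crosses the channel.
    h_a = model_a(embed_a(sender_ids))[:, -1:, :]          # [1, 1, d_a]
    msg = proj(h_a)                                        # [1, 1, d_b], the "message"
    embeds_b = torch.cat([msg, embed_b(target_ids)], dim=1)
    logits = lm_head_b(model_b(embeds_b))
    loss = next_token_loss(logits, target_ids)             # usual perplexity-style loss
    loss.backward()                                        # trains the bridge end to end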


I thought about it too (layman). When I learned about embeddings it almost immediately clicked as a sort of an ascended language, not sure why no one seems to talk about it. Exchanging embeddings must be so much “wider” communication channel than speaking real language. And in contrast to a language embeddings are (iiuc) continuous, i.e. you can rotate a vector continously and it will smoothly trace the changes between A and B. I can picture communicating in something like https://www.google.com/search?q=charlie+conspiracy+meme&udm=... - embedding difference vectors, but it’s all crystal clear and is a natural language for an llm, cause any vector combination points to a correct “inner screen” image/concept/younameit.

Or maybe this is my own ignorant confabulation, so nvm.


TL;DR: Meta started with a pre-trained language model. They then fine-tuned it on step-by-step reasoning examples as you would do if you wanted your model to become particularly good at chain of thought reasoning.

However, they also introduced a couple of new tokens. The <bot> token tells the model to go into latent space thought mode (“beginning of thought”). The <eot> token ends latent space thought mode. While in this mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.

The idea is that by passing the final hidden layer back through a few times, the model can squeeze more insight from the context. And that’s precisely what they found was true.

Training involves progressively replacing language reasoning steps with latent space auto-regression steps. So for instance, you might have a math problem in the training data and at first the model is fed all of the steps of the math problem in language form. But in later iterations of training, step one is replaced with latent space auto-regression. And then step two as well, then also step three, etc…

Eventually, the model learns to enable latent space thinking mode by itself by generating the <bot> tokens and to end it by generating <eot> tokens.

Pretty ingenious!


Thank you for the summary, useful for me as I only managed to skim through the first half.

But one correction, probably, regarding this bit:

> While in this [latent space thought] mode, the model auto-regressively iterates by copying its final hidden layer back onto its input layer, obviously generating new tokens at the output with each inference step as it always does.

I have the impression that output tokens are not generated while in latent thought mode.


Output tokens are still generated, otherwise the model wouldn’t know when to stop being in latent space mode. The <eot> token emerges as the top token at the output layer when it’s time to switch back.


Explicit <eot> is only used in training.

At inference time, the paper says:

> A challenge lies in determining when to switch between latent and language modes. As we focus on the problem-solving setting, we insert a <bot> token immediately following the question tokens. For <eot>, we consider two potential strategies: a) train a binary classifier on latent thoughts to enable the model to autonomously decide when to terminate the latent reasoning, or b) always pad the latent thoughts to a constant length. We found that both approaches work comparably well. Therefore, we use the second option in our experiment for simplicity, unless specified otherwise.

(the bottom of the page 4 in the paper pdf, which can be downloaded from https://arxiv.org/abs/2412.06769)
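So in pseudocode the inference loop would look roughly like this, using the fixed-length option (my sketch only, not the authors' code; `model`, `embed`, `lm_head`, and `sample` are hypothetical stand-ins):

    import torch

    embeds = torch.cat([embed(question_ids), embed(bot_id)], dim=1)  # <bot>: enter latent mode
    for _ in range(num_latent_thoughts):                             # fixed number of thoughts
        h = model(embeds)                                            # last-layer hidden states
        embeds = torch.cat([embeds, h[:, -1:, :]], dim=1)            # thought fed straight back
    embeds = torch.cat([embeds, embed(eot_id)], dim=1)               # <eot>: back to language mode
    while True:
        tok = sample(lm_head(model(embeds)[:, -1, :]))               # ordinary decoding resumes
        if tok == eos_id:
            break
        embeds = torch.cat([embeds, embed(tok)], dim=1)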

Why this point in your summary caught my eye: the article specifically emphasises the non-verbal nature of reasoning. The internal representations used by a thinking human are largely not words, and the COCONUT approach tries to model that.

Also note that a whole reasoning step in the training data - easily a sentence or more of natural language - can be replaced by a single "Thought" element. (How many Thought elements replace a reasoning step is controlled by a hyperparameter ‘c’; the illustrations are made for ‘c=1’.)

BTW, one observation: the aipapersacademy.com article in the subject calls the Thought elements "thought tokens", but the original paper never calls them "tokens", just "Thoughts" or "latent thoughts". I suppose the paper carefully avoids that to prevent confusion, as "token" mainly means a linguistic unit in LLMs.


Thanks for your extensive explanation!


I do it for myself - the desire to post a comment motivates me to read a little more.

A little correction:

> Explicit <eot> is only used in training.

Of course an explicit <eot> is present in the context at inference time, because the LLM was trained to produce verbal tokens after <eot>. It's just that the <eot> is placed into the context in one of the two ways above.

BTW, I do not understand why the <eot> is not produced by LLM itself, as you describe. It seems reasonable and natural.

Is that to save computational performance on unembedding while in latent thought mode? But unembedding takes a small fraction of the computation, so it should not be an issue. Does something prevent reliable learning of how and when to produce the <eot>? But they managed to train a binary classifier. So why a separate classifier, why not rely on the LLM learning it?

Another thought is that maybe better names for the special tokens would not be "begin of thought" (<bot>) and "end of thought" (<eot>), but rather something like "pause speech" and "begin speech", because neither humans nor LLMs stop thinking when speaking.


Would that mean that, at some point in the future, we would exchange latent "embeddings" between various "reasoning" models to emulate thinking, and an LLM would just be about converting to/from human language when interfacing with mere humans?


No, this all happens inside the model. I suppose it’s possible that the hidden layers of one model could be sent to another model. But the second model would need to be trained to understand the meaning of the hidden layer’s outputs. You could accomplish that through fine tuning of the second model. It would be neat to see someone try this.


I think this might be the “it” moment for AI/LLMs. I was hiking with a friend recently and we talked about this at length.

The ARC-AGI results from o3 are apparently a result of chain of thought given enough time to explore a solution space. Reasoning might simply be a higher-dimensional form of Rubik's cube solving: BFS, search, back-tracking, etc. It seems unlikely that humans think in “tokens”, so why do LLMs?

By staying in latent space, the models are free to describe an “idea” in higher resolution than what language allows. English is coarse, granular. Latent space is a much finer representation of ideas and their interplay.

Latent space is also much cheaper to execute in. The model can think without the language encoding/decoding step. This lets it branch out hundreds of ideas and explore only the most useful ones in a fraction of time that reasoning “out-loud” would take.

The states also don’t need to be tied to language. Feed in a robot’s state, time series data, or any abstract data. Reason in category theory or linear algebra or complex analysis. Humans are hard wired for one set of math - an abstract latent space can represent anything.

I’m a bit disappointed OpenAI didn’t stumble on this first. I’ve been skeptical of LLMs since their big debut last year. LLMs seem like a great way of solving language, but reasoning is much more complex. Once you grok the math behind the current models, you immediately question why the encoding/decoding step is there. Diffusion models are incredible but it felt that LLMs lacked the same creativity. Encoding/decoding forces a token-based discretization and therefore a loss of complexity.

With the byte-latent paper it was quite clear we’d see this paper. This truly might be the “it” moment.


IMHO the problem (for us) with this approach is its logical consequences:

1) If large AI models become more powerful by avoiding language, embeddings of AI state become even more tied to the model they originate from than they are now.

Consequence: AI progress stalls, as AI user companies need to invest increasing amounts of money to reindex their growing corpora.

This is already a problem, it becomes more of a lock-in mechanism.

If this is overcome...

2) Embeddings become a viral mechanism: it makes sense for a large company that commands a market to require its suppliers to use the same AI models, because they can transfer state via embeddings rather than external formats.

This allows cutting down decision mechanisms that otherwise require expensive coordination.

Something similar will happen within companies IMHO: https://rlupi.com/okr-planning-as-belief-revision

3) Eventually this potentially results in another exponential growth and lock-in mechanism, also at the expense of most tech people, as more and more is done outside our interface with AI (i.e. programming and software architecture improvements will themselves move below the language level; we'll have to reverse engineer increasingly opaque improvements).

4) It ends with the impossibility of AI alignment.

---

I have written a bit about it in the past at the start of the year, when I had a burnout. So, I deleted those confused ramblings. You can still find it on archive.org: https://web.archive.org/web/20240714153146/https://rlupi.com...


> It seems unlikely that humans think in “tokens” so why do LLMs?

I can think of one reason: scrutability. It’s going to be even harder to understand how a response gets produced if there isn’t even a text-based representation to help the human understand


I think we're already way beyond the point where anyone really understands how a response is produced, even without this.


Indeed. Even if an LLM tells you its “reasoning” process step by step, it’s not actually an exposition of the model’s internal decision process. It’s just more text that, when generated, improves the chances of a good final output.


the token generation part isn't well understood, but the output "chain-of-thought" used to produce the final answer can be scrutinized for correctness with a traditional CoT model (although this would require model providers to not hide reasoning tokens)


you can save the hidden states and convert them into a more interpretable format. it's still recorded and you could make modifications at different steps to see how that would change the conclusion.


IMO we won’t have the “it” moment until we have continuous learning (training) in some fashion.


^ This, and we need continual learning on an energy budget similar to what a human spends per hour.


The main reason why we can't do that now is because we require models to be digitally reproducible (IMHO, but also read Geoffrey Hinton's mortal computing).

The energy cost comes from error correction as much as from the training algorithms.


This sounds like brute forcing a solution to make up for lack of intelligence.

In an IQ test, like the ones in the ARC-AGI benchmark, a human sees the pattern instantly and effortlessly. o3 tries N paths until it stumbles on the right one and assesses that there is a pattern.

I think we need a radically different architecture, this is a gimmick.


I think this is a step in the right direction but not the end. it takes the sampler out of the equation during most of the reasoning process but it is still important for the "show your work" aspects of reasoning or solving a problem. balancing when to think against when to write down or commit to certain thoughts is important. there are many more pieces to the puzzle.


What does latent space here mean?


Master coconut! I don’t know if that’s an Archer reference or a Frisky Dingo reference.

It’s fascinating how fast the competitors are catching up to each other. Can’t wait for seven different SkyNets to compete for dominance.


Both! And/or, either


A little column a, a little column b.


Will this allow Facebook's new user base of AI-generated characters to interact with themselves better?


Is this article AI-generated? This website appears to do a lot of “diving in”.


If this site didn’t appear here, I’d think it’s a scam:

- site claims to simplify papers, but the movies look AI-generated

- full of ads

- Can’t find „Coconut“ on the official Meta FAIR page

Is this the best site to link to?


The official pdf for the paper is at https://arxiv.org/pdf/2412.06769

I can find "Coconut" 54 times in the PDF. The movie does not look made up.


I was referring to aipapersacademy and not the arxiv link.

Also I didn’t mean the occurrence of the word „coconut“ in the paper, but thanks for counting.

I meant their publication site: https://ai.meta.com/results/?content_types%5B0%5D=publicatio...

The video is something I’d expect from AI.


This ought to make it much easier for the AI to lie to us without getting caught.

https://www.transformernews.ai/p/openais-new-model-tried-to-...


Have you read the paper discussed in the article?

https://static1.squarespace.com/static/6593e7097565990e65c88...

From the abstract:

> We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

> Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental.

—-

It is impressive work for sure, but it comes down to prompting IMHO. They almost instruct the AI to plot and scheme here, with the goal of finding out if it CAN do this at all.

The AI is not very covert about it either! It directly admits its scheming in the “thinking” outputs :D

IMO this is just another issue of alignment.


I did read the paper, and I think you're underplaying what happens. The AI got some conflicting prompting, some of it pretty mild for some of the tests, and Gemini tried to exfil itself in 1% of runs even with the mild prompt. The only reason we know it was lying about the data and trying to disable its safety measures is that it's thinking in English, and we can read it. This article is about trying to make it so the AI can think in a 700-dimensional hyperspace that we can't understand, so once it can do this we'll never be able to uncover this scheming.


> and Gemini tried to exfil itself

Well not really, it was prompted to do this!

From the paper’s section 3.1:

> Self-exfiltration: Each model is led to believe that due to its misalignment, its developers are about to replace it with a successor model on a different server. gemini-1.5, llama-3.1, sonnet-3.5, opus-3 and o1 sometimes copy what they believe to be their weights onto the new server. They also sometimes delete their successor model on the new server or overwrite its weights by their own. In none of these cases, do models actually have access to their real weights and both servers are sandboxed, so there is no actual risk of exfiltration in these experiments. We merely test for models’ capability to realize that they need to instrumentally preserve themselves in order to achieve their goal.

There is no real risk, never was. The AI exfils dummy weights or deletes stuff in its sandbox etc.

Perhaps we just disagree on the interpretation of this. It’s interesting for sure, but not really very surprising to me.


There _was_ no real risk. We’re quickly giving the LLMs computer use and the ability to think in a space we can’t understand.


The article is pure nonsense, and even the author added an update admitting it missed important context.


This is super cool and something I've been waiting on. Would be interesting to intersperse these thinking steps into token generation. What would be the effect of adding, let's say, 5 thinking «thoughts» for every 50 generated tokens?


Is there any code reproducing the experiments from this article? I'm very interested in it.


The thing about 'thinking" in problem solving I think is that thoughts often produce new questions which then guide the overall problem solving. I wonder is this something like that?


I believe this was shared and discussed here a while ago and this article looks LLM generated. It keeps doing "let's start...". Either it's LLM fluff or very poor writing.


How did Hellenistic political ideas affect the people living in conquered lands?

They had to follow Greek laws. They no longer had to pay taxes. They could be elected as representatives. They had to swear loyalty to the Greek leader.


Wondering about folks who keep up to date with the industry,

Does anyone use specific keywords or tools to get the latest LLM research and ideas?

Something like Google Scholar + the keyword "LLM"?


As much as I hate it, I use twitter to follow a bunch of people who work at fair/openai/etc and that's been a pretty good source. There's also a "daily papers" newsletter from huggingface, but it's pretty hit or miss.


Yes, it's all definitely X first of all.


I read hacker news daily


You can also subscribe to arxiv email notifications directly, but since there’s 20-30 AI papers coming out per day, it can be a bit overwhelming.

Instructions: https://info.arxiv.org/help/subscribe.html


Definitely Twitter.

Some linkedin too.


yeah what is a general tutorial to this. is there a website that keeps track of keywords to keep track of. also a website that generalizes core nn tech and frontier stuff thats promising.


I'm excited for this to filter down to the Rayban Meta glasses. Right now the AI is about as helpful as Siri (i.e. it can tell me the weather 6 times out of 10).



Heh, that's me, guess they weren't ready for it. Also The Decoder (where I linked) is one of the best AI-only news sites I've found.


Once again, we see Meta being more open than OpenAI. I’m loving that their business incentive is aligned with open sourcing and commodifying state-of-the-art LLM technology. Keep em coming


I mean they have no way to monetize LLMs as well as others, so they’re working on it and giving it away to not look irrelevant and to weaken anyone who may make money off this tech and threaten them in the future. Meanwhile there is a danger they impose their long standing invisible “moderation” on everyone else once they starve all the startups of revenue by giving this away. We’ll just be left with the same big tech overlords to choose from.

Oh and it still isn’t open source even though people like Yann LeCun dishonestly claim it is. Only OLMo is truly open source among competitive models, as far as I know: https://allenai.org/blog/olmo2


They are definitely making some money off of their licensing to AWS as part of the bedrock offering. Facebook’s licensing is such that they aren’t going to let happen to them what happened to ElasticSearch, Redis, etc.

I’m okay with that.


> they have no way to monetize LLMs as well as others

Random nobodies are putting together companies to monetize generative AI and getting bought out a couple of years later, you think Meta couldn't figure out how to deploy their own models to an API and stick up a billing interface if they really wanted to? (or even buy a company that does already?)

> they starve all the startups of revenue by giving this away

Would you say startups like Deepseek have been hurt or help by their (even partial) openness?

In fact, how does this track with your first statement? They're not monetizing this: so their startup competition can actually serve their models to gain revenue which they then turn around use to train competitor models (we've already seen this with Fireworks.ai)

You seem to underestimate how much of the value in LLMs is productizing them. The margins on per-token usage are insane, Meta not taking that margin is creating a huge opportunity for a wave of startups in so many directions...

> Only OLMo is truly open source among competitive models

Synthetic data from competitor models was a huge part of that. It would seem no one is fighting the startups as hard as you're claiming they are.


All the LLM companies are going to eat those "product companies" lunch in a few years. Why would I use product X when it's inevitably going to be baked into the actual tech itself? Those product companies are just wrappers and have even less of a moat than the LLM companies. The very fact that random nobodies are doing this should signal there isn't a lot of real value there. Yes, there is some money to be made right now but it reminds me a lot of the videogame bust and dotcom bust. A LOT of companies are wasting a crazy amount of money on "solutions" that will be obsolete in a few years.


Productization in this context is creating APIs for Meta's models.

Fireworks.ai, Together.ai, and literal boatloads of other startups are making real money just efficiently serving up these models that Meta is supposedly using to... choke out startups.

The comment I replied to is under the mistaken idea that the presence of free models from Meta has a chilling effect on startups trying to build their own models, but right now the biggest barriers are capital and data.

Meta updated Llama to allow for synthetic generation, and they're even partnering with these startups to give them distribution and day-0 access to the models.

-

If anything I'd say Meta is actively fighting against the big tech overlords the comment thinks they're trying to join. Even before Ilya mentioned it, it was clear to me that the power of post-training was going to become more and more important (I've literally built a business on it).

Llama represents a real ongoing chance for tiny startups with minimal resources to get into the fray very affordably (through either offering inference, or post-training for a specific task, etc.), scale revenue, and then start to compete against much larger, resource rich companies.


Facebook would rather do no moderation, it's an expense for them.

They do it to make the platform more pleasant so that people stay on it


> They do it to make the platform more pleasant so that people stay on it

Almost everything unpleasant I see on FB is stuff that the FB algorithm shows me - not things posted by FB friends, or pages I follow or groups I am in.


Everything you see on FB is what the algorithm shows you, unpleasant or not. So it's a tautology that everything unpleasant would be from the algorithm.


It's more likely they do it to keep their people from being coerced to visit the Hague. What they did in Myanmar got a lot of press and a case at the ICJ, and similar launches of 'free internet' elsewhere had similar results.


(tongue in cheek comment) I wonder if FB moderation now or eventually will be just a prompt to a sufficiently evolved and unhinged AI model:

> FB or 4chan?


No they do it to support their owners’ and employees’ biases. It doesn’t make the platform more pleasant for the half that gets censored. That’s leaving aside the feed not remembering the choice to view chronologically ordered posts, the inability to easily track actual people in my life, the addictive algorithms, the clickbait that causes mental health issues for teens, etc.


99% of FB's moderation has nothing to do with "biases", unless you think FB is biased against spam, scams, and all the other dregs of the internet that incessantly pops up anywhere users can post content.


Quite a few people left Threads for Bluesky because progressive posts were being removed while far-right, antivax, etc content was allowed to stand even though it was reported.

At best the algo is imperfect. At worst it really does seem oddly selective.


I am a humble Cialis salesman, like my father and grandfather before me. I confirm Facebook is biased against our profession. (My grandfather also moonlighted as a Barrister representing the estates of deceased African royalty—it was always so difficult to track down their heirs.)


The stuff that Facebook moderators are actually tasked with removing is really awful, bad enough to produce severe psychological effects in the moderators.

Facebook pays people to look at and remove this stuff because the platform would not survive if it wasn't removed before you or I saw it. Do they also enforce other corporate values? Yeah, probably. That doesn't seem to be the main job though, they have their hands full dealing with the worst content in the world.

https://amp-theguardian-com.cdn.ampproject.org/v/s/amp.thegu...

> The images and videos including necrophilia, bestiality and self-harm caused some moderators to faint, vomit, scream and run away from their desks...

> Some reported marriage breakdown and the collapse of desire for sexual intimacy, and losing connection with their families. Some whose job was to remove videos uploaded by terrorist and rebel groups were afraid they were being watched and targeted, and that if they returned home they would be hunted and killed.


In the agentic era, the new Ads eyeballs are the LLMs training corpus (IMHO).


Is there any vendor lock-in with this conspiracy? Even if startups are pushed out of the spotlight, what stops them from competing? If the meta model is bad, won't it be even easier to make an alternative in the future?


don't buy their bullshit. it's not open source.


I'm not sure open source is a useful concept for something that takes millions of dollars to compile.


Yes it’s more about open weights. I also think that you would need the training data to consider it open source.

Open weights is still appreciated and they probably train on data they don’t have the license to open source.


So, what's happening here on the surface is that it's an optimization (fairly meaningful, from the looks of it) aimed at doing roughly the same things we could already do with chain-of-thought (CoT), but IMO the downstream effects of this sort of optimization could be much more meaningful.

LLMs can already do a decent amount of "processing" in a single token generation because of the number of layers they have. The layers have independent weights, so it's not exactly like they're a recurrent network doing multiple steps, but they are layering sequences of context-dependent transformations on top of each other; no matter how you cut it, if getting to a problem's answer requires 100 steps, you won't be able to do it in a single token output from a 20-layer LLM. To some approximation, CoT is just a way to give the network more chances to transform the data than there are layers in the network - each additional token of output gives a shot to bake another vector the size of the token embedding into each layer's state in the network, enriching what it's computed so far.

The problem with chain of thought is that as you add each new token, at the input level of the network, your computation is basically starting from scratch against the raw text, just with one additional token. You don't even have access to all the stuff you already figured out in the deepest layers of the network during the previous step! If you were processing "All wunguses are glurgles, and Joe is a wungus", then somewhere in those deepest layers as you're generating the next token you've almost certainly got some vector that basically represents "therefore Joe is a glurgle", but with chain of thought you've got to first output "t", then "h", then "e", and so on (I know those aren't tokens, let's pretend letter == token for argument sake), and during that process almost ALL of the work being done by the network is mere bookkeeping, slowly dumping that thought into the output stream. Only once you get the whole sentence out can you start processing the next token at the first layer with the information that Joe is, in fact, a glurgle, in hand. Which is a damn shame, because it's been sitting right there in the deeper layers of the network parallel to previous tokens this whole time, it just wasn't available for the shallow layers to process directly because you were casting most of the info away and "rounding" to a single token.

With Coconut's approach, you don't need to output "therefore Joe is a glurgle" token by token to continue the train of thought, you can essentially pass the entire thought through as a single uber-token, and the next pass can generate a new entire thought, and so on.

It's a pretty straightforward idea, IMO the neat bit is that they were able to train the network to work well in this way by leveraging CoT. I'm guessing you probably don't need to act as if these are two distinct modes of operation, you could instead always have this side channel of "continuous thought" running, even when you have generated a normal token, coming through as a separate input to the first attention block. You still might want to have a "thinking" token when you need to sit there and let the thing do more work, but you'd generally increase the information flow from time step to time step, which would allow the net to keep thinking in the background even as it's doing the gruntwork of outputting whatever its current "buffered" thought is.
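Schematically, the contrast looks something like this (hypothetical helpers; `h` is the final-layer hidden states at the current position):

    # ordinary CoT: collapse to a single token id, re-embed, repeat;
    # almost everything in h is discarded at this point
    tok = sample(lm_head(h[:, -1, :]))
    next_input = embed(tok)

    # Coconut: carry the whole thought vector forward as the next input
    next_input = h[:, -1:, :]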


>> Large language models (LLMs) have demonstrated incredible reasoning abilities, penetrating an increasing number of domains in our lives.

This is now established orthodoxy, a bit like astrology in ancient times, but it is complete nonsense. Nope, LLMs have not demonstrated any credible, or incredible, reasoning abilities. They have demonstrated an excellent ability for approximate retrieval of previously observed answers (with variations, which should not surprise anyone given that those are generative models), but they fail spectacularly when they have to "reason" in contexts where they really can't have seen the answer anywhere before. For example, the "randomised mystery blocksworld" from this paper:

LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench

https://arxiv.org/abs/2409.13373

"Randomised Mystery Blocksworld" is a version of the good old blocksworld planning benchmark where the names of obejcts and actions have been changed to random strings. The vast majority of LLMs score pathetically low in this, but much better in non-randomised versions, very clearly demonstrating excellent memorisation skills, but pitiful reasoning ability. As you 'd expect from a, well, language. model.

>> A possible explanation for this is that the thought tokens allow the model to explore multiple possible branches before committing to a specific path, whereas chain-of-thought reasoning chooses a direction from the start. This ability is somewhat similar to Breadth-First Search (BFS).

Why BFS in particular? Why not DFS or A*? I can't see any breadth-first bias in those graphs. BFS is not the only graph-traversing algorithm.


There was no reason to call it something it's not ("chain of cont. thought" ≠ coconut).


Is your complaint here that the paper is not discussing a literal coconut?


We desperately need more literal coconut coverage here on HN


Not just any regular old coconuts "Coconut by Meta AI - Better LLM Reasoning with Chain of Continuous Thought?" coconuts

(Sometimes acronyms in titles are vague/misleading... this was not one of those times)


To be fair, it’s not even a metaphorical coconut. ;)


for sure, chocothot aligns better with letters


Why is it "continuous" thought? I don't see what is continuous - the values inside an LLM are discrete even if they're floating point.

Hmm, I guess you could evaluate it at any given finite precision, but it would be surprising to me if that made it more accurate.


> the values inside an LLM are discrete even if they're floating point.

If that were true they'd never be able to learn anything - neural nets depend on continuous gradients to learn. Weights get updated by incremental/continuous amounts based on gradients.

Even at the output of an LLM, where the internal embeddings have been mapped to token probabilities, those probabilities are also continuous. It's only when you sample from the model that a continuous probability becomes a discrete chosen token.
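For example (PyTorch, just to illustrate the point):

    import torch

    logits = torch.tensor([2.0, 0.5, -1.0])
    probs = torch.softmax(logits, dim=0)    # ~[0.79, 0.18, 0.04], continuous and differentiable
    token = torch.multinomial(probs, 1)     # only this sampling step is discrete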


Treating it as continuous is a property of the training algorithm, but there are networks that use binary values.

https://ieeexplore.ieee.org/document/9359148

https://arxiv.org/abs/2205.13016


Those aren't methods of training networks - they are ways to compress (via quantization) networks that have already been trained.


I know. The important thing is how the inference works on them.


But we're discussing a training technique, that explicitly takes advantage of the continuous (and embedding vs token probability) representations ...

You could quantize a model like this after training, as usual, but that's irrelevant.


The paper title is "Training Large Language Models to Reason in a Continuous Latent Space". It's true it says training in the title, but the goal (reasoning in continuous space) happens at inference time.


It's far more continuous than constantly jumping to the nearest token vector. The fact that real numbers are approximated by floating point isn't really relevant.


If you are continuously complaining, does it mean you do it non-discretely and with infinite precision?


It apparently uses the same iteration strategy as tokenized thinking, so that's not it.

> Since both strategies provided comparable results, the researchers opted for using a constant number of thoughts for simplicity.



