"The 10M context ability wipes out most RAG stack complexity immediately."
I'm skeptical. In my experience, just because the context has room for whatever you want to stuff into it doesn't mean you should: the more you stuff into the context, the less accurate your results get. There seems to be a balance between providing enough that you'll get high-quality answers and not so much that the model is overwhelmed.
I think a large part of developing better models is not just better architectures that support larger and larger context sizes, but models that can actually leverage that context properly. That's the test for me.
I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.
I'm not saying it applies to the new architecture, I'm saying that's a big issue I've observed in existing models and that so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).
Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?
What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
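That sweep could be sketched like this (the `query_model` callable and the naive whitespace "tokenizer" are placeholder assumptions for illustration, not a real API):

```python
# Sketch: measure answer quality as a function of context saturation.
# `query_model` is a hypothetical stand-in for whatever model API is
# being tested; the whitespace token count is a deliberate simplification.

FILLER = "lorem ipsum dolor sit amet"  # irrelevant padding text

def saturate(context: str, window_tokens: int, fraction: float) -> str:
    """Pad `context` with irrelevant words until roughly `fraction`
    of the token window is occupied (naive whitespace token count)."""
    target = int(window_tokens * fraction)
    words = context.split()
    while len(words) < target:
        words.extend(FILLER.split())
    return " ".join(words[:target])

def run_sweep(qa_pairs, window_tokens, query_model):
    """Return accuracy at several saturation levels; a flat curve
    across levels suggests 'muddiness' isn't an issue."""
    results = {}
    for fraction in (0.09, 0.25, 0.50, 0.75, 0.99):
        correct = 0
        for question, context, expected in qa_pairs:
            padded = saturate(context, window_tokens, fraction)
            answer = query_model(question, padded)
            correct += int(expected.lower() in answer.lower())
        results[fraction] = correct / len(qa_pairs)
    return results
```

Plotting accuracy (and answer length, for verbosity) against the saturation fraction would make the correlation, or its absence, directly visible.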
Manual testing on complex documents. A big legal contract for example. An issue can be referred to in 7 different places in a 100 page document. Does it give a coherent answer?
A handful of examples show whether it can do it. For example, GPT-4 Turbo is downright awful at something like that.
You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.
It's only web-interface LLMs over the past few years that have been lackluster; the statement isn't correct for their overall history. W2V-based language models and BERT/Transformer models in the early days (publicly available, just not through a web interface) were far ahead of the curve, as they were the ones that produced these innovations.
Effectively, DeepMind/Google are academics (where the real innovations are made), but they struggle to produce corporate products (where OpenAI shines).
I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (it may be true of intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.
Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.
Of course, none of this even matters anyway: the weights are closed, the architecture is closed, and nobody has access to the model. I'll believe it when I see it.
Their in-context long-sequence understanding "benchmark" is pretty interesting.
There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]
They set up a test of in-context learning at long context: they asked three long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done either 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists, ~125k tokens, fed into the model as part of the prompt), or full-book (the whole ~250k tokens fed into the model). Finally, they had human raters check these translations.
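The three conditions could be assembled roughly like this (the prompt wording and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts or data):

```python
# Sketch of the three evaluation conditions described above. The book
# text passed in is a placeholder for the actual Kalamang materials.

def build_prompt(sentence: str, direction: str, book: str, condition: str) -> str:
    if condition == "0-shot":
        reference = ""                      # no Kalamang material at all
    elif condition == "half-book":
        reference = book[: len(book) // 2]  # ~125k tokens of grammar/wordlists
    elif condition == "full-book":
        reference = book                    # the whole ~250k tokens
    else:
        raise ValueError(f"unknown condition: {condition}")
    return (
        f"{reference}\n\n"
        f"Translate the following sentence ({direction}):\n{sentence}"
    )
```

The interesting comparison is then how rated translation quality moves from 0-shot to half-book to full-book, i.e. whether the model extracted anything usable from the grammar it was shown.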
This is a really neat setup, it tests for various things (e.g. did the model really "learn" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.
It'd be great to make this and other reasoning-at-long-ctx benchmarks a standard affair for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, Alibi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA, since they all use different benchmarks, meaningless benchmarks, or methodologies so drastically different that there's no fair comparison.
The available resources for Kalamang are: field linguistics documentation comprising a ∼500-page reference grammar, a ∼2000-entry bilingual wordlist, and a set of ∼400 additional parallel sentences. In total, the available resources for Kalamang add up to ∼250k tokens.
Haha. This was my thinking this morning. Like: "Oh cool... a talking computer.... but can it read a 2000 page book, give me the summary and find a sentence out of... it can? Oh... well it's lame anyway."
The Sora release is even more mind blowing - not the video generation in my mind but the idea that it can infer properties of reality that it has to learn and constrain in its weights to properly generate realistic video. A side effect of its ability is literally a small universe of understanding.
I was thinking that I want to play with audio to audio LLMs. Not text to speech and reverse but literally sound in sound out. It clears away the problem of document layout etc. and leaves room for experimentation on the properties of a cognitive being.
Page 8 of the technical paper [1] is especially informative.
The first chart (Cumulative Average NLL for Long Documents) shows a deviation from the trend, with an increase in accuracy when working with >=1M tokens. The overlaid Gemini 1.0 curve supports the experience of 'muddiness'.
Yes, and the ability to have direct attribution matters, so you know exactly where your responses come from. And cost, as others point out. RAG is not gone; in fact, it just got easier and a lot more powerful.
Costs rise on a per-token basis. So you CAN use 10M tokens, but it's usually not a good idea. A database lookup is still better than a few billion math operations.
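Back-of-the-envelope (the price per 1M input tokens here is an assumed figure for illustration, not any provider's actual rate):

```python
# Rough cost comparison: stuffing the full window vs. retrieving first.
# PRICE_PER_1M_INPUT_TOKENS is an assumption, not a quoted price.

PRICE_PER_1M_INPUT_TOKENS = 7.00  # assumed, in USD

def prompt_cost(tokens: int) -> float:
    """Cost in USD of a single request with this many input tokens."""
    return tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS

full_window = prompt_cost(10_000_000)  # 10M tokens on every request
rag_prompt = prompt_cost(4_000)        # query plus a few retrieved chunks

print(f"full window: ${full_window:.2f} per request")  # $70.00
print(f"RAG prompt:  ${rag_prompt:.4f} per request")   # $0.0280
```

At any plausible per-token price, the gap is three to four orders of magnitude per request, which is why retrieval stays attractive even when the window is big enough to skip it.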
Somehow I just don't see the execs or managers being able to make this work well for them without help. Plus, documents still need to be generated. Are they going to be spending all day prompting LLMs?
LLMs are able to utilize "all the world's" knowledge during training and give seemingly magical answers. While providing context in the query is different from training models, is it possible that more context will give more material to the LLM and it will be able to pick out the relevant bits on its own?
What if it was possible, with each query, to fine tune the model on the provided context, and then use that JIT fine-tuned model to answer the query?
Are you asking what if it was possible that a "context window" ceased to exist? In a different architecture than we currently use, I guess that's hypothetically possible.
As it is now, you can't fine tune on context. It would have almost no effect on the parameters.
Context is like giving your friend a magazine article and asking them to respond to it. Fine tuning is like throwing that magazine article into the ocean of all content they ever came across during their lifetime.
I am not an expert here so I may be mixing terms and concepts.
The way I understand it, there is a base model that was trained on a vast amount of general data. This sets up the weights.
You can fine-tune this base model on additional data. Often this is private data that is concentrated around a certain domain. This modifies the model's weights some more.
Then you have the context. This is where your query to the LLM goes. You can also add the chat history here. Also, system prompts that tell the LLM to behave a certain way go here. Finally, you can take additional information from other sources and provide it as part of the context -- this is called Retrieval Augmented Generation. All of this really goes into one bucket called the context, and the LLM needs to make sense of it. None of this modifies the weights of the model itself.
Is my mental picture correct so far?
My question is around RAG. Providing additional selected information from your knowledge base and using your knowledge base to fine-tune a model seem similar. I am curious in which ways they are similar, and in which ways they cause the LLM to behave differently.
Concretely, say I have a company knowledge base with a bunch of rules and guidelines. Someone asks an agent "Can I take 3 weeks off in a row?" How would these two scenarios be different:
a) Agents searches the knowledge base for all pages and content related to "FTO, PTO, time off, vacations" and feeds those articles to the LLM, together with the "Can I take 3 weeks off in a row?" query
b) I have an LLM that has been fine tuned on all the content in the knowledge base. I ask it "Can I take 3 weeks off in a row?"
They're different in exactly the way you described above. The agent searching the knowledge base for "FTO, PTO, time off, vacations" would be the same as you pasting all the articles related to those topics into the prompt directly - in both cases, it goes into the context.
In scenario (a), you'll likely get the correct response.
In scenario (b), you'll likely get an incorrect response.
Why? Because of what you explained above. Fine-tuning adjusts the weights. When you adjust the weights by feeding in data, you're only making small adjustments to shift slightly along a curve, so exposure to this data (for the purposes of fine-tuning) will have very little effect on the next context the model is exposed to.
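Scenario (a) can be pictured as plain retrieval plus prompt assembly. Here is a toy sketch, with a made-up two-article knowledge base and naive keyword matching standing in for real retrieval (production agents would use full-text or embedding search):

```python
# Toy version of scenario (a): keyword search over a knowledge base,
# then paste the hits into the context ahead of the question.
# The knowledge base contents are invented for illustration.

KNOWLEDGE_BASE = {
    "pto-policy": "Employees may take up to 4 consecutive weeks of PTO "
                  "with manager approval.",
    "expense-policy": "Submit receipts within 30 days of purchase.",
}

def retrieve(query_terms, kb):
    """Return every article mentioning any of the query terms."""
    hits = []
    for doc_id, text in kb.items():
        if any(term.lower() in text.lower() for term in query_terms):
            hits.append(text)
    return hits

def build_context(question, kb):
    """Assemble the prompt: retrieved articles first, then the question."""
    articles = retrieve(["PTO", "time off", "vacation"], kb)
    return "\n\n".join(articles) + "\n\nQuestion: " + question

prompt = build_context("Can I take 3 weeks off in a row?", KNOWLEDGE_BASE)
# The assembled prompt, not any weight update, is what the model sees.
```

The key point the thread makes survives the simplification: in (a) the policy text sits verbatim in the context, while in (b) it has been diffused into millions of slightly-nudged weights.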
You have to consider cost in all of this. A big value of RAG, even given the size of GPT-4's largest context window, is that it decreases cost very significantly.
Also, costs are always based on context tokens; you don't want to put in 10M tokens of context for every request (it's just nice to have the option when you want to do big things that don't scale).