"The 10M context ability wipes out most RAG stack complexity immediately."
I'm skeptical. In my experience, just because the context has room for whatever you want to stuff into it doesn't mean you should: the more you stuff into the context, the less accurate your results get. There seems to be a balance between providing enough that you'll get high-quality answers and not so much that the model is overwhelmed.
I think a large part of developing better models is not just better architectures that support larger and larger context sizes, but models that can actually leverage that context properly. That's the test for me.
I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.
I'm not saying it applies to the new architecture, I'm saying that's a big issue I've observed in existing models and that so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).
Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?
What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
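That sweep could be sketched like this (the `query_model` callable and the naive whitespace "tokenizer" are placeholder assumptions for illustration, not a real API):

```python
# Sketch: measure answer quality as a function of context saturation.
# `query_model` is a hypothetical stand-in for whatever model API is
# being tested; the whitespace token count is a deliberate simplification.

FILLER = "lorem ipsum dolor sit amet"  # irrelevant padding text

def saturate(context: str, window_tokens: int, fraction: float) -> str:
    """Pad `context` with irrelevant words until roughly `fraction`
    of the token window is occupied (naive whitespace token count)."""
    target = int(window_tokens * fraction)
    words = context.split()
    while len(words) < target:
        words.extend(FILLER.split())
    return " ".join(words[:target])

def run_sweep(qa_pairs, window_tokens, query_model):
    """Return accuracy at several saturation levels; a flat curve
    across levels suggests 'muddiness' isn't an issue."""
    results = {}
    for fraction in (0.09, 0.25, 0.50, 0.75, 0.99):
        correct = 0
        for question, context, expected in qa_pairs:
            padded = saturate(context, window_tokens, fraction)
            answer = query_model(question, padded)
            correct += int(expected.lower() in answer.lower())
        results[fraction] = correct / len(qa_pairs)
    return results
```

Plotting accuracy (and answer length, for verbosity) against the saturation fraction would make the correlation, or its absence, directly visible.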
Manual testing on complex documents. A big legal contract for example. An issue can be referred to in 7 different places in a 100 page document. Does it give a coherent answer?
A handful of examples show whether it can do it. For example, GPT-4 Turbo is downright awful at something like that.
You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.
It's only web-interface LLMs over the past few years that have been lackluster; the statement isn't correct for their overall history. W2V-based language models and BERT/Transformer models in the early days (publicly available, just not through a web interface) were far ahead of the curve, as they were the ones that produced these innovations.
Effectively, DeepMind/Google are academics (where the real innovations are made), but they struggle to produce corporate products (where OpenAI shines).
I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (it may be true of intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.
Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.
Of course, none of this even matters anyway: the weights are closed, the architecture is closed, and nobody has access to the model. I'll believe it when I see it.
Their in-context long-sequence understanding "benchmark" is pretty interesting.
There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]
They set up a test of in-context learning at long context: they asked three long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done either 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists, ~125k tokens, fed into the model as part of the prompt), or full-book (the whole ~250k tokens fed into the model). Finally, they had human raters check these translations.
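The three conditions could be assembled roughly like this (the prompt wording and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts or data):

```python
# Sketch of the three evaluation conditions described above. The book
# text passed in is a placeholder for the actual Kalamang materials.

def build_prompt(sentence: str, direction: str, book: str, condition: str) -> str:
    if condition == "0-shot":
        reference = ""                      # no Kalamang material at all
    elif condition == "half-book":
        reference = book[: len(book) // 2]  # ~125k tokens of grammar/wordlists
    elif condition == "full-book":
        reference = book                    # the whole ~250k tokens
    else:
        raise ValueError(f"unknown condition: {condition}")
    return (
        f"{reference}\n\n"
        f"Translate the following sentence ({direction}):\n{sentence}"
    )
```

The interesting comparison is then how rated translation quality moves from 0-shot to half-book to full-book, i.e. whether the model extracted anything usable from the grammar it was shown.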
This is a really neat setup, it tests for various things (e.g. did the model really "learn" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.
It'd be great to make this and other reasoning-at-long-ctx benchmarks a standard affair for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, Alibi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA, since they all use different benchmarks, meaningless benchmarks, or methodologies so drastically different that there's no fair comparison.
The available resources for Kalamang are: field linguistics documentation comprising a ∼500-page reference grammar, a ∼2000-entry bilingual wordlist, and a set of ∼400 additional parallel sentences. In total, the available resources for Kalamang add up to ∼250k tokens.
Haha. This was my thinking this morning. Like: "Oh cool... a talking computer.... but can it read a 2000 page book, give me the summary and find a sentence out of... it can? Oh... well it's lame anyway."
The Sora release is even more mind blowing - not the video generation in my mind but the idea that it can infer properties of reality that it has to learn and constrain in its weights to properly generate realistic video. A side effect of its ability is literally a small universe of understanding.
I was thinking that I want to play with audio to audio LLMs. Not text to speech and reverse but literally sound in sound out. It clears away the problem of document layout etc. and leaves room for experimentation on the properties of a cognitive being.
Page 8 of the technical paper [1] is especially informative.
The first chart (Cumulative Average NLL for Long Documents) shows a deviation from the trend, with an increase in accuracy when working with >=1M tokens. The overlaid Gemini 1.0 curve supports the experience of 'muddiness'.
Yes, and the ability to have direct attribution matters, so you know exactly where your responses come from. And cost, as others point out. RAG is not gone; in fact, it just got easier and a lot more powerful.
Costs rise on a per-token basis. So you CAN use 10M tokens, but it's usually not a good idea. A database lookup is still better than a few billion math operations.
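Back-of-the-envelope (the price per 1M input tokens here is an assumed figure for illustration, not any provider's actual rate):

```python
# Rough cost comparison: stuffing the full window vs. retrieving first.
# PRICE_PER_1M_INPUT_TOKENS is an assumption, not a quoted price.

PRICE_PER_1M_INPUT_TOKENS = 7.00  # assumed, in USD

def prompt_cost(tokens: int) -> float:
    """Cost in USD of a single request with this many input tokens."""
    return tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS

full_window = prompt_cost(10_000_000)  # 10M tokens on every request
rag_prompt = prompt_cost(4_000)        # query plus a few retrieved chunks

print(f"full window: ${full_window:.2f} per request")  # $70.00
print(f"RAG prompt:  ${rag_prompt:.4f} per request")   # $0.0280
```

At any plausible per-token price, the gap is three to four orders of magnitude per request, which is why retrieval stays attractive even when the window is big enough to skip it.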
Somehow I just don't see the execs or managers being able to make this work well for them without help. Plus, documents still need to be generated. Are they going to be spending all day prompting LLMs?
LLMs are able to utilize "all the world's" knowledge during training and give seemingly magical answers. While providing context in the query is different from training models, is it possible that more context will give more material to the LLM and it will be able to pick out the relevant bits on its own?
What if it was possible, with each query, to fine tune the model on the provided context, and then use that JIT fine-tuned model to answer the query?
Are you asking what if it was possible that a "context window" ceased to exist? In a different architecture than we currently use, I guess that's hypothetically possible.
As it is now, you can't fine tune on context. It would have almost no effect on the parameters.
Context is like giving your friend a magazine article and asking them to respond to it. Fine tuning is like throwing that magazine article into the ocean of all content they ever came across during their lifetime.
I am not an expert here so I may be mixing terms and concepts.
The way I understand it, there is a base model that was trained on a vast amount of general data. This sets up the weights.
You can fine-tune this base model on additional data. Often this is private data that is concentrated around a certain domain. This modifies the model's weights some more.
Then you have the context. This is where your query to the LLM goes. You can also add the chat history here. Also, system prompts that tell the LLM to behave a certain way go here. Finally, you can take additional information from other sources and provide it as part of the context -- this is called Retrieval Augmented Generation. All of this really goes into one bucket called the context, and the LLM needs to make sense of it. None of this modifies the weights of the model itself.
Is my mental picture correct so far?
My question is around RAG. Providing additional selected information from your knowledge base and using your knowledge base to fine-tune a model seem similar. I am curious in which ways they are similar, and in which ways they cause the LLM to behave differently.
Concretely, say I have a company knowledge base with a bunch of rules and guidelines. Someone asks an agent "Can I take 3 weeks off in a row?" How would these two scenarios be different:
a) Agents searches the knowledge base for all pages and content related to "FTO, PTO, time off, vacations" and feeds those articles to the LLM, together with the "Can I take 3 weeks off in a row?" query
b) I have an LLM that has been fine tuned on all the content in the knowledge base. I ask it "Can I take 3 weeks off in a row?"
They're different in exactly the way you described above. The agent searching the knowledge base for "FTO, PTO, time off, vacations" would be the same as you pasting all the articles related to those topics into the prompt directly - in both cases, it goes into the context.
In scenario (a), you'll likely get the correct response.
In scenario (b), you'll likely get an incorrect response.
Why? Because of what you explained above. Fine-tuning adjusts the weights. When you adjust the weights by feeding in data, you're only making small adjustments to shift slightly along a curve, so exposure to this data (for the purposes of fine-tuning) will have very little effect on the next context the model is exposed to.
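Scenario (a) can be pictured as plain retrieval plus prompt assembly. Here is a toy sketch, with a made-up two-article knowledge base and naive keyword matching standing in for real retrieval (production agents would use full-text or embedding search):

```python
# Toy version of scenario (a): keyword search over a knowledge base,
# then paste the hits into the context ahead of the question.
# The knowledge base contents are invented for illustration.

KNOWLEDGE_BASE = {
    "pto-policy": "Employees may take up to 4 consecutive weeks of PTO "
                  "with manager approval.",
    "expense-policy": "Submit receipts within 30 days of purchase.",
}

def retrieve(query_terms, kb):
    """Return every article mentioning any of the query terms."""
    hits = []
    for doc_id, text in kb.items():
        if any(term.lower() in text.lower() for term in query_terms):
            hits.append(text)
    return hits

def build_context(question, kb):
    """Assemble the prompt: retrieved articles first, then the question."""
    articles = retrieve(["PTO", "time off", "vacation"], kb)
    return "\n\n".join(articles) + "\n\nQuestion: " + question

prompt = build_context("Can I take 3 weeks off in a row?", KNOWLEDGE_BASE)
# The assembled prompt, not any weight update, is what the model sees.
```

The key point the thread makes survives the simplification: in (a) the policy text sits verbatim in the context, while in (b) it has been diffused into millions of slightly-nudged weights.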
You have to consider cost in all of this. A big value of RAG, even given the size of GPT-4's largest context window, is that it decreases cost very significantly.
Also, costs are always based on context tokens; you don't want to put in 10M tokens of context for every request (it's just nice to have the option when you want to do big things that don't scale).