
The white paper is worth a read. The things that stand out to me are:

1. They don't talk about how they get to 10M token context

2. They don't talk about how they get to 10M token context

3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creating caching abilities is going to be important for a lot of long token chatting features now, though). This is going to make things much, much simpler for a lot of use cases.

4. They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

5. It seems like 1.5 Ultra is going to be highly capable. 1.5 Pro is already very very capable. They are running up against very high scores on many tests, and took a minute to call out some tests where they scored badly as mostly returning false negatives.

Upshot, 1.5 Pro looks like it should set the bar for a bunch of workflow tasks, if we can ever get our hands on it. I've found 1.0 Ultra to be very capable, if a bit slow. Open models downstream should see a significant uptick in quality using it, which is great.

Time to dust off my coding test again, I think, which is: "here is a tarball of a repository. Write a new module that does X".

I really want to know how they're getting to 10M context, though. There are some intriguing clues in their results that this isn't just a single ultra-long vector; for instance, their audio and video "needle" tests, which just include inserting an image that says "the magic word is: xxx", or an audio clip that says the same thing, have perfect recall across up to 10M tokens. The text insertion occasionally fails. I'd speculate that this means there is some sort of compression going on; a full video frame with text on it is going to use a lot more tokens than the text needle.



"The 10M context ability wipes out most RAG stack complexity immediately."

I'm skeptical. In my past experience, just because the context has room for whatever you want to stuff into it doesn't mean you should: the more you stuff into the context, the less accurate your results get. There seems to be a balance between providing enough that you'll get high quality answers, but not so much that the model is overwhelmed.

I think a large part of developing better models is not just better architectures that support larger and larger context sizes, but also models that can properly leverage that context. That's the test for me.


They explicitly address this on page 11 of the report. Basically perfect recall for up to 1M tokens; way better than GPT-4.


I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.


I'd urge caution in extending generalizations about "muddiness" to a new context architecture. Let's use the thing first.


I'm not saying it applies to the new architecture, I'm saying that's a big issue I've observed in existing models and that so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).


Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?

What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
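A rough sketch of that harness (the `ask_model` call and the QA pairs are placeholders for whatever API and benchmark you'd actually use):

    import itertools

    def ask_model(prompt: str) -> str:
        raise NotImplementedError("call the LLM under test here")

    def pad_to_saturation(question: str, filler_words: list[str], n_tokens: int) -> str:
        """Prepend roughly n_tokens of irrelevant words before the question."""
        filler = " ".join(itertools.islice(itertools.cycle(filler_words), max(n_tokens, 0)))
        return f"{filler}\n\nQuestion: {question}"

    def accuracy_at_saturation(qa_pairs, filler_words, context_limit: int, saturation: float) -> float:
        correct = 0
        for question, expected in qa_pairs:
            prompt = pad_to_saturation(question, filler_words, int(context_limit * saturation))
            correct += expected.lower() in ask_model(prompt).lower()
        return correct / len(qa_pairs)

    # e.g. compare 9% vs 99% saturation of a 1M-token window:
    # for s in (0.09, 0.5, 0.99):
    #     print(s, accuracy_at_saturation(qa_pairs, filler_words, 1_000_000, s))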


Manual testing on complex documents. A big legal contract for example. An issue can be referred to in 7 different places in a 100 page document. Does it give a coherent answer?

A handful of examples show whether it can do it. For example, GPT-4 turbo is downright awful at something like that.


You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.


Would be awesome if it is solved but seems like a much deeper problem tbh.


Unfortunately Google's track record with language models is one of overpromising and underdelivering.


It's only the web-interface LLMs of the past few years that have been lackluster; that statement isn't correct for their overall history. Their word2vec-based language models and BERT/Transformer models in the early days (publicly available, though not as a web interface) were far ahead of the curve, as they were the ones that produced these innovations. Effectively, DeepMind/Google are academics (where the real innovations are made), but they struggle to produce corporate products (where OpenAI shines).


I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (it may be true of intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.

Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.

Of course, none of this even matters anyway because the weights are closed, the architecture is closed, and nobody has access to the model. I'll believe it when I see it.


Their in-context long-sequence understanding "benchmark" is pretty interesting.

There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]

They set up a test of in-context learning capabilities at long context - they asked 3 long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists - 125k tokens - fed into the model as part of the prompt), or full-book (the whole 250k tokens fed into the model). Finally, they had human raters check these translations.

This is a really neat setup, it tests for various things (e.g. did the model really "learn" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.
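A bare-bones sketch of those three conditions (the model call and file name are placeholders; the real eval also had human raters score the outputs):

    def call_model(prompt: str) -> str:
        raise NotImplementedError("Gemini 1.5 / GPT-4 Turbo / Claude 2.1 API call")

    # ~250k tokens of grammar + wordlist + parallel sentences, per the report
    KGV_BOOK = open("kalamang_grammar_and_wordlist.txt").read()

    def translate_kgv_to_en(sentence: str, condition: str) -> str:
        if condition == "0-shot":
            reference = ""                               # no Kalamang material at all
        elif condition == "half-book":
            reference = KGV_BOOK[: len(KGV_BOOK) // 2]   # roughly the first ~125k tokens
        else:                                            # "full-book"
            reference = KGV_BOOK
        prompt = (reference + "\n\n"
                  "Translate the following sentence from Kalamang to English:\n" + sentence)
        return call_model(prompt)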

It'd be great to make this and other reasoning-at-long-ctx benchmarks a standard affair for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, Alibi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA since they all use different benchmarks, meaningless benchmarks, or drastically different methodologies that there's no fair comparison.

[1] from https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The available resources for Kalamang are: field linguistics documentation comprising a ~500 page reference grammar, a ~2000-entry bilingual wordlist, and a set of ~400 additional parallel sentences. In total, the available resources for Kalamang add up to ~250k tokens.


I believe that's a limitation of using vectors of high dimensions. It'll be muddy.


Not unlike trying to keep the whole contents of the document in your own mind :)


It's amazing we are in 2024 discussing the degree a machine can reason over millions of tokens of context. The degree, not the possibility.


Haha. This was my thinking this morning. Like: "Oh cool... a talking computer.... but can it read a 2000 page book, give me the summary and find a sentence out of... it can? Oh... well it's lame anyway."

The Sora release is even more mind blowing - not the video generation in my mind but the idea that it can infer properties of reality that it has to learn and constrain in its weights to properly generate realistic video. A side effect of its ability is literally a small universe of understanding.

I was thinking that I want to play with audio to audio LLMs. Not text to speech and reverse but literally sound in sound out. It clears away the problem of document layout etc. and leaves room for experimentation on the properties of a cognitive being.


Did you think the extraction of information from the Buster Keaton film was muddy? I thought it was incredibly impressive to be this precise.


That was not muddy, but it's not the kind of scenario where muddiness shows up.


Page 8 of the technical paper [1] is especially informative.

The first chart (Cumulative Average NLL for Long Documents) shows a deviation from the trend and an increase in accuracy when working with >=1M tokens. The 1.0 graph is overlaid and supports the experience of 'muddiness'.

[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...


Would like to see the latency and cost of parsing the entire 10M context before throwing out the RAG stack, which is relatively cheap and fast.


Also, unless they significantly change their pricing model, we're talking about $0.50 per API call at current prices.


I think there are also a lot of people who are only interested in RAG if they can self-host and keep their documents private.


Yes, and the ability to have direct attribution matters, so you know exactly where your responses come from. And costs, as others point out. But RAG is not gone; in fact, it just got easier and a lot more powerful.


Costs rise on a per-token basis. So you CAN use 10M tokens, but it's probably not usually a good idea. A database lookup is still better than a few billion math operations.


I think the unspoken goal is to just lay off your employees and dump every doc and email they’ve ever written as one big context.

Now that Google has tasted the previously forbidden fruit of layoffs themselves, I think their primary goal in ML is now headcount reduction.


Somehow I just don't see the execs or managers being able to make this work well for them without help. Plus, documents still need to be generated. Are they going to be spending all day prompting LLMs?


LLMs are able to utilize "all the world's" knowledge during training and give seemingly magical answers. While providing context in the query is different than training models, is it possible that more context will give more material to the LLM and it will be able to pick out the relevant bits on its own?

What if it was possible, with each query, to fine tune the model on the provided context, and then use that JIT fine-tuned model to answer the query?


Are you asking what if it was possible that a "context window" ceased to exist? In a different architecture than we currently use, I guess that's hypothetically possible.

As it is now, you can't fine tune on context. It would have almost no effect on the parameters.

Context is like giving your friend a magazine article and asking them to respond to it. Fine tuning is like throwing that magazine article into the ocean of all content they ever came across during their lifetime.


I am not an expert here so I may be mixing terms and concepts.

The way I understand it, there is a base model that was trained on vast amount of general data. This sets up the weights.

You can fine-tune this base model on additional data. Often this is private data that is concentrated around a certain domain. This modifies the model's weights some more.

Then you have the context. This is where your query to the LLM goes. You can also add the chat history here. Also, system prompts that tell the LLM to behave a certain way go here. Finally, you can take additional information from other sources and provide it as part of the context -- this is called Retrieval Augmented Generation. All of this really goes into one bucket called the context, and the LLM needs to make sense of it. None of this modifies the weights of the model itself.

Is my mental picture correct so far?

My question is around RAG. It seems that providing additional selected information from your knowledge base, or using your knowledge base to fine-tune a model, seem similar. I am curious in which ways these are similar, and in which ways they cause the LLM to behave differently.

Concretely, say I have a company knowledge base with a bunch of rules and guidelines. Someone asks an agent "Can I take 3 weeks off in a row?" How would these two scenarios be different:

a) Agents searches the knowledge base for all pages and content related to "FTO, PTO, time off, vacations" and feeds those articles to the LLM, together with the "Can I take 3 weeks off in a row?" query

b) I have an LLM that has been fine tuned on all the content in the knowledge base. I ask it "Can I take 3 weeks off in a row?"


> Is my mental picture correct so far?

Yes

> How would these two scenarios be different

They're different in exactly the way you described above. The agent searching the knowledge base for "FTO, PTO, time off, vacations" would be the same as you pasting all the articles related to those topics into the prompt directly - in both cases, it goes into the context.

In scenario a, you'll likely get the correct response. In scenario b, you'll likely get an incorrect response.

Why? Because of what you explained above. Fine tuning adjusts the weights. When you adjust weights by feeding data, you're only making small adjustments to shift slightly along a curve - thus the exposure to this data (for the purposes of fine tuning) will have very little effect on the next context the model is exposed to.
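Concretely, scenario (a) is just string concatenation - something like this (the search and LLM calls are placeholder stubs):

    def search_knowledge_base(question: str, top_k: int = 5) -> list[str]:
        raise NotImplementedError("keyword or embedding search over your KB pages")

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("any chat/completions API")

    def answer_with_rag(question: str) -> str:
        pages = search_knowledge_base(question)          # e.g. pages matching "PTO", "time off", ...
        prompt = ("Answer using only the company policy excerpts below.\n\n"
                  + "\n\n".join(pages)
                  + f"\n\nQuestion: {question}")
        return call_llm(prompt)                          # weights untouched; only the context changes

    # answer_with_rag("Can I take 3 weeks off in a row?")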


You have to consider cost for all of this. A big value of RAG already, even given the size of GPT-4's largest context window, is that it decreases cost very significantly.


Also, costs are always based on context tokens; you don't want to put in 10M of context for every request (it's just nice to have that option when you want to do big things that don't scale).


How much would a lawyer charge to review your 10M-token legal document?


10M tokens is something like 14 copies of War and Peace, or maybe the entire Harry Potter series seven times over. That'd be some legal document!


Hmm I don’t know but I feel like the U.S. Congress has bills that would push that limit.


> They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

They try to push that, but it's not the most convincing. Look at Table 8 for text evaluations (math, etc.) - they don't even attempt a comparison with GPT-4.

GPT-4 is higher than any Gemini model on both MMLU and GSM8K. Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71). Gemini Pro does crush naive GPT-4 on math (though not with code interpreter and this is the original model).

All in all, 1.5 Pro seems maybe a bit better than 1.0 Ultra. Given that in the wild people seem to find GPT-4 better for, say, coding than Gemini Ultra, my current update is Pro 1.5 is about equal to GPT-4.

But we'll see once released.


> people seem to find GPT-4 better for say coding than Gemini Ultra

For my use cases, Gemini Ultra performs significantly better than GPT-4.

My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

I took 20 prompts that I'd run with GPT-4 and fed them to Gemini Ultra. Gemini gave a clearly better result in 16 out of 20 cases.

Where GPT-4 might miss one or two requirements, Gemini usually got them all. Where GPT-4 might require multiple chat turns to point out its errors and omissions and tell it to fix them, Gemini often returned the result I wanted in one shot. Where GPT-4 hallucinated a method that doesn't exist, or had been deprecated years ago, Gemini used correct methods. Where GPT-4 called methods of third-party packages it assumed were installed, Gemini either used native code or explicitly called out the dependency.

For the 4 out of 20 prompts where Gemini did worse, one was a weird rejection where I'd included an image in the prompt and Gemini refused to work with it because it had unrecognizable human forms in the distance. Another was a simple bash script to split a text file, and it came up with a technically correct but complex one-liner, while GPT-4 just used split with simple options to get the same result.

For now I subscribe to both. But I'm using Gemini for almost all coding work, only checking in with GPT-4 when Gemini stumbles, which isn't often. If I continue to get solid results I'll drop the GPT-4 subscription.


I have a very similar prompting style to yours and share this experience.

I am an experienced programmer and usually have a fairly exact idea of what I want, so I write detailed requirements and use the models more as typing accelerators.

GPT-4 is useful in this regard, but I also tried about a dozen older prompts on Gemini Advanced/Ultra recently and in every case preferred the Ultra output. The code was usually more complete and prod-ready, with higher sophistication in its construction and somewhat higher density. It was just closer to what I would have hand-written.

It's increasingly clear though LLM use has a couple of different major modes among end-user behavior. Knowledge base vs. reasoning, exploratory vs. completion, instruction following vs. getting suggestions, etc.

For programming I want an obedient instruction-following completer with great reasoning. Gemini Ultra seems to do this better than GPT-4 for me.


It constantly hallucinates APIs for me, I really wonder why people's perceptions are so radically different. For me it's basically unusable for coding. Perhaps I'm getting a cheaper model because I live in a poorer country.


Are you using Gemini Advanced? (The paid tier.) The free one is indeed very bad.


Spent a few hours comparing Gemini Advanced with GPT-4.

Gemini Advanced is nowhere even close to GPT-4, either for text generation, code generation or logical reasoning.

Gemini Advanced is constantly asking for directions ("What are your thoughts on this approach?"), even to create a short task list of 10 items, and even when told several times to provide the full list and not stop at every three or four items to ask for directions. It's constantly giving moral lessons or finishing the results with annoying marketing-style comments of the type "Let's make this an awesome product!"

Code is more generic, solutions are less sophisticated. On a discussion of Options Trading strategies Gemini Advanced got core risk management strategies wrong and apologized when errors were made clear to the model. GPT-4 provided answers with no errors, and even went into the subtleties of some exotic risk scenarios with no mistakes.

Maybe 1.5 will be it, or maybe Google realized this quite quickly and are trying the increased token size as a Hail Mary to catch up. Why release so soon?

Quite curious to try the same prompts on 1.5.


I asked Gemini Advanced, the paid one, to "Write a script to delete some files" and it told me that it couldn't do that because deleting files was unethical. At that point I cancelled my subscription since even GPT-4 with all its problems isn't nearly as broken as Gemini.


If you share your prompt I'm sure people here can help you.

Here's a prompt I used and got a script that not only accomplishes the objective, but even has an option to show what files will be deleted and asks for confirmation before deleting them.

Write a bash script to delete all files with the extension .log in the current directory and all subdirectories of the current directory.


I'm going to have to try Gemini for code again. It just occurred to me, as a Xoogler, that if they used Google's codebase as the training data, it's going to be unbeatable. Now did they do that? No idea, but quality wins over quantity, even with LLMs.


There is no way NTK data is in the training set, and google3 is NTK.


I dunno, leadership is desperate and they can de-NTK if and when they feel like it.


What is “NTK”?


"Need To Know" I.e. data that isn't open within the company.


Almost all of google3 is basically open to all of engineering.


> My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

I guess this is a tough request if you're working on a proprietary code base, but I would love to see some concrete examples of the prompts and the code they produce.

I keep trying this kind of prompting with various LLM tools including GPT-4 (haven't tried Gemini Ultra yet, I admit) and it nearly always takes me longer to explain the detailed requirements and clean up the generated code than it would have taken me to write the code directly.

But plenty of people seem to have an experience more like yours, so I really wonder whether (a) we're just asking it to write very different kinds of code, or (b) I'm bad at writing LLM-friendly requirements.


Not OP, but here is a verbatim prompt I put into these LLMs. I'm learning to make Flutter apps, and I like to try to make various UIs so I can learn how to compose some things. I agree that Gemini Ultra (aka the paid "Advanced" mode) is def better than ChatGPT-4 for this prompt. Mine is a bit more terse than OP's huge prompt with numbered requirements, but I still got a super valid and meaningful response from Gemini, while GPT-4 told me it was a tricky problem and gave me some generic code snippets that explicitly don't solve the problem asked.

> I'm building a note-taking app in flutter. I want to create a way to link between notes (like a web hyperlink) that opens a different note when a user clicks on it. They should be able to click on the link while editing the note, without having to switch modalities (eg. no edit-save-view flow nor a preview page). How can I accomplish this?

I also included a follow-up prompt after getting the first answer, which again for Gemini was super meaningful, and already included valid code to start with. Gemini also showed me many more projects and examples from the broader internet.

> Can you write a complete Widget that can implement this functionality? Please hard-code the note text below: <redacted from HN since its long>


This is useful, thanks. Since you're using this for learning, would it be fair to characterize this as asking the LLM to write code you don't already know how to write on your own?

I've definitely had success using LLMs as a learning tool. They hallucinate, but most often the output will at least point me in a useful direction.

But my day-to-day work usually involves non-exploratory coding where I already know exactly how to do what I need. Those are the tasks where I've struggled to find ways to make LLMs save me any time or effort.


> would it be fair to characterize this as asking the LLM to write code you don't already know how to write on your own?

Yea absolutely. I also use it to just write code I understand but am too lazy to write, but it's definitely effective at "show me how this works" type learning too.

> Those are the tasks where I've struggled to find ways to make LLMs save me any time or effort

GitHub Copilot has an IDE integration where it can output directly into your editor. This is great for "// TODO: Unit Test for add(x, y) method when x < 0" and it'll dump out the full test for you.

Similarly useful for things like "write me a method that loops through a sorted list, finds anything with <condition>, applies a transformation and saves it in a Map". Basically all those random helper methods can be written for you.


That last one is an interesting example. If I needed to do that, I would write something like this (in Kotlin, my daily-driver language):

    fun foo(list: List<Bar>) =
        list.filter { condition(it) }.associateWith { transform(it) }
which would take me less time to write than the prompt would.

However, if I didn't know Kotlin very well, I might have had to go look in the docs to find the associateWith function (or worse, I might not have even thought to look for it) at which point the prompt would have saved me time and taught me that the function exists.


Is there any chance you could share an example of the kind of prompt you're writing?

I'm always reluctant to write long prompts because I often find GPT-4 just doesn't get it, and then I've wasted ten minutes writing a prompt.


How do you interact with Gemini for coding work? I am trying to paste my code in the web interface and when I hit submit, the interface says "something went wrong" and the code does not appear in the chat window. I signed up for Gemini Advanced and that didn't help. Do you use AI Studio? I am just looking in to that now.


I've found Gemini generally equal with the .Net and HTML coding I've been doing.

I've never had Gemini give me a better result than GPT, though, so it does not surpass it for my needs.

The UI is more responsive, though, which is worth something.


> Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71).

Though they talk a bunch about how hard it was to filter out Human Eval, so this probably doesn't matter much.


I mean, I don't see GPT-4 watching a 44-minute movie and being able to exactly pinpoint a guy taking a paper out of his pocket.


> The 10M context ability wipes out most RAG stack complexity immediately.

Remains to be seen.

Large contexts are not always better. For starters, it takes longer to process. But secondly, even with RAG and the large context of GPT4 Turbo, providing it a more relevant and accurate context always yields better output.

What you get with RAG is faster response times and more accurate answers by pre-filtering out the noise.


Hopefully we can get better RAG out of it. Currently people do incredibly primitive stuff like splitting text into fixed-size chunks and adding them to a vector DB.

An actually useful RAG setup would be to convert the text to Q&A and use the questions' embeddings as an index. A large context can make use of in-context learning to produce better Q&A.


A lot of people in RAG already do this. I do this with my product: we process each page and create lists of potential questions that the page would answer, and then embed that.

We also embed the actual text, though, because I found that only doing the questions resulted in inferior performance.


So in this case, what your workflow might look like is:

    1. Get text from page/section/chunk
    2. Generate possible questions related to the page/section/chunk
    3. Generate an embedding using { each possible question + page/section/chunk }
    4. Incoming question targets the embedding and matches against { question + source }
Is this roughly it? How many questions do you generate? Do you save a separate embedding for each question? Or just stuff all of the questions back with the page/section/chunk?


Right now I just throw the different questions together in a single embedding for a given chunk, with the idea that there’s enough dimensionality to capture them all. But I haven’t tested embedding each question, matching on that vector, and then returning the corresponding chunk. That seems like it’d be worth testing out.
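If you do test it, the index itself is only a few lines - one vector per generated question, all pointing back at the parent chunk (the embedding call is a placeholder; assumes unit-normalized vectors so a dot product is cosine similarity):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError("call your embedding model; return a unit-normalized vector")

    index: list[tuple[np.ndarray, str]] = []     # (question vector, parent chunk)

    def add_chunk(chunk: str, generated_questions: list[str]) -> None:
        for q in generated_questions:
            index.append((embed(q), chunk))      # one vector per question, all pointing at the chunk

    def retrieve(user_question: str, top_k: int = 3) -> list[str]:
        qv = embed(user_question)
        ranked = sorted(index, key=lambda item: -float(item[0] @ qv))
        chunks: list[str] = []
        for _, chunk in ranked:                  # dedupe chunks, keep best-scoring order
            if chunk not in chunks:
                chunks.append(chunk)
            if len(chunks) == top_k:
                break
        return chunks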


Don't forget that Gemini also has access to the internet, so a lot of RAGging becomes pointless anyway.


Internet search is a form of RAG, though. 10M tokens is very impressive, but you're not fitting a database, let alone the entire internet into a prompt anytime soon.


You shouldn't fit an entire database in the context anyway.

btw, 10M tokens is 78 times more context window than the newest GPT-4-turbo (128K). In a way, you don't need 78 GPT-4 API calls, only one batch call to Gemini 1.5.


I don't get this. Why do people think you need to put an entire database into the short-term memory of the AI for it to be useful? When you work with a DB, are you memorizing the entire f*cking database? No, you know the summaries of it and how to access and use it.

People also seem to forget that the average person reads about 1B words in their entire LIFETIME. 10M tokens with nearly 100% recall is pretty damn amazing; I'm pretty sure I don't have perfect recall of 10M words myself lol


You certainly don't need that much context for it to be useful, but it definitely opens up a LOT more possibilities without the compromises of implementing some type of RAG. In addition, don't we want our AI to have superhuman capabilities? The ability to work on 10M+ tokens of context at a time could enable superhuman performance in many tasks. Why stop at 10M tokens? Imagine if AI could work on 1B tokens of context like you said?


It increases the use cases.

It can also be a good alternative for fine-tuning.

And the use case of a codebase is a good example: if the AI understands the whole context, it can do basically everything.

Let me pay €5 to have an Android app rewritten for iOS.


Well it's nice, just sad nobody can use it


This may be useful in a generalized use case, but a problem is that many of those results again will add noise.

For any use case where you want contextual results, you need to be able to either filter the search scope or use RAG to pre-define the acceptable corpus.


> you need to be able to either filter the search scope or use RAG ...

Unless you can get nearly perfect recall with millions of tokens, which is the claim made here.


> The 10M context ability wipes out most RAG stack complexity immediately.

The video queries they show take around 1 minute each, this probably burns a ton of GPU. I appreciate how clearly they highlight that the video is sped up though, they're clearly trying to avoid repeating the "fake demo" fiasco from the original Gemini videos.


The YouTube video of the multimodal analysis of a video is insane. Imagine feeding in movies or TV shows and being able to auto-summarize or find information about them dynamically. How the hell is all this possible already? AI is moving insanely fast.


> imagine feeding in movies or tv shows

Google themselves have such a huge footprint of various businesses, that they alone would be an amazing customer for this, never mind all the other cool opportunities from third parties...

Imagine that they can ingest the entirety of YouTube and then dump that into Google Search's index AND use it to generate training data for their next LLM.

Imagine that they can hook it up to your security cameras (Nest Cam), and then ask questions about what happened last night.

Imagine that you can ask Gemini how to do something (eg. fix appliance), and it can go and look up a YouTube video on how to accomplish that ask, and explain it to you.

Imagine that it can apply summarization and descriptions to every photo AND video in your personal Google Photos library. You can ask it to find a video of your son's first steps, or a graduation/diploma walk for your 3rd child (by name) and it can actually do that.

Imagine that Google Meet video calls can have the entire convo itself fed into an LLM (live?), instead of just a transcription. You can have an AI assistant there with you that can interject and discuss, based on both the audio and video feed.


I'd love to see that applied to the Google ecosystem, the question is - why haven't they already done this?


IMO, they aren't sure how to monetize it; Google is run by the ads team.

Problem is they are jeopardizing their moat.

Google is still in a great position, they have the knowledge and lots of data to pull this off. They just have to take the risk of losing some ad revenue for a while.


Well, they just announced publicly that the technology is available. Maybe it's just too new to have been productized so far.


Is 10M token context correct? In the blog post I see 1M, but I'm not sure if these are different things.

Edit: Ah, I see, it's 1M reliably in production, up to 10M in research:

> Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

> This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.


I know how I’m going to evaluate this model. Upload my codebase and ask it to “find all the bugs”.


How could one hour of video fit in 1M tokens? One hour at 30fps is 3600*30 ≈ 108k frames. Each frame is converted into 256 tokens. So either they are not processing each frame, or each frame is converted into fewer tokens.


The model can probably perform fine at 1 frame per second (3600*256=921600 tokens), and they could probably use some sort of compression.
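Back-of-the-envelope, assuming the ~256 tokens per frame figure above:

    TOKENS_PER_FRAME = 256
    for fps in (30, 1):
        frames = 3600 * fps
        print(f"{fps:>2} fps -> {frames:>7,} frames -> {frames * TOKENS_PER_FRAME:>10,} tokens")
    # 30 fps -> 108,000 frames -> 27,648,000 tokens  (doesn't fit in 1M)
    #  1 fps ->   3,600 frames ->    921,600 tokens  (just fits in 1M)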


> 1. They don't talk about how they get to 10M token context

> 2. They don't talk about how they get to 10M token context

Yes. I wonder if they're using a "linear RNN" type of model like Linear Attention, Mamba, RWKV, etc.

Like Transformers with standard attention, these models train efficiently in parallel, but their compute is O(N) instead of O(N²), so in theory they can be extended to much longer sequences much more efficiently. They have shown a lot of promise recently at smaller model sizes.
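For reference, the linear-attention recurrence looks roughly like this (a toy numpy sketch in the style of Katharopoulos et al.; it only illustrates why the cost is O(N) and is not a claim about Gemini's internals):

    import numpy as np

    def phi(x: np.ndarray) -> np.ndarray:
        return np.where(x > 0, x + 1.0, np.exp(x))    # elu(x) + 1, a positive feature map

    def causal_linear_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        n, d = Q.shape
        S = np.zeros((d, V.shape[1]))                 # running sum of phi(k) v^T
        z = np.zeros(d)                               # running sum of phi(k), for normalization
        out = np.zeros_like(V)
        for t in range(n):                            # constant work per token -> O(n) overall
            q, k, v = phi(Q[t]), phi(K[t]), V[t]
            S += np.outer(k, v)
            z += k
            out[t] = (q @ S) / (q @ z + 1e-6)
        return out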

Does anyone here have any insight or knowledge about the internals of Gemini 1.5?


The fact they are getting perfect recall with millions of tokens rules out any of the existing linear attention methods.


I wouldn't be so sure perfect recall rules out linear RNNs, because I haven't seen any conclusive data on their ability to recall. Have you?


They do give a hint:

"This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture."

One thing you could do with MoE is give each expert a different subset of the input tokens. And that would definitely do what they claim here: it would allow search. If you want to find where someone said "the password is X" in a 50-hour audio file, this would be perfect.

If your question is "what is the first AND last thing person X said" ... it's going to suck badly. Anything that requires taking 2 things into account that aren't right next to each other is just not going to work.


> Anything that requires taking 2 things into account that aren't right next to each other is just not going to work.

They kinda address that in the technical report[0]. On page 12 they show results from a "multiple needle in a haystack" evaluation.

https://storage.googleapis.com/deepmind-media/gemini/gemini_...


I would perhaps add that it's worrying. Every YouTube evaluation of this model in GCP AI Studio I've seen has commented on its constant hallucinations.


> One thing you could do with MoE is giving each expert different subsets of the input tokens.

Don't MoE's route tokens to experts after the attention step? That wouldn't solve the n^2 issue the attention step has.

If you split the tokens before the attention step, that would mean those tokens would have no relationship to each other - it would be like inferring two prompts in parallel. That would defeat the point of a 10M context


Is MOE then basically divide and conquer? I have no deep knowledge of this so I assumed MOE was where each expert analyzed the problem in a different way and then there was some map-reduce like operation on the generated expert results. Kinda like random forest but for inference.


> I assumed MOE was where each expert analyzed the problem in a different way

Uh, sorta, but not like the parent described at all. You have multiple "experts" and a routing layer (or layers) that decides which expert to send each token to. Usually every token is sent to at least 2. You can't just send half the tokens to one expert and half to another.

Also the "experts" are not "domain experts" - there is not a "programming expert" and an "essay expert".


Regarding how they're getting to 10M context, I think it's possible they are using the new Mamba architecture.

Here’s the paper: https://arxiv.org/abs/2312.00752

And here’s a great podcast episode on it: https://www.cognitiverevolution.ai/emergency-pod-mamba-memor...


As a Brazilian, I approve that choice. Vambora amigos!


Regarding the 10M tokens context, RingAttention has been shown [0] recently (by researchers, not ML engineers in a FAANG) to be able to scale to comparable (1M) context sizes (it does take work and a lot of GPUs).

[0]: https://news.ycombinator.com/item?id=39367141


> researchers, not ML engineers in a FAANG

Why did you point out this distinction?


It means they have significantly fewer resources (to get the GPUs that would let them scale up context length) and are likely less well-versed in optimization (which also helps with scaling up)[0].

I believe those two things together are likely enough to explain the difference between a 1M context length and a 10M context length.

[0]: Which is not looking down on that particular research team, the vast majority of people have less means and optimization know-how than Google.


Probably to indicate that it's research and not productized?


Re RAG: aren't you ignoring the fact that no one wants to put confidential company data into such LLMs? Private RAG infrastructure remains a need for the same reason that privacy of data of all sorts remains a need. A huge context solves the problem for large open-source context material, but that's only part of the picture.


For #1 and #2 it is some version of mixture of experts. This is mentioned in the blog post. So each expert only sees a subset of the tokens.

I imagine they have some new way to route tokens to the experts that probably computes a global context. One scalable way to compute a global context is by a state space model. This would act as a controller and route the input tokens to the MoEs. This can be computed by convolution if you make some simplifying assumptions. They may also still use transformers as well.

I could be wrong, but there are some Mamba-MoE papers that explore this idea.



There will always be more data that could be relevant than fits in a context window, and especially for multi-turn conversations, huge contexts incur huge costs.

GPT-4 Turbo, using its full 128k context, costs around $1.28 per API call.

At that pricing, 1m tokens is $10, and 10m tokens is an eye-watering $100 per API call.

Of course prices will go down, but the price advantage of working with less will remain.


I don't see a problem with this pricing. At 1m tokens you can upload the whole proceedings of a trial and ask it to draw an analysis. Paying $10 for that sounds like a steal.


Unfortunately the whole context has to be reprocessed fully for each query, which means that if you "chat" with the model you'll incur that $10 fee for every interaction, which quickly adds up.

It may still be worth it for some use cases


Of course, if you get exactly the answer you want in the first reply.


While it's hard to say what's possible on the cutting edge, historically models tend to get dumber as the context size gets bigger. So you'd get a much more intelligent analysis of a 10,000-token excerpt of the trial than of a million-token complete transcript. I haven't spent the money testing big token counts in GPT-4 Turbo, but it would not surprise me if it gets dumber. Think of it this way: if the model is limited to 3,000-token replies and an analysis would require a more detailed response than that, it cannot provide it; it'll just give you insufficient information. What it'll probably do is ignore parts of the trial transcript because it can't analyze all that information in 3,000 tokens. And asking a follow-up question is another million tokens.


Would the price really increase linearly? Aren't the demands on compute and memory increasing more steeply than that as a function of context length?


RAG would still be useful for cost savings, assuming they charge per token. Plus, I'm guessing using the full context length would be slower than using RAG to get what you need into a smaller prompt.


This is going to be the real differentiator.

HN is very focused on technical feasibility (which remains to be seen!), but in every LLM opportunity, the CIO/CFO/CEO are going to be concerned with the cost modeling.

The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Maybe this changes with managed vector search offerings that are opaque to the user. The context goes to a preprocessing layer, an efficient cache understands which parts haven't been embedded (new bloom filter use case?), embeds the other chunks, and extracts the intent of the prompt.
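Even something crude would capture most of the win - e.g. a content-hash cache (a plain dict below stands in for the bloom-filter idea) so only unseen chunks hit the embedding API:

    import hashlib

    def embed(chunk: str) -> list[float]:
        raise NotImplementedError("paid embedding call")

    embedding_cache: dict[str, list[float]] = {}     # keyed by content hash

    def embed_with_cache(chunks: list[str]) -> list[list[float]]:
        vectors = []
        for chunk in chunks:
            key = hashlib.sha256(chunk.encode()).hexdigest()
            if key not in embedding_cache:           # only unseen chunks hit the embedding API
                embedding_cache[key] = embed(chunk)
            vectors.append(embedding_cache[key])
        return vectors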


Agreed with this.

The leading ability AI (in terms of cognitive power) will, generally, cost more per token than lower cognitive power AI.

That means that at a given budget you can choose more cognitive power with fewer tokens, or less cognitive power with more tokens. For most use cases, there's no real point in giving up cognitive power to include useless tokens that have no hope of helping with a given question.

So then you're back to the question of: how do we reduce the number of tokens, so that we can get higher cognitive power?

And that's the entire field of information retrieval, which is the most important part of RAG.


> The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Really? Because to my understanding the compute necessary to generate a token grows linearly with the context, and doesn't the OpenAI billing reflect that by separating prompt and output tokens?


> The 10M context ability wipes out most RAG stack complexity immediately.

This may not be true. For me, the complexity of RAG lies in how to properly connect to various unstructured data sources and run data transformation pipelines over large-scale data sets (which means GB, TB or even PB). It's in the critical path rather than a "nice to have", because the quality of the data and the pipeline is a major factor in the final generated result. I.e., in RAG, the importance of R >>> G.


RE: RAG - they haven't released pricing, but if input tokens are priced at GPT-4 levels - $0.01/1K then sending 10M tokens will cost you $100.


In the announcements today they also halved the pricing of Gemini 1.0 Pro to $0.000125 / 1K characters, which is a quarter of GPT3.5 Turbo so it could potentially be a bit lower than GPT-4 pricing.


If you think the current APIs will stay that way, then you're right. But when they start offering dedicated chat instances or caching options, you could be back in the penny region.

You probably need a couple GB to cache a conversation. That's not so easy at the moment because you have to transfer that data to and from the GPUs and store the data somewhere.


The tokens need to be fed into the model along with the prompt and this takes time. Naive attention is O(N^2). They probably use at least flash attention, and likely something more exotic to their hardware.

You'll notice in their video [1] that they never show the prompts running interactively. This is for a roughly 800K context. They claim that "the model took around 60s to respond to each of these prompts".

This is not really usable as an interactive experience. I don't want to wait 1 minute for an answer each time I have a question.

[1] https://www.youtube.com/watch?v=SSnsmqIj1MI


GP's point is you can cache the state after the model processed the super long context but before it ingests your prompt.

If you are going to ask "then why don't OpenAI do it now", the answer is that it takes a lot of storage (and IO), so it may not be worth it for shorter contexts; it adds significant complexity to the entire serving stack; and it doesn't fit how OpenAI originally imagined the "custom-ish" LLM serving game would go - they bet on finetuning and dedicated instances instead of long context.

The tradeoff can be reflected in the API and pricing; LLM APIs don't have to be like OpenAI's. What if you had an endpoint to generate a "cache" of your context (or really, a prefix of your prompt), billed as usual per token, and then you could reuse that prompt prefix for a fixed price no matter how long it is?
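With an open model you can already sketch the idea using Hugging Face transformers: run the long prefix once, keep its key/value cache, and only feed the new tokens afterwards. (Toy illustration only - hosted APIs don't expose anything like this today.)

    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # any causal LM works the same way
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prefix_ids = tok("<the very long shared context goes here>", return_tensors="pt").input_ids
    with torch.no_grad():
        cached = model(prefix_ids, use_cache=True).past_key_values   # compute once, store

    def greedy_answer(question: str, max_new_tokens: int = 30) -> str:
        past = copy.deepcopy(cached)                       # reuse the stored prefix cache per query
        ids = tok(question, return_tensors="pt").input_ids
        generated = []
        for _ in range(max_new_tokens):
            with torch.no_grad():
                out = model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[0, -1].argmax()
            generated.append(int(next_id))
            ids = next_id.view(1, 1)                       # only the new token is fed next step
        return tok.decode(generated)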


Do you have examples of where this has been done? Based on my understanding you can do things like cache the embeddings to avoid the tokenization/embedding cost, but you will still need to do a forward pass through the model with the new user prompt and the cached context. That is where the naive O(N^2) complexity comes from and that is the cost that cannot be avoided (because the whole point is to present the next user prompt to the model along with the cached context).


> The 10M context ability wipes out most RAG stack complexity immediately.

RAG is needed for the same reason you don't `SELECT *` in all of your queries.


> They don't talk about how they get to 10M token context

I don't know how either but maybe https://news.ycombinator.com/item?id=39367141

Anyway I mean, there is plenty of public research on this so it's probably just a matter of time for everyone else to catch up


Why do you think this specific variant (RingAttention)? There are so many different variants for this.

As far as I know, the problem in most cases is that while the context length might be high in theory, the actual ability to use it is still limited. E.g. recurrent networks even have infinite context, but they actually only use 10-20 frames as context (longer only in very specific settings; or maybe if you scale them up).


There are ways to test the neural network’s ability to recall from a very long sequence. For example, if you insert a random sentence like “X is Sam Altman” somewhere in the text, will the model be able to answer the question “Who is X?”, or maybe somewhat indirectly “Who is X (in another language)” or “Which sentence was inserted out of context?” “Which celebrity was mentioned in the text?”

Anyway, the ability to generalize to longer context lengths is evidenced by such tests. If every token of the model's output is able to answer questions in such a way that any sentence from the input is taken into account, this gives evidence that the full context window indeed matters. Currently I find Claude 2 to perform very well on such tasks, so that sets my expectation of how a language model with an extremely long context window should look.
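For anyone who wants to run that kind of check themselves, a minimal version is only a few lines (`call_model` is a stand-in for whatever API you're testing):

    def call_model(prompt: str) -> str:
        raise NotImplementedError

    NEEDLE = "X is Sam Altman."
    QUESTION = "Who is X? Answer with just the name."

    def needle_recalled(filler_paragraphs: list[str], depth: float) -> bool:
        """depth=0.0 buries the needle at the start of the context, 1.0 at the very end."""
        pos = int(len(filler_paragraphs) * depth)
        haystack = filler_paragraphs[:pos] + [NEEDLE] + filler_paragraphs[pos:]
        answer = call_model("\n\n".join(haystack) + "\n\n" + QUESTION)
        return "sam altman" in answer.lower()

    # Sweep depth and context length to build the usual recall heatmap:
    # for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    #     print(depth, needle_recalled(paragraphs, depth))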


> The 10M context ability wipes out most RAG stack complexity immediately.

1. People mention accuracy issues with longer contexts.

2. People mention processing time issues with longer contexts.

3. Something people haven't mentioned in this thread is cost -- even though prompt tokens are usually cheaper than generated tokens, and Gemini seems to be cheaper than GPT-4, putting a whole knowledge base or 80-page document in the context is going to make every run of that prompt quite expensive.


>The 10M context ability wipes out most RAG stack complexity immediately

From a technology standpoint, maybe. From an economics standpoint, it seems like it would be quite expensive to jam the entire corpus into every single prompt.


"I really want to know how they're getting to 10M context, though."

My $5 says it's a RAG or a similar technique (hierarchical RAG comes to mind), just like all other large context LLMs.


It takes 60 seconds to process all of that context in their three.js demo, which is, I will say, not super interactive. So there is still room for RAG and other faster alternatives to narrow the context.


This might be a stupid question - even if there's no quality degradation from 10M context, will it be extremely slow at inference?


>3. The 10M context ability wipes out most RAG stack complexity immediately.

I'd imagine RAG would still be much more efficient computationally


I assume using this large of a context window instead of RAG would mean the consumption of many orders of magnitude more GPU.


RAG doesn’t go away at 10 Million tokens if you do esoteric sources like shodan API queries.


Even 1M tokens eliminates the need for RAG, unless it is for cost.


1 million might sound like a lot, but it's only a few megabytes. I would want RAG, somehow, to be able to process gigabytes or terabytes of material in a streaming fashion.


RAG will not change how many tokens an LLM can produce at once.

Longer context, on the other hand, could put some RAG use cases to sleep: if your instructions are, like, literally as long as a manual, then there is no need for RAG.


I think RAG could be used to do that. If you have a one-time retrieval in the beginning, basically amending the prompt, then I agree with you. But there are projects (a classmate is doing his master's thesis on one implementation of this) that retrieve once every few tokens and make the retrieved information available to the generation somehow. That would not take a toll on the context window.


Or accuracy


I just hope at some point we get access to mostly uncensored models. Both GPT-4 and Gemini are extremely shackled, and a slightly inferior model that hasn’t been hobbled by a very restricting preprompt would handily outperform them.


You can customize the system prompt with ChatGPT or via the completions API, just fyi.


What's RAG?


Retrieval Augmented Generation. In basic terms, it optimizes output of LLMs by using additional external data sources before answering queries. (That actually might be too basic of a description)

Here:

https://blogs.nvidia.com/blog/what-is-retrieval-augmented-ge...
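In code, the basic loop is short: embed your documents once, find the chunks closest to the question, and paste them into the prompt. (The embedding and LLM calls below are placeholders, and it assumes unit-normalized embeddings so a dot product is cosine similarity.)

    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError("embedding model call")

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("LLM call")

    def rag_answer(question: str, chunks: list[str], top_k: int = 3) -> str:
        doc_vecs = np.stack([embed(c) for c in chunks])        # in practice, do this once, offline
        scores = doc_vecs @ embed(question)                    # similarity of each chunk to the question
        best = [chunks[i] for i in np.argsort(scores)[-top_k:]]
        prompt = ("Use the following excerpts to answer.\n\n"
                  + "\n\n".join(best)
                  + f"\n\nQuestion: {question}")
        return call_llm(prompt)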


Is it the same as embedding? Is embedding a RAG method?


I don't think so. I think embedding is just converting a token string into a numeric representation. Numeric representations of semantically similar token strings are close geometrically.

RAG is training AI to be a guy who read a lot of books. He doesn't know all of them in the context of this conversation you are having with him, but he sort of remembers where he read about the thing you are talking about and he has a library behind him into which he can reach and cite what he read verbatim thus introducing it into the context of your conversation.

I might be wrong though. I'm a newb.


Retrieval augmented generation.

> Retrieval Augmented Generation (RAG) is a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting them into the LLM’s context window via a prompt.

(stolen from: https://github.com/psychic-api/rag-stack)


> They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting

I fully disagree. They compare Gemini 1.5 Pro and GPT-4 only on context length; on other tasks they compare it only to other Gemini models, which is a strange self-own.

I'm convinced that if they do not show the results against GPT4/Claude, it is because they do not look good.


Wake me when I can get access without handing over my texts and contacts. I opened the Gemini app on Android and that onerous privacy policy was the first experience. Worse, I couldn't seem to move past the screen accepting Google's ability to hoover up my data in order to disable it in the settings, so I just gave up and went back to ChatGPT, where I at least generally have control over the data I give it.


After their giant fib with the Gemini video a few weeks back I'm not believing anything til I see it used by actual people. I hope it's that much better than GPT-4, but I'm not holding my breath there isn't an asterisk or trick hiding somewhere.


How do you know it isn't RAG?


FYI, MM is the standard for million: 10MM, not 10M. I'm reading all these comments confused as heck about why you are excited about 10M tokens.


Maybe for accountants, but for everyone else a single M is much more common.




