> A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words) - https://platform.openai.com/tokenizer
32,000 (tokens) * 4 = 128,000 (characters)
> While a general guideline is one page is 500 words (single spaced) or 250 words (double spaced), this is a ballpark figure - https://wordcounter.net/words-per-page
Assuming (on average) one word = 5 letters, context ends up being ~50 pages (128000 / (500 * 5)).
Just to put the "32k tokens" figure into some rough perspective.
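For reference, the same back-of-envelope conversion as a few lines of Python (the chars-per-token, chars-per-word and words-per-page figures are just the rules of thumb quoted above, not exact values):

    # Rough rules of thumb from the links above, not exact values.
    CHARS_PER_TOKEN = 4
    CHARS_PER_WORD = 5
    WORDS_PER_PAGE = 500  # single spaced

    tokens = 32_000
    chars = tokens * CHARS_PER_TOKEN                   # 128,000 characters
    pages = chars / (WORDS_PER_PAGE * CHARS_PER_WORD)  # ~51 pages
    print(f"{tokens} tokens ~= {chars} chars ~= {pages:.0f} pages")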
That's no good, wasn't my point if so. Luckily it seems like you're in the minority so far, so at least I managed to get the right feeling across to some people.
We both arrived at mostly the same result from the same question of "how many pages of text would 32k tokens be?"; they basically did the calculation again, albeit in a slightly different way. Just like when researchers try to reproduce the results of others' studies.
Probably off topic, but part of this is "good luck writing a grant proposal that says 'I want to reproduce the work of this group' and getting it accepted". Unless of course you are claiming something groundbreaking and paradigm-shifting, like evidence of a unified theory or superconductivity at room temperature.
At $0.60 for 20k prompt tokens, it's not going to be cheap so they will need to bring the price down to get broader adoption.
As far as I can tell, the initial reading in of the document (let's say a 20k token one), will be a repeated cost for each subsequent query over the document. If I have a 20k token document, and ask 10 follow-up prompts consisting of 100 tokens, that would take me to a total spend of 20k * 10 + (10 * 11)/2 * 100 = 205,500 prompt tokens, or over $6. This does not include completion tokens or the response history which would edge us closer to $8 for the chat session.
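Roughly, in code, using the $0.60-per-20k-prompt-tokens figure above (i.e. $0.03 per 1k prompt tokens) and ignoring completion tokens; this is only a sketch of the arithmetic, not of actual billing:

    DOC_TOKENS = 20_000
    FOLLOW_UP_TOKENS = 100
    N_FOLLOW_UPS = 10
    PRICE_PER_1K_PROMPT = 0.03  # implied by "$0.60 for 20k prompt tokens"

    total_prompt_tokens = 0
    for i in range(1, N_FOLLOW_UPS + 1):
        # each request re-sends the document plus every follow-up so far
        total_prompt_tokens += DOC_TOKENS + i * FOLLOW_UP_TOKENS

    print(total_prompt_tokens)                               # 205,500
    print(total_prompt_tokens / 1000 * PRICE_PER_1K_PROMPT)  # ~$6.17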
What I’ve read is that people let a kind of meta-chat run alongside the client interaction. The meta channel decides what parts of the history to retain so the primary channel doesn’t use as many resources. You could let GPT decide when it needs to see the whole context again, etc. There are a lot of interesting ways to let GPT manage its own scope, I think.
But aside from this, if I have a large document that I want to "chat" with, it looks like I am either chunking it and then selectively retrieving relevant subsections at question time, or I am naively dumping the whole document (that now fits in 32k) and then doing the chat, at a high cost. So 32k (and increased context size in general) does not look to be a huge gamechanger in patterns of use until cost comes down by an order of magnitude or two.
Yes, that’s one example. Relatively simple to implement yourself. Any frontend for chatgpt already does the basic thing, which is to pass the previous messages along with the prompt.
I think we may end up with first and second stage completions, where the first stage prepares the context for the second stage. The first stage can be a (tailored) gpt3.5 and the second stage can do the brainy work. That way you can actively control costs by making the first stage forward a context of a given maximum size.
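A minimal sketch of that two-stage idea with the openai Python library (the model names, prompts and word budget here are illustrative assumptions, not a recommended setup):

    import openai

    def prepare_context(history: str, max_words: int = 500) -> str:
        # Stage 1: a cheap model condenses the history to a bounded size.
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content":
                       f"Summarize this conversation in under {max_words} words, "
                       f"keeping every fact needed to continue it:\n\n{history}"}],
        )
        return resp.choices[0].message.content

    def answer(history: str, question: str) -> str:
        # Stage 2: the expensive model only ever sees the condensed context.
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": f"Conversation so far:\n{prepare_context(history)}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content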
Right now all of this is people tinkering with the API. If you look at the docs you will note that it doesn’t even provide chat history or any kind of session. You have to pass all context yourself. So you’re already homebrewing that, why not add some spice.
I forget if Langchain can do that, but something along those lines will exist if it doesn't already; it's too obvious not to, and too important for the free ChatGPT equivalents that will be popping up over the coming days/weeks now that truly free versions of LLaMA are coming out.
TL;DR the puck should get there very soon, not just Really Soon Now.
Testing some random rust code, it's about 15-20 tokens per LoC, so about 1500-2000 LoC in a 32k context.
Interestingly, using 2-space indentation, as soon as you are about 3-4 indentation levels deep you spend as many tokens on indentation as on the actual code. For example, "log::LevelFilter::Info" is 6 tokens, same as 6 consecutive spaces. There are probably a lot of easy gains here reformatting your code to use longer lines or maybe no indentation at all.
Ah, good catch, it's actually closer to 8 tokens per LoC with GPT-4's tiktoken encoding, so about twice as good. Some quick testing suggests that's mostly down to better whitespace handling.
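If you want to measure this on your own code, tiktoken makes it a few lines; cl100k_base is the encoding used by the GPT-4 family, and the file path here is just a placeholder:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5-turbo encoding

    with open("src/main.rs") as f:              # placeholder path
        lines = [line for line in f if line.strip()]

    tokens = sum(len(enc.encode(line)) for line in lines)
    print(f"{tokens} tokens over {len(lines)} LoC = {tokens / len(lines):.1f} tokens/LoC")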
If you wanted to send 32k tokens of code, are you able to do that using a model with a 4k context limit by spreading those tokens out across multiple messages? Or does it not work that way?
Not really. The API is stateless, you pass it in a whole conversation and it responds with the next message. The entire conversation including its response is limited to 32k tokens.
I'm just confused because I thought I remembered sending long chunks of code using the API: the request would fail, but then I would split it up and it would work okay.
I guess I'm running into a different limit (not context length), or maybe I'm misremembering.
The context limit is for request + response, and there is no storage in between requests (ongoing chat interactions are done by adding prior interactions to the prompt, so the whole chat – before things start falling out of history – is limited to the context window.)
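Concretely, "adding prior interactions to the prompt" just looks something like this (a sketch with the openai Python library; the system message is an arbitrary example):

    import openai

    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    def ask(user_text: str) -> str:
        # The API is stateless: the full history is re-sent with every call,
        # and prompt + response together must fit within the context window.
        messages.append({"role": "user", "content": user_text})
        resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        reply = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        return reply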
I was waiting for a while, but then I found there was a page where, unless you selected "I want to build plugins", you would never have seen the option to request them.
Once I filled that in I got access within a few days.
FYI, 27 times per hour is basically nothing. With GPT4 over the API, I make 2-3 completion requests a minute, for 30-60 minutes at a time, when building an LLM app. This happens for 3-4 hours per day.
At the upper bound, this would be $2 * 3 * 60 * 4 = $1440 a day.
Thankfully, I am using retrieval augmentation and context stuffing into the base 4k model, so costs are manageable.
The 32k context model cannot be deployed into a production app at this pricing as a more capable drop-in replacement for shorter-context models.
Depends heavily on your product. I can imagine there are quite a lot of use cases that have relatively infrequent API usage or highly cacheable responses.
Prompt processing scales quadratically with the context size (assuming OpenAI is still using a standard transformer architecture), but processing the prompt is also fast compared to generating tokens because it's done in parallel across positions. So I wouldn't expect effective response times to go up quadratically. At most linearly, depending on the details of how they implement inference.
That's according to this (https://lmsys.org/blog/2023-03-30-vicuna/) promotional blog post and just cited by the google memo right? Which isn't really even a doc, just a memo that was circulating inside google.
I also find it strange they don't contrast gpt4 and gpt3.5
This assessment is based largely on GPT-4 evaluation of the output. In actual use, Vicuna-13B isn't even as good as GPT-3.5, although I do have high hopes for 30B if and when they decide to make that available (or someone else trains it, since the dataset is out).
And don't forget that all the LLaMA-based models only have 2K context size. It's good enough for random chat, but you quickly bump into it for any sort of complicated task solving or writing code. Increasing this to 4K - like GPT-3.5 has - would require significantly more RAM for the same model size.
Is there a way to always stay up to date with the latest and best performing models? Perhaps it's me but I find it difficult to navigate HuggingFace and find models sorted by benchmark.
Against GPT3.5 perhaps the gaps aren’t too big for your use cases, but I wouldn’t say it’s in the GPT4 league. It looks close in the benchmarks but the difference in quality feels (to me) huge in practice. The other models are simply a lot worse.
I don't think it's expensive at all. For things that don't need to be so correct (like, unfortunately, marketing blog posts) it's a <$1 per post generator, which is very cheap to me.
For things where correctness matters, the majority of cost will still come from humans who are in charge of ensuring correctness.
Even if it were around $0.10, this does not scale; it would need to be less than $0.01 per generation to keep up with open-source models, where the cost is effectively $0 (leaving out hardware). These open-source models are still not replacing GPT-4, but they are moving into that territory.
Oh really. Then show me your "open source model" that handles 32k tokens on a consumer-grade PC. Actually, don't show me, show the internet. You'll be the most famous man in the tech world.
Well, surely I can't convince you; feel free to build the next AI startup on OpenAI then, and stop caring about any possible competition outscaling you once token limits on open-source models become more in line with the walled gardens of Google, MS and OpenAI and their high API pricing ;)
My bet is open-source models (truly open source, without strings attached) won't ever catch up with OpenAI et al. I'll be really surprised if there is one that can match GPT-4 in the next 2-3 years. If you tried LLaMA and StableLM you would probably feel the same.
Considering that increasing context length is O(n^2), and that current 8k GPT-4 is already restricted to 25 prompts/3 hours, I think they will launch it at substantially higher pricing.
> current 8k GPT-4 is already restricted to 25 prompts/3 hours
I'm pretty sure they're using a 4k GPT-4 model for ChatGPT Plus, even though they only announced 8k and 32k... It can't handle more than 4k tokens (actually a little below that; it starts ignoring your last few sentences as you get close). If you check developer tools, the request to an API /models endpoint says the limit for GPT-4 is 4096. It's very unfortunate.
As far as I know it's not documented anywhere and there is no way to ask the team at ChatGPT questions. I sent them an email about it a few days after GPT-4 release and still haven't received a reply.
Another thing that annoys me is how most updates don't get a changelog entry. For whatever reason, they keep little secrets like that.
The raw chat log has the system message on top, plus "user:" and "assistant:" labels for each message, and im_start/im_end tokens to separate messages, which is why the visible chat context is slightly under 4k.
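Roughly, the serialization looks like the ChatML-style framing OpenAI has described for gpt-3.5-turbo; the exact system message and wrapper details for ChatGPT itself aren't public, so treat this as an illustration only:

    messages = [
        ("system", "You are ChatGPT, a large language model trained by OpenAI."),
        ("user", "Hello!"),
        ("assistant", "Hi there! How can I help?"),
    ]
    raw = "".join(f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in messages)
    raw += "<|im_start|>assistant\n"  # the model completes from here
    print(raw)

Every one of those wrapper tokens counts against the context window, which is part of why the usable budget ends up a bit under the nominal limit.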
It will be interesting to see how far this quadratic algorithm carries in practice. Even the longest documents can only have hundreds of thousands of tokens, right?
Ideally you'd be able to put your entire codebase + documentation + jira tickets + etc. into the context. I think there is no practical limit to how many tokens would be useful for users, so the limits imposed by the model (either hard limits or just pricing) will always be a bottleneck.
I'm confused by this. Would you want to just include your codebase, documentation, etc. in some last-mile training? That way you don't need the expense of including huge amounts of context in every query. It's baked in.
I haven't tried this myself, but it is my understanding that finetuning does not work well in practice as a way of acquiring new knowledge.
There may be a middle ground between these two approaches though. If every query used the same prompt prefix (because you only update the codebase + docs occasionally) then you could put it into the model once and cache the keys and values from the attention heads. I wonder if OpenAI does this with whatever prefix they use for ChatGPT?
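You can see the mechanism with an open model via Hugging Face transformers: run the shared prefix once, keep past_key_values, and reuse them for each query. This is a local illustration only; whether OpenAI caches anything like this internally is unknown:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prefix_ids = tok("Shared codebase + docs would go here. ", return_tensors="pt").input_ids
    with torch.no_grad():
        prefix_out = model(prefix_ids, use_cache=True)   # compute the prefix K/V once
    cached_kv = prefix_out.past_key_values

    query_ids = tok("Question: what does module X do?", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(query_ids, past_key_values=cached_kv, use_cache=True)  # reuse cached K/V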
Yah... we really need some kind of architecture that juggles concept vectors around to external storage and does similarity search, etc, instead of forcing us to encode everything into giant tangles of coefficients.
GPT-4 seems to show that linear algebra definitely can do the job, but training is so expensive and the model gets so huge and inflexible.
It seems like having fixed format vectors of knowledge that the model can use-- denser and more precise than just incorporating tool results as tokens like OpenAI's plugin approach-- is a path forward towards extensibility and online learning.
Some of the context length will be lost to waste on truncated posts, or are replies not considered part of the context on ChatGPT? In either case, it might be worth designing a prompt, every so often, to get a reply with which to re-establish the context, thus compressing it.
Have you been using the API with GPT-3.5? I wonder if they're prioritizing access to 'active' users who appear to be trying to make something with it, over casual looky-loos.
It is. For API access you have to create an account at https://platform.openai.com. You pay per 1k tokens. For API access to GPT-4, put your organization (org id) on the waitlist.
Again, frustrating. I’m an antibiotics researcher with oodles of data and I need ChatGPT plugins/API to make any real progress. (I’m kind of in this intellectual space on my own, so other people can’t really help that much) I’m not sure why I’ve been on the waiting list for so long now.
I got access to ChatGPT plugins and they’re really bad, completely deserving of “alpha”. I’d be pissed if I paid $25 for this, FYI.
It’s very slow, almost 10X slower than ChatGPT
Its integration is bad. For most plugins it doesn’t do anything smart with its API call. For example, if I ask “nearest cheap international flight”, it literally goes to Kayak and searches “nearest cheap international flight”; if Kayak can’t handle that query, GPT can’t either.
The only plug-in with good integration is Wolfram, and it makes so many syntax errors calling Wolfram that it’s trash. It often just errors out with syntax errors for half my queries.
I wouldn’t have minded if they had spent a few more months internally testing plugins before rolling them out to me, given their current state. The annoying thing is that the chat website automatically starts in plugins mode, which is borderline unusable, so every time I have to click on the drop-down and then choose ChatGPT or GPT4.
Thanks for assuaging my FOMO a bit. I think one of the most frustrating parts is that everyone in my lab looks to me when they see this stuff on Twitter and all I can really do is shrug.
Dude, chill. Plugins are insanely new. Barely anyone has access to them. It just seems like they are widespread because they've been going viral.
The initial blog post was only just over a month ago, and it was announcing alpha access for a few users and developers:
> Today, we will begin extending plugin alpha access to users and developers from our waitlist. While we will initially prioritize a small number of developers and ChatGPT Plus users, we plan to roll out larger-scale access over time.
I think part of the anxiety, at least for me, is how fast progress is being made, too. It can begin to feel like the "LET ME IN" meme when you're watching, all day, the cool things those inside the magic shop can do, lol. I'm a layman, btw, just looking to use it to automate some volunteer work I do. Thanks for this perspective on how new this stuff is.
Can't imagine trying to keep up as a dev. Any of these tools useful for you in practice yet?
I struggle to keep up and all I need to do is understand developments well enough to simplify them in to palatable morsels for my tech skeptic colleagues in politics and non profits.
It's challenging because they have a form of technology PTSD: when they hear "new technology", NFTs of monkeys with 6-digit prices and Peter Thiel's yacht flash before their eyes, and they see red.
And I can't really blame them, the rhetoric around crypto was enough to sour most non techies (in my little corner of lefty politics anyway) against the idea that any tech advancement is noteworthy. One of the first more serious individuals in politics to hear me out did so because "i sounded like one of the early linux proselytizers" lol.
Completely agree about how time has slowed. I rotate between absolute giddy anticipation of our future thanks to the tech and nihilistic doomerism. Even as a hobbyist, though, I knew to take this seriously since I saw Robert Miles talk about GPT-2 in 2017(?), and I note there's zero sign of these things plateauing in ability simply from ramping up parameter count.
I've gone on long enough, but that live stream felt like the intro to a sci-fi movie at points. Can't wait to have multimodal and plugins rolled out.
Yes! ChatGPT is very useful at answering a lot of syntax related programming questions and GPT-4 can do decent codegen for simple things.
I expect that in the next 5yrs developer workflows will completely change based on all the LLM stuff.
I think it's always difficult to tell if new tech is just hype or will have real impact, but it really feels to me like LLMs will have real impact. Maybe not as much as they are being hyped, but definitely legit impact. There's a possibility of even greater impact than the hype as well.
Try OpenAI services in Azure. We were added to a waitlist but got approved a week later. Had 32k for a few weeks now but still on the waitlist for plugins.
> I feel like this just killed a few small startups who were trying to offer more context.
Those startups killed themselves. A 32K context was advertised as a feature to be rolled out the same day GPT-4 came out.
Also - what startups are getting even remotely close to 32K context at GPT-4’s parameter count? All I’ve seen is attempts to use KNN over a database to artificially improve long term recall.
Depends on the use case. Performance quickly tanks when you get to high token count; it's a slowdown I believe the various summarizers/context extenders mostly avoid.
(Also, the UI probably tanks too. I dread what the OpenAI Playground will do when you start actually using the 32k model for real, like throwing a 15k-token prompt at it. The ChatGPT UI has no chance.)
Does anyone have any examples of prompting to feed such a large amount of tokens? For example, would you use something like “I am going to send you an entire codebase, with the filename and path, followed by the file content. Here is the first of 239 files: …”
It works really well: you can tell it to implement new features or mutate parts of the code, and having the entire codebase (or a lot of it) in its context really improves the output.
The biggest caveat: shit is expensive! A full 32k token request will run you like $2, if you do dialog back and forth you can rack up quite the bill quickly.
If it were 10x cheaper, I would use nothing else; having a large context window is that much of a game changer. As it stands, I _very_ carefully construct the prompt and move the conversation out of the 32k into the 8k model as fast as I can to save cost.
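For the curious, assembling such a prompt is mostly just concatenation with a token budget; a rough sketch (the glob, budget and wording are arbitrary choices):

    import pathlib
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    BUDGET = 30_000  # leave headroom for the question and the reply

    parts = ["I will send you a codebase as a series of files, each preceded by its path.\n"]
    used = len(enc.encode(parts[0]))

    for path in sorted(pathlib.Path("my_project").rglob("*.rs")):  # placeholder project/glob
        chunk = f"\n--- {path} ---\n{path.read_text()}\n"
        n = len(enc.encode(chunk))
        if used + n > BUDGET:
            break  # stop before blowing the context window
        parts.append(chunk)
        used += n

    prompt = "".join(parts) + "\nNow implement the feature described below.\n"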
How does it calculate the price? I thought that once you load the content (a 32k-token request / $2), it would remember the context so you could ask questions much more cheaply.
It does not have memory outside the context window. If you want to have a back-and-forth with it about a document, that document must be provided in the context (along with your other relevant chat history) with every request.
This is why it's so easy to burn up lots of tokens very fast.
I already do this with the current context limits. I include a few of my relevant source files before my prompt in ChatGPT. It works unreasonably well.
Something like the following:

    Here is the Template class:
    …
    Here is an example component:
    …
    Here is an example Input element:
    …
    I need to create another input element that allows me to select a number from a drop down between 1/32 and 24/32 in 1/32 increments
You could see the legal impact of your actions before you take them. You could template out an operating system and have it fill in the blanks. You could rewrite entire literary works, in the author's style, to cater to your reading style or story preferences.
If I wanted to have a conversation about it, and you wanted to charge me a flat fee per utterance on the basis that you had to reread the text anew every time, I wouldn't be paying you at all.
If we were having such a conversation via e-mail/IM and I learned that you're just asking me questions one by one in your replies, questions which you could've easily included in your first e-mail, then believe me when I say I would charge you the same way OpenAI does, and I'd throw in an extra 50% fee for being inconsiderate and not knowing how to communicate effectively.
Yeah, I can see this being useful for one-off queries, but don't they want to offer some sort of final training ("last-mile", I called it in another comment; I can't remember what the proper term is) to companies to customize the model so it already has all the context they need baked into every query?
They used to offer exactly this for fine-tuning models. They never offered it after ChatGPT; I think the difficulty comes with fine-tuning RLHF models, and it's not obvious how to do this correctly.
It's unfortunate. There are some online tutorials that instruct you to embed all your code and perform top-k cosine similarity searches, populating the responses accordingly.
It's quite interesting if you can tweak your search just right. You can even get away with fewer than 8K tokens!
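The usual shape of that trick, sketched with the openai embeddings endpoint and numpy (the chunking, embedding model and k are all choices you would tune; the chunks here are placeholders):

    import numpy as np
    import openai

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    chunks = ["...code chunk 1...", "...code chunk 2...", "...code chunk 3..."]  # placeholders
    chunk_vecs = embed(chunks)

    def top_k(question, k=3):
        q = embed([question])[0]
        # cosine similarity = dot product of the L2-normalized vectors
        sims = (chunk_vecs @ q) / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

You then stuff the top-k chunks, rather than the whole document, into the prompt.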
I think he's talking about computational efficiency. If you're loading in 29k tokens and you're expecting to use those again, you wouldn't need to do the whole matrix multiplication song and dance again if you just kept the old buffers around for the next prompt.
What's your usage and stated use case? I got access for my company account, but I'm pretty sure that's because we've built and shipped product using their API.
I applied for personal use, I stated that I'd like to experiment with its coding abilities. Yeah it seems that they prioritized companies making GPT-4 products first.
I joined the GPT4 waitlist 2 or 3 days after it was released (around mid-march) and finally got access last week. I also applied for personal use and wrote one or two sentences about wanting to experiment / compare it to other models. So they definitely do give the API access to regular folks as well, no idea how they prioritize it though. I've been a paying customer of ChatGPT plus for three months now which might have helped.
I'd just like to see GPT-4 more widely available, even on the free ChatGPT, although I wonder if that will ever fully happen, with ChatGPT getting so much use and GPT-3.5 being cheaper to run.
Plus seems expensive to me and it is still rate limited quite a lot.
I guess it's going to take further optimisation to make it worthwhile for OpenAI.
That example is really shockingly good, indeed. I'm not always convinced that GPTs can be properly artistic; most things lack soul and read like a rambling Dan Brown on amphetamines... but this DFW piece works very well.
It gave me the same vague feeling of annoyance and disgust at the "look how smart I am" linguistic obstreperousness I get when reading the real deal.
Does this purely affect the amount of tokens that can be fed in and retained in context during a session?
The output from that prompt seems spectacular, so I'm wondering if there are any other differences.
I just tried the same prompt with GPT-4 and the style was much more GPT-like, what I'm used to, not near the same quality as in the OP, although maybe it's just luck?
There's the token limit: the maximum number of tokens in the response.
There's also the token context: how many tokens into the "past" it considers when formulating the next token.
They're different things. You can generate very, very long responses with a model with a short context window; it will just have amnesia about what it said earlier, though OpenAI often seems to restrict you / prevent you from having the context scroll off in this way.
You can just use the API: if you set the completion length to 0, it will return the token count. Then you can just remove the oldest message until you're under whatever number you like. I picked 3k to allow 1k for the reply.
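An alternative that skips the extra API round-trip is counting locally with tiktoken and dropping the oldest messages until the history fits; the per-message overhead constant below is an approximation, not an official figure:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(messages):
        # ~4 tokens of role/formatting overhead per message is a rough estimate
        return sum(len(enc.encode(m["content"])) + 4 for m in messages)

    def trim(messages, budget=3000):
        # keep the system message (index 0), drop the oldest user/assistant turns
        while count_tokens(messages) > budget and len(messages) > 1:
            del messages[1]
        return messages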
No, they would call that out specifically in the model name. It’s just a further snapshot so you don’t have to jump straight to the next finetuned version without testing your app.
Yep, if you sign up via Azure OpenAI Service, you might get access sooner. Same exact API, just served directly through Azure, and likely to be maintained for longer.
This is completely separate from LoRA. This is how much stuff you can give it in the prompt. You can now give it whole chapters of books to summarize, for example.
LoRA is for adapting the model to a certain task or domain. It usually means you only need to give it shorter prompts, but for book summarization it wouldn't help.
LoRA probably does not affect the model's biggest bottleneck: the attention mechanism. The original transformer's attention is O(n^2 * d), where n is the sequence length and d is the representation dimension.