It looks like GPT-4-32k is rolling out (community.openai.com)
261 points by freediver on May 6, 2023 | 188 comments



> A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words) - https://platform.openai.com/tokenizer

32,000 (tokens) * 4 = 128,000 (characters)

> While a general guideline is one page is 500 words (single spaced) or 250 words (double spaced), this is a ballpark figure - https://wordcounter.net/words-per-page

Assuming (on average) one word = 5 letters, context ends up being ~50 pages (128000 / (500 * 5)).

Just to put the "32k tokens" figure into some roughly estimated context.


32k tokens: 3/4 of 32k is 24k words, and the average page is 500 (0.5k) words, so that's basically 24k / 0.5k = 48, i.e. ~48 pages.
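
For what it's worth, a quick back-of-envelope in Python reproducing both estimates (the constants are just the rules of thumb quoted above):

    TOKENS = 32_000
    WORDS_PER_TOKEN = 0.75      # OpenAI rule of thumb
    CHARS_PER_TOKEN = 4         # OpenAI rule of thumb
    CHARS_PER_WORD = 5          # rough average
    WORDS_PER_PAGE = 500        # single spaced

    print(TOKENS * WORDS_PER_TOKEN / WORDS_PER_PAGE)                   # 48.0 pages
    print(TOKENS * CHARS_PER_TOKEN / CHARS_PER_WORD / WORDS_PER_PAGE)  # 51.2 pages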


That's great. Imagine if more researchers were as excited to reproduce results as you are; the world would be much better.


This comes across as patronizing to me. "Who's a good little verifier. You are!"


That's no good, wasn't my point if so. Luckily it seems like you're in the minority so far, so at least I managed to get the right feeling across to some people.


Forgive my ignorance, but what is the relevance of this to the above comment?


We both arrived at mostly the same result from the same question of "how many pages of text would 32k tokens be?"; they basically did the calculation again, albeit in a slightly different way. Just like when researchers try to reproduce the results of others' studies.


Probably off topic, but part of this is "good luck writing a grant proposal that says I want to reproduce the work of this group and getting it accepted", unless of course you are claiming a groundbreaking paradigm shift like evidence of a unified theory or superconductivity at room temperature.


This is a better calculator, because it also supports GPT-4:

https://tiktokenizer.vercel.app/

Not sure why they don't support GPT-4 on their own website.


At $0.60 for 20k prompt tokens, it's not going to be cheap so they will need to bring the price down to get broader adoption.

As far as I can tell, the initial reading in of the document (let's say a 20k token one), will be a repeated cost for each subsequent query over the document. If I have a 20k token document, and ask 10 follow-up prompts consisting of 100 tokens, that would take me to a total spend of 20k * 10 + (10 * 11)/2 * 100 = 205,500 prompt tokens, or over $6. This does not include completion tokens or the response history which would edge us closer to $8 for the chat session.
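
A minimal sketch of that arithmetic, assuming the $0.03/1k prompt-token rate implied by the $0.60/20k figure above and ignoring completion tokens:

    DOC_TOKENS = 20_000
    FOLLOW_UP_TOKENS = 100
    PROMPT_RATE = 0.03 / 1000   # dollars per prompt token (assumed rate)

    total = 0
    for i in range(1, 11):                            # 10 follow-up questions
        total += DOC_TOKENS + i * FOLLOW_UP_TOKENS    # document re-sent plus all follow-ups so far

    print(total)                          # 205500 prompt tokens
    print(round(total * PROMPT_RATE, 2))  # ~6.17 dollars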


What I’ve read is that people let a kind of meta-chat run alongside the client interaction. The meta channel decides what parts of the history to retain so the primary channel doesn’t use as many resources. You could let GPT decide when it needs to see the whole context again, etc. There are a lot of interesting ways to let GPT handle its own scope, I think.


Is this anything like what Langchain supports, which is various strategies for chat history compression (summarisation)?

https://www.pinecone.io/learn/langchain-conversational-memor...

But aside from this, if I have a large document that I want to "chat" with, it looks like I am either chunking it and then selectively retrieving relevant subsections at question time, or I am naively dumping the whole document (that now fits in 32k) and then doing the chat, at a high cost. So 32k (and increased context size in general) does not look to be a huge gamechanger in patterns of use until cost comes down by an order of magnitude or two.


Yes, that’s one example. Relatively simple to implement yourself. Any frontend for chatgpt already does the basic thing, which is to pass the previous messages along with the prompt.

I think we may end up with first and second stage completions, where the first stage prepares the context for the second stage. The first stage can be a (tailored) gpt3.5 and the second stage can do the brainy work. That way you can actively control costs by making the first stage forward a context of a given maximum size.
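
A minimal sketch of that two-stage idea, assuming the openai Python library; the model names, prompts, and function names here are illustrative, not anything OpenAI ships:

    import openai  # assumes OPENAI_API_KEY is set in the environment

    def condense(history, question):
        # Stage 1: a cheap model boils the history down to only what the question needs.
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Summarize only the parts of this chat history "
                                              "relevant to the user's next question."},
                {"role": "user", "content": f"History:\n{history}\n\nNext question: {question}"},
            ],
        )
        return resp["choices"][0]["message"]["content"]

    def answer(condensed, question):
        # Stage 2: the expensive model does the brainy work on a bounded context.
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": f"Relevant context:\n{condensed}"},
                {"role": "user", "content": question},
            ],
        )
        return resp["choices"][0]["message"]["content"]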


Is this a native feature, or something homebrewed? If the latter, how do you use it?


Right now all of this is people tinkering with the API. If you look at the docs you will note that it doesn’t even provide chat history or any kind of session. You have to pass all context yourself. So you’re already homebrewing that, why not add some spice.


I forget if Langchain can do that, but something along those lines will exist if it doesn't already; it's too obvious not to, and too important for the free ChatGPT equivalents that will be popping up over a matter of days/weeks now that truly free versions of LLaMA are coming out.

TL;DR the puck should get there very soon, not just Really Soon Now.


> At $0.60 for 20k prompt tokens

Is it $0.60 even for the input tokens?


Yes, it's $0.03/1K for prompt and $0.06/1K for response, which is $0.60/20K for prompt and $1.20/20K for response.


If it’s not any faster, I’m thinking that how long you’re willing to wait for an answer will be the practical bottleneck?

38 seconds for that example.


I wonder how many LoC that is on average. Is there an average for LoC? It’s probably based on the language…


Testing some random rust code, it's about 15-20 tokens per LoC, so about 1500-2000 LoC in a 32k context.

Interestingly, using 2-space indentation, as soon as you are about 3-4 indentation levels deep you spend as many tokens on indentation as on the actual code. For example, "log::LevelFilter::Info" is 6 tokens, same as 6 consecutive spaces. There are probably a lot of easy gains here reformatting your code to use longer lines or maybe no indentation at all.


Make sure you are testing with the tiktokenizer: https://news.ycombinator.com/item?id=35460648


Ah, good catch, it's actually closer to 8 tokens per LoC with GPT-4's tiktoken tokenizer, so about twice as good. Some quick testing suggests that's mostly down to better whitespace handling.


> There are probably a lot of easy gains here reformatting your code to use longer lines or maybe no indentation at all.

Using tabs for indentation needs fewer tokens than using multiple spaces for indentation.
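
If you want to check this on your own code, here is a small sketch using the tiktoken library (the snippet being measured is made up for illustration):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")   # cl100k_base

    spaces = "        if ready {\n            log::info!(\"ok\");\n        }\n"
    tabs = spaces.replace("    ", "\t")          # same code, tab-indented

    print(len(enc.encode(spaces)), "tokens with spaces")
    print(len(enc.encode(tabs)), "tokens with tabs")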


Curious how this point in the tabs vs. spaces debate re-emerges



How does this even work if the code also uses external libraries?


A LoC is ~7 tokens thanks to the new tiktoken tokenization in GPT-4.

32k is ~4.5k LoC


0.75 * 32,000 = 24,000 words is faster and more direct


Thanks, math was never my strong suit and I was writing the comment as I was calculating towards the result, raw and never refined, as it should be :)


Unfortunately, in CJK languages, it averages to slightly more than one token per character.


I suppose we now have a fairly objective way of measuring and comparing the semantic density of natural languages.


if you wanted to send 32k tokens of code, are you able to do that using a model with a 4k context limit by spreading those tokens out across multiple messages? or does it not work that way?


Not really. The API is stateless, you pass it in a whole conversation and it responds with the next message. The entire conversation including its response is limited to 32k tokens.


I'm just confused because I thought I remembered sending long chunks of code using the API; the request would fail, but then I would split it up and it would work okay.

I guess I'm running into a different limit (not context length), or maybe I'm misremembering.


The context limit is for request + response, and there is no storage in between requests (ongoing chat interactions are done by adding prior interactions to the prompt, so the whole chat – before things start falling out of history – is limited to the context window.)
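
A minimal sketch of what that looks like in practice with the openai Python library (message contents are illustrative):

    import openai

    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    def chat(user_input):
        # No server-side session: the full history is re-sent on every call, and
        # history + reply together must fit in the model's context window.
        messages.append({"role": "user", "content": user_input})
        resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        reply = resp["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": reply})
        return reply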


I feel like this just killed a few small startups who were trying to offer more context.

Also, I pay for ChatGPT but I have none of the new features except for GPT4. Very frustrating.


Same here: https://twitter.com/arbuge/status/1654288169397805057

The really odd thing is that I was given GPT-4 with browsing alpha enabled - for a single session last week.

As soon as I reloaded the page, it was gone. Since then the picture has reverted to the above.

Twitter has become a bit painful to read these days, with all the AI influencers posting about what GPT-4, plugins, code interpreter, etc. can do.


I was waiting for a while, but then I found there was a page where, if you had selected "I want to build plugins", you would never have seen the option to request access to use them.

Once I filled that in I got access within a few days.

https://openai.com/waitlist/plugins

If you were the person who said "I am a developer and want to build a plugin", then it is likely you missed the option to request which plugins you want access to.


OK, thanks for the pointer. I submitted there for every plugin now as a non-developer to see if that helps.

I stopped paying for the pro subscription because without plugins it didn't do that much, tbh.


Yeah. It's weird, I signed up when it came out and... nothing. :(


32k context is $1.92 for each request.


For reference, a dev making USD 100k/year and working about 240 days a year, 8 hours/day = total of 1920 hours, or about USD 52/hour, USD 416/day

52 / 1.92 ≈ 27; 416 / 1.92 ≈ 217

So using GPT-4 with 32k tokens, 27 times per hour, or 217 times per day, in terms of cost, is approximately the equivalent of another dev


FYI, 27 times per hour is basically nothing. With GPT4 over the API, I make 2-3 completion requests a minute, for 30-60 minutes at a time, when building an LLM app. This happens for 3-4 hours per day.

At the upper bound, this would be $2 * 3 * 60 * 4 = $1440 a day.

Thankfully, I am using retriever-augmentation and context stuffing into the base 4k model, so costs are manageable.

The 32k context model cannot be deployed into a production app at this pricing as a more capable drop-in replacement for shorter-context models.


Depends heavily on your product. I can imagine there are quite a lot of use cases that have relatively infrequent API usage or highly cacheable responses.


> retriever-augmentation and context stuffing

Care to elaborate? This sounds very interesting & useful. Just anything about the setup and implementation would be super helpful.



That's a lot of requests.

Not that it matters for the calculation, but I wonder how long such a request (ingesting 32k tokens and responding with a similar amount) would take.

At the speed of regular ChatGPT that would take a good while.


Prompt processing scales quadratically with the context size (assuming OpenAI is still using a standard transformer architecture), but processing the prompt is also fast compared to generating tokens because it's done in parallel. So I wouldn't expect effective response times to go up quadratically; at most linearly, depending on the details of how they implement inference.


Is it prorated for the actual context used for each request?


Yes, it's not a fixed price: https://openai.com/pricing.


Yes


It's exceedingly expensive and must surely come down over time.


Those startups will move on to open source models because OpenAI api calls with 32k token contexts are way too expensive.


What is the latest in conversational models that offer GPT-3-like (or close) performance when running things locally?


Apparently Vicuna 13B is quite good according to Google's own leaked docs.

https://twitter.com/jelleprins/status/1654197282311491592


That's according to this promotional blog post (https://lmsys.org/blog/2023-03-30-vicuna/), just cited by the Google memo, right? Which isn't really even a doc, just a memo that was circulating inside Google.

I also find it strange they don't contrast GPT-4 and GPT-3.5.


This assessment is based largely on GPT-4 evaluation of the output. In actual use, Vicuna-13B isn't even as good as GPT-3.5, although I do have high hopes for 30B if and when they decide to make that available (or someone else trains it, since the dataset is out).

And don't forget that all the LLaMA-based models only have 2K context size. It's good enough for random chat, but you quickly bump into it for any sort of complicated task solving or writing code. Increasing this to 4K - like GPT-3.5 has - would require significantly more RAM for the same model size.


Is there a way to always stay up to date with the latest and best performing models? Perhaps it's me but I find it difficult to navigate HuggingFace and find models sorted by benchmark.


Honestly, I just read hackernews :).


HN posts are not always in chronological order.


I didn't say it was the best way, just the way I'm doing it right now :).


I check r/LocalLlama


GPT-3 is dated, so many open source models are competitive with it, but Vicuna 13B is supposed to be competitive with GPT-4.


Against GPT-3.5 perhaps the gaps aren't too big for your use cases, but I wouldn't say it's in the GPT-4 league. It looks close in the benchmarks, but the difference in quality feels (to me) huge in practice. The other models are simply a lot worse.


Interesting. Have you tried StableVicuna?


No, is it worth a try? I didn’t see a lot of hype about it so I didn’t try it.


I don't think it's expensive at all. For things that don't need to be so correct (like, unfortunately, marketing blog posts) it's a <$1 per post generator, which is very cheap to me.

For things where correctness matters, the majority of cost will still come from humans who are in charge of ensuring correctness.


Even if it was around $0.10, this does not scale; it would need to be less than $0.01 per generation to keep up with open source models where the cost is effectively $0 (leaving out hardware). These open source models are still not replacing GPT-4, but they are moving into that territory.


Oh really. Then show me your "open source model" that handles 32k tokens on a consumer-grade PC. Actually don't show me, show the internet. You will be the most famous man in the tech world.


Well, surely I can't convince you; feel free to build the next AI startup on OpenAI then, and stop caring about any possible competition outscaling you once token limits on open source models become more in line with the walled gardens of Google, MS, and OpenAI and their high API pricing ;)


My bet is open source models (true open source, without strings attached) won't ever catch up to OpenAI etc. I'll be really surprised if there is one that can match GPT-4 in the next 2~3 years. If you tried LLaMA and StableLM you would probably feel the same.


Use cases for individual people are OK, but it's far too expensive to deploy into your SaaS where a large number of users will use it.


Considering that increasing context length is O(n^2), and that current 8k GPT-4 is already restricted to 25 prompts/3 hours, I think they will launch it at substantially higher pricing.
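
Back-of-envelope, under that naive O(n^2) assumption (real deployments may use attention optimizations, so treat it as an upper bound):

    # going from 8k to 32k context: 4x the tokens, ~16x the attention compute
    print((32_000 / 8_000) ** 2)   # 16.0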


> current 8k GPT-4 is already restricted to 25 prompts/3 hours

I'm pretty sure they're using a 4k GPT-4 model for ChatGPT Plus, even though they only announced 8k and 32k... It can't handle more than 4k of tokens (actually a little below that, starts ignoring your last few sentences if you get close). If you check developer tools, the request to an API /models endpoint says the limit for GPT-4 is 4096. It's very unfortunate.


Ah this explains a lot. I couldn't understand why I couldn't get close to the ~12 pages that everyone was saying 8,000 tokens implied.


As far as I know it's not documented anywhere and there is no way to ask the team at ChatGPT questions. I sent them an email about it a few days after GPT-4 release and still haven't received a reply.

Another thing that annoys me is how most updates don't get a changelog entry. For whatever reason, they keep little secrets like that.


Their PR is terrible and I get the impression that they wish their own users would “just go away”.

Every time I see a company act like this, more responsive and truly open competition eventually eats their lunch.


The raw chat log has the system message on top, plus "user:" and "assistant:" for each message, and im_start/im_end tokens to separate messages, hence why the visible chat context is slightly under 4k.



Your second link has the immediate comment "Gpt3 includes dense attention layers that are n^2". So it's not at all unlikely.


GPT3 was released 3 years ago now. There have been major advancements in scaling attention so it would be strange if they didn't use some of them


It doesn't matter how many major advancements they made in scaling, as long as one component is O(n^2) or higher.


It's not the scale itself, it's the scaling architecture.


The same applies.


It will be interesting to see how far this quadratic algorithm carries in practice. Even the longest documents can only have hundreds of thousands of tokens, right?


Ideally you'd be able to put your entire codebase + documentation + jira tickets + etc. into the context. I think there is no practical limit to how many tokens would be useful for users, so the limits imposed by the model (either hard limits or just pricing) will always be a bottleneck.


I'm confused by this. Would you want to just include your codebase, documentation, etc. in some last-mile training? That way you don't need the expense of including huge amounts of context in every query. It's baked in.


I haven't tried this myself, but it is my understanding that finetuning does not work well in practice as a way of acquiring new knowledge.

There may be a middle ground between these two approaches though. If every query used the same prompt prefix (because you only update the codebase + docs occasionally) then you could put it into the model once and cache the keys and values from the attention heads. I wonder if OpenAI does this with whatever prefix they use for ChatGPT?
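
Whether OpenAI does anything like that server-side is unknown, but the mechanism itself is easy to demo locally with a small model via Hugging Face transformers; a rough sketch (the prefix text is illustrative):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Compute the attention keys/values for the shared prefix once...
    prefix_ids = tok("Shared codebase and docs reused by every query ...", return_tensors="pt").input_ids
    with torch.no_grad():
        cached = model(prefix_ids, use_cache=True).past_key_values

    # ...then reuse them for each new query without re-processing the prefix.
    query_ids = tok(" Question: what does module X do?", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(query_ids, past_key_values=cached, use_cache=True)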


Yeah, there are really three options here... throw everything in context, fine-tune, or add external search a la RETRO.

The latter is definitely the cheapest option; updates are trivial.


Yah... we really need some kind of architecture that juggles concept vectors around to external storage and does similarity search, etc, instead of forcing us to encode everything into giant tangles of coefficients.

GPT-4 seems to show that linear algebra definitely can do the job, but training is so expensive and the model gets so huge and inflexible.

It seems like having fixed format vectors of knowledge that the model can use-- denser and more precise than just incorporating tool results as tokens like OpenAI's plugin approach-- is a path forward towards extensibility and online learning.


Some of the context length will be lost to waste spent on truncated posts, or are replies not considered part of the context on ChatGPT? In both cases, it might be worth designing a prompt, every so often, to get a reply with which to re-establish the context, thus compressing it.


It’s been available on Azure in preview. Pricing is double the 8K model.


MosaicML StoryWriter 65K model just released a day or two ago.


https://www.mosaicml.com/blog/mpt-7b 65k+ context window, open source, open weights


Same. No plugins or GPT-4 API for me despite signing up for the waiting lists on the day they were announced.


Have you been using the API with GPT-3.5? I wonder if they're prioritizing access to 'active' users who appear to be trying to make something with it, over casual looky-loos.


Paying for ChatGPT, I believe, is separate from API access.


It is. For API access you have to create an account at https://platform.openai.com. You pay per 1k token. For API access to GPT-4 put your organization (org id) on the waitlist.


Finally got gpt4 api access. Now I can cancel my ChatGPT plus sub and save a bunch of cash by just using a local client.


Again, frustrating. I’m an antibiotics researcher with oodles of data and I need ChatGPT plugins/API to make any real progress. (I’m kind of in this intellectual space on my own, so other people can’t really help that much) I’m not sure why I’ve been on the waiting list for so long now.


I got access to ChatGPT plugins and they're really bad, completely deserving of "alpha". I'd be pissed if I paid $25 for this, FYI.

It’s very slow, almost 10X slower than ChatGPT

Its integration is bad. For most plugins it doesn't do anything smart with its API call. For example, if I ask "Nearest cheap international flight", it literally goes to Kayak and searches "Nearest Cheap International Flight"; if Kayak can't handle that query, GPT can't either.

The only plug-in with good integration is Wolfram, and it makes so many syntax errors calling Wolfram that it's trash. It often just errors out on syntax for half my queries.

I wouldn't have minded if they had spent a few more months internally testing plug-ins before rolling them out to me, seeing their current state. The annoying thing is the chat website automatically starts in plugins mode, which is borderline unusable. So every time I have to click on the drop-down and then choose ChatGPT or GPT-4.


Thanks for assuaging my FOMO a bit. I think one of the most frustrating parts is that everyone in my lab looks to me when they see this stuff on Twitter and all I can really do is shrug.


I use the API for anything I can't do with Bing Chat, but I've found Bing Chat to be quite useful.

For code, I use phind.com.

https://www.phind.com/tutorial


Dude, chill. Plugins are insanely new. Barely anyone has access to them. It just seems like they are widespread because they've been going viral.

The initial blog post was only just over a month ago, and it was announcing alpha access for a few users and developers:

> Today, we will begin extending plugin alpha access to users and developers from our waitlist. While we will initially prioritize a small number of developers and ChatGPT Plus users, we plan to roll out larger-scale access over time.

https://openai.com/blog/chatgpt-plugins

We are literally 1 month into the alpha of plugins.


I think part of the anxiety, at least for me, is how fast progress is being made too. It can begin to feel like the "LET ME IN" meme when you're watching all day the cool things those inside the magic shop can do, lol. Layman here btw, just looking to use it to automate some volunteer work I do. Thanks for this perspective on how new this stuff is.


I completely agree, I feel the same way as a dev. GPT-4 is not even 2 months old.

The developer livestream was on March 14th: https://www.youtube.com/live/outcGtbnMuQ?feature=share.

The time since GPT-4 already feels something like 6 months. So far I'm perpetually feeling behind.


Can't imagine trying to keep up as a dev. Any of these tools useful for you in practice yet?

I struggle to keep up, and all I need to do is understand developments well enough to simplify them into palatable morsels for my tech-skeptic colleagues in politics and non-profits.

Challenging because they have a form of technology PTSD. When they hear "new technology", NFTs of monkeys with 6-digit prices and Peter Thiel's yacht flash before their eyes and they see red.

And I can't really blame them; the rhetoric around crypto was enough to sour most non-techies (in my little corner of lefty politics anyway) against the idea that any tech advancement is noteworthy. One of the first more serious individuals in politics to hear me out did so because "I sounded like one of the early Linux proselytizers", lol.

Completely agree about how time has slowed. I rotate between absolute giddy anticipation at our future thanks to the tech and nihilistic doomerism. Even as a hobbyist, though, I knew to take this seriously since I saw Robert Miles talk about GPT-2 in 2017(?) and note there's zero sign of these things plateauing in ability simply by ramping up parameter count.

I've gone on long enough, but that livestream felt like the intro to a sci-fi movie at points. Can't wait to have multimodal and plugins rolled out.


Yes! ChatGPT is very useful at answering a lot of syntax related programming questions and GPT-4 can do decent codegen for simple things.

I expect that in the next 5yrs developer workflows will completely change based on all the LLM stuff.

I think it's always difficult to tell if new tech is just hype or will have real impact, but it really feels to me like LLMs will have real impact. Maybe not as much as they are being hyped, but definitely legit impact. There's a possibility of even greater impact than the hype as well.


I can’t believe it’s only been 1 month. It feels like 3-4 somehow.


Try OpenAI services in Azure. We were added to a waitlist but got approved a week later. Had 32k for a few weeks now but still on the waitlist for plugins.


> I feel like this just killed a few small startups who were trying to offer more context.

Those startups killed themselves. A 32K context was advertised as a feature to be rolled out the same day GPT-4 came out.

Also - what startups are getting even remotely close to 32K context at GPT-4’s parameter count? All I’ve seen is attempts to use KNN over a database to artificially improve long term recall.


Depends on the use case. Performance quickly tanks when you get to high token count; it's a slowdown I believe the various summarizers/context extenders mostly avoid.

(Also UI probably tanks too. I dread what the OpenAI Playground will do when you start actually using the 32k model for real, like throwing a 15k-token-long prompt at it. ChatGPT UI has no chance.)


It's hella expensive, so I think they are OK for now.

When they cut down the cost, then they should worry, yeah.


Honestly for the firms that would use it, for example finance or legal, it's very reasonable.


Does anyone have any examples of prompting to feed in such a large number of tokens? For example, would you use something like "I am going to send you an entire codebase, with the filename and path, followed by the file content. Here is the first of 239 files: …"


I've had access to the 32k model for a bit and I've been using this to collect and stuff codebases into the context: https://github.com/mpoon/gpt-repository-loader

It works really well; you can tell it to implement new features or mutate parts of the code, and having the entire (or a lot of the) code in its context really improves the output.

The biggest caveat: shit is expensive! A full 32k-token request will run you like $2; if you do dialog back and forth you can rack up quite a bill quickly. If it were 10x cheaper, I would use nothing else; having a large context window is that much of a game changer. As it stands, I _very_ carefully construct the prompt and move the conversation out of the 32k into the 8k model as fast as I can to save cost.


Do you use it for proprietary code, and if so, don't you feel weird about it?


Not weirder than using GitHub, or Outlook or Slack or whatever.


I wouldn't feel weird about it. The risks - someone stealing know-how / someone finding a security hole - are negligible.


How does it calculate the price? I thought that once you load the content (a 32k-token request / $2) it would remember the context so you could ask questions much more cheaply.


It does not have memory outside the context window. If you want to have a back-and-forth with it about a document, that document must be provided in the context (along with your other relevant chat history) with every request.

This is why it's so easy to burn up lots of tokens very fast.


Here's someone that passed a 23 page congressional document:

https://twitter.com/SullyOmarr/status/1654576774976770048


I already do this with the current context limits. I include a few of my relevant source files before my prompt in ChatGPT. It works unreasonably well.

Something like the following

Here is the Template class:

Here is an example component:

Here is an example Input element:

I need to create another input element that allows me to select a number from a drop down between 1/32 and 24/32 in 1/32 increments
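
A tiny sketch of assembling that kind of prompt programmatically (the file names here are hypothetical):

    from pathlib import Path

    files = ["Template.ts", "ExampleComponent.ts", "ExampleInput.ts"]   # hypothetical paths
    parts = [f"Here is {name}:\n\n{Path(name).read_text()}" for name in files]
    parts.append("I need to create another input element that allows me to select a number "
                 "from a drop-down between 1/32 and 24/32 in 1/32 increments.")
    prompt = "\n\n".join(parts)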


You could see the legal impact of your actions before you take them. You could template out an operating system and have it fill in the blanks. You could rewrite entire literary works, in the author's style, to cater to your reading style or story preferences.


I think what's more important is the end part of your prompt, which explains what was previously described and includes a request/question.


Sounds like a return to the days of download managers or file splitters :)


Almost $2 if you want to use the full context length (32 * $.06). Yikes.


:D


The more token capacity that's added the more wasteful it seems to have to use this statelessly. Is there any avoiding this?

Wondrous as this new tech is, it seems a bit much to be paying $2 a question in a conversation about a 32k-token text.


As a human, if you present me 32k tokens and I have to give you an answer, you would probably have to pay me more than $2.


If I wanted to have a conversation about it, and you wanted to charge me a flat fee per utterance on the basis that you had to reread the text anew every time, I wouldn't be paying you at all.


If we were having such a conversation via e-mail/IM and I learned that you were just asking me questions one by one in your replies, questions which you could've easily included in your first e-mail, then believe me, I would charge you the same way OpenAI does, and I'd throw in an extra 50% fee for being inconsiderate and not knowing how to communicate effectively.


> questions which you could've easily included in your first e-mail

That's not really how conversation/chat works is it?


Have you seen how lawyers bill for their time?


Yeah, I can see this being useful for one-off queries, but don't they want to offer some sort of final training ("last-mile" I called it in another comment. I can't remember what the proper term is.) to companies to customize the model so it already has all the context they need baked in to every query?


They used to offer exactly this for fine-tuning models, but never offered it after ChatGPT. I think the difficulty comes with fine-tuning RLHF models; it's not obvious how to do this correctly.



As far as I know it's not.


It's unfortunate. There are some online tutorials that instruct you to embed all your code and perform top-k cosine similarity searches, populating the responses accordingly.

It's quite interesting if you can tweak your search just right. You can even use fewer than 8K tokens!
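
Roughly, those tutorials boil down to something like this sketch (using the OpenAI embeddings endpoint; the chunks and query are placeholders):

    import numpy as np
    import openai

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    chunks = ["def parse(...): ...", "class Cache: ...", "README excerpt ..."]  # your code, pre-chunked
    chunk_vecs = embed(chunks)

    def top_k(question, k=2):
        q = embed([question])[0]
        sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

    # stuff only the best-matching chunks into the prompt instead of the whole codebase
    context = "\n\n".join(top_k("Where is the cache invalidated?"))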


The usage needs to be for high-value queries.

Using it on a simple conversation is not its intended purpose; that's like using a supercomputer to play Pong.


Handle the state on the application side...

It is like complaining that HTTP is limiting because it is stateless. Build state on top of it.


I think he's talking about computational efficiency. If you're loading in 29k tokens and you're expecting to use those again, you wouldn't need to do the whole matrix multiplication song and dance again if you just kept the old buffers around for the next prompt.


I don't think this can necessarily be optimized at least with how the models work right now


You can ask multiple/multipart questions.


Of course I still don't even have the basic GPT-4 after nearly two months of waiting.


You pay for Plus and don't have it? Maybe try canceling and subscribing on another account. I got it immediately after I subscribed.


No, I'm talking about the API waitlist. Still, I'm using https://nat.dev/ to access GPT-4 so I don't care that much anymore.


Bummer -- I got access after 2-3 weeks.

I haven't actually even hit the 8k limit yet, and even experimenting with 32k is pretty expensive, so I'm not sure what I'd do with it.


I signed up for API use as soon as it was released and just got access yesterday, so they're still rolling it out, I guess.


I tried to check out nat.dev, but it wants me to create an account to see anything at all. What is nat.dev, please?


This is a paid LLM playground site with many models including GPT-4 and Claude-v1.


I applied the first day and just got API access a few days ago.

It is strange they can roll out 32k for some, while not even having 8k for everyone yet.


What's your usage and stated use case? I got access for my company account, but I'm pretty sure that's because we've built and shipped product using their API.


I applied for personal use, I stated that I'd like to experiment with its coding abilities. Yeah it seems that they prioritized companies making GPT-4 products first.


I joined the GPT4 waitlist 2 or 3 days after it was released (around mid-march) and finally got access last week. I also applied for personal use and wrote one or two sentences about wanting to experiment / compare it to other models. So they definitely do give the API access to regular folks as well, no idea how they prioritize it though. I've been a paying customer of ChatGPT plus for three months now which might have helped.


Waiting for gpt4 turbo


Same .. gpt-4 is a very harsh trade off of time for products.


What is turbo in this context?


GPT-3.5 Turbo was like 10x cheaper than the comparable older model. GPT-4 is like 100x more expensive than 3.5 Turbo.


A model optimized for inference speed instead of raw accuracy.


GPT-4 but pruned down to GPT-3's or GPT-2's size.


I'd just like to see GPT4 more available, even on the free chatGPT, although I wonder if that will ever fully come with ChatGPT getting so much use and GPT 3.5 being cheaper to run.

Plus seems expensive to me and it is still rate limited quite a lot.

I guess it's going to take further optimisation to make it worthwhile for OpenAI.


Do the output limits change? If I give it an entire codebase, could it potentially spit out an entire codebase?

I’m wondering if this is a quick way to get to $5 each round trip (send and receive)


I love the prompt and yes it does very well at mimicking DFW. It’s kind of weirding me out in a Year of the Trial-Size Dove Bar kind of way.


That example is really shockingly good, indeed. I'm not always convinced that GPTs can be properly artistic; most things lack soul and read like some rambling Dan Brown on amphetamine... but this DFW works very well.

It gave me the same vague feeling of annoyance and disgust at the "look how smart I am" linguistic obstreperousness I get when reading the real deal.


The output is rather fantastic; it's mind-numbing. Would DFW think this is a solid example of prose?


Does this purely affect the amount of tokens that can be fed in and retained in context during a session?

The output from that prompt seems spectacular, so I'm wondering if there are any other differences.

I just tried the same prompt with GPT-4 and the style was much more GPT-like, what I'm used to, not near the same quality as in the OP, although maybe it's just luck?


The prompts in the example on that page are pretty short... hardly taking advantage of the longer context window.

I'm actually not sure if longer responses can be expected with the 32k vs 8k models. Anyone from OpenAI care to comment on that?


The token limit is for the whole conversation; system, assistant, and user messages all count.

Once the limit is hit it will stop, sometimes mid sentence.

I have a slack bot at work with a system document which is a little over 1k tokens, meaning there is around 3k tokens left for questions and replies.

The trick I am currently doing is to prune older messages to keep it under the limit.


There's the token limit-- the maximum number of tokens to respond.

There's also the token context: how many words into the "past" it considers when formulating the next word.

They're different things. You can generate very, very long responses with a model with a short context window; it just will have amnesia about what it said earlier-- though OpenAI often seems to restrict you / prevent you from having context scroll off in this way.


> Once the limit is hit it will stop, sometimes mid sentence.

Oh that’s why GPT does that in a long thread.

It makes sense in retrospect that it's the whole conversation, not the individual messages.

ChatGPT UX leaves much to be desired; there should be error messages that communicate this stuff better.


Can you please elaborate on how you prune?


You can just use the API: if you set completions to 0, it will return the token count. Then you can just remove the oldest message until it's under any number. I picked 3k to allow 1k for the reply.


You can also use the `tiktoken` python library:

     import tiktoken
     len(tiktoken.encoding_for_model("gpt-4").encode(contents))
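
Building on that, a rough sketch of the pruning trick mentioned above (the 3k budget is just the number the parent comment picked; per-message framing tokens are ignored):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")

    def count_tokens(messages):
        return sum(len(enc.encode(m["content"])) for m in messages)

    def prune(messages, budget=3000):
        # keep the system message, drop the oldest user/assistant messages until we fit
        pruned = list(messages)
        while count_tokens(pruned) > budget and len(pruned) > 1:
            pruned.pop(1)
        return pruned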


I've now got access to GPT-4-0314, anybody know the difference between that and the 32k model in this post?


Yours is a snapshot of the 8k-token-context model, taken on March 14th.


Does the May 3rd release use the 32k token model?


No, they would call that out specifically in the model name. It’s just a further snapshot so you don’t have to jump straight to the next finetuned version without testing your app.


8k vs 32k context.


I thought it was already available for a while on Microsoft Azure?


Yep, if you sign up via Azure OpenAI Service, you might get access sooner. Same exact API, just served directly through Azure, and likely to be maintained for longer.

Direct link to the relevant sign up form fwiw:

https://customervoice.microsoft.com/Pages/ResponsePage.aspx?...

We got access a few weeks back via Azure.


Will this be better than LoRA? 32k seems like a lot.


This is completely separate from LoRA. This is how much stuff you can give it in the prompt. You can now give it whole chapters of books to summarize, for example.

LoRA is for adapting the model to a certain task or domain. It usually means you need to give it shorter prompts, but for book summarization, it wouldn't help.
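
For contrast, a minimal sketch of what applying LoRA to a small open model looks like with the peft library (GPT-2 and the hyperparameters are just illustrative; you can't do this to GPT-4 itself):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                        target_modules=["c_attn"], task_type="CAUSAL_LM")
    model = get_peft_model(model, config)      # only the low-rank adapter weights will train
    model.print_trainable_parameters()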


LoRA probably does not affect the model's biggest bottleneck: the attention mechanism. The original transformer attention is O(n^2 d), where n is the sequence length and d is the dimension of the token representations.


LoRA doesn’t change the context length. Also you can’t run LoRAs on GPT-4. So those are not relevant to each other.


The distinction becomes less meaningful if you can hold large numbers of tokens in attention, doesn’t it?


I don't... think so. One of us is very confused.



