For anyone wondering about the architecture: this is a 1.4 million parameter transformer model with both an encoder and a decoder. The vocabulary size is comically small; it only understands 8 words.
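For a sense of scale, here's a rough PyTorch sketch of how small a model in that range is. The layer dimensions are my own guesses, picked only to land near ~1.4M parameters; the paper's actual configuration may differ.

    import torch.nn as nn

    # Hypothetical dimensions, not taken from the paper.
    model = nn.Transformer(
        d_model=128,            # embedding width
        nhead=4,                # attention heads
        num_encoder_layers=3,
        num_decoder_layers=3,
        dim_feedforward=512,
    )
    embedding = nn.Embedding(8, 128)  # the entire 8-word vocabulary

    n_params = sum(p.numel() for p in model.parameters()) + embedding.weight.numel()
    print(f"{n_params:,} parameters")  # roughly 1.4M with these settings

The 8-entry embedding table contributes almost nothing; essentially all of the parameters sit in the attention and feed-forward layers.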
It learns new ideas with very few examples, to the extent you can express new ideas with a vocabulary of eight words.
English speakers tend to have a vocabulary of 30k words or so, depending on what exactly you measure.
GPT-4 reportedly has around 1.7 trillion parameters.
Of course scaling up is quite unpredictable in what capabilities it gets you, but it's not that much of a stretch to imagine that a GPT-4 sized model would have a reasonably sized vocabulary. Certainly worth testing if you have the resources to train such a thing!
It's the parameters-to-vocabulary mapping claim embedded in your response that needs validation. 1.7 trillion parameters means what, exactly?
Let's start with working vocabulary. Working vocabulary doesn't just mean knowing n words; it means putting those n words together in combinatorially many valid ways to construct sentences. And 30k, btw, is an insane working vocabulary. Most people know 1000 words on average in English. All of their sentences are structured from those 1000 words. This is true for most languages, with exceptions like Mandarin or German, where basic words can be assembled into more complex words.
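To put a number on the combinatorics (a toy calculation, nothing more): even restricted to ordered 10-word strings, a 1000-word vocabulary yields an astronomically large space, of which grammatical sentences are only a tiny slice.

    from math import log10

    vocab = 1000   # the working vocabulary claimed above
    length = 10    # ordered strings of ten words

    strings = vocab ** length
    print(f"about 10^{log10(strings):.0f} possible strings")  # 10^30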
Certainly GPT-4 knows something. Presumably that something can be mapped to a working vocabulary. How large that vocabulary is requires a testable, reproducible hypothesis supported by experimental proof. Do you have such a hypothesis with proof? Does anyone? Until we do, it's just a guess.
> Most people know 1000 words on average in English
Maybe I misunderstand, but that sounds so stupid and wrong I don't know where to start. Standard vocab estimates are 15k-30k, averaging about 20k (these are from memory).
As a result, estimates vary from 10,000-17,000 word families[16][19] or 17,000-42,000 dictionary words for young adult native speakers of English.[12][17]
A 2016 study shows that 20-year-old English native speakers recognize on average 42,000 lemmas, ranging from 27,100 for the lowest 5% of the population to 51,700 lemmas for the highest 5%. These lemmas come from 6,100 word families in the lowest 5% of the population and 14,900 word families in the highest 5%. 60-year-olds know on average 6,000 lemmas more.[12]
According to another, earlier 1995 study, junior-high students would be able to recognize the meanings of about 10,000–12,000 words, whereas for college students this number grows to about 12,000–17,000 and for elderly adults up to about 17,000 or more.[20]
Does the average include people who don't speak English? If about 4% of the world's population are native speakers and the number of words known tails off after that, I can imagine it could be approximately true. And maybe we are counting babies in English-majority countries as native speakers even though they haven't learned all their 20k words yet. Of course GP's point is still invalid in that case.
100% of people know the most common 1000 words. Those who know more words fall onto a consistent curve across languages that follows Zipf's law. That is different from "most people know 1000 words on average."
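To make that concrete with a back-of-the-envelope model (assuming an idealized Zipf distribution with exponent 1 over a 30,000-word lexicon, which real corpora only roughly follow): the most common 1000 words cover most running text, which is very different from being most of what people know.

    # Idealized Zipf model: the word at rank r appears with frequency ~ 1/r,
    # so the top-k words cover H(k)/H(N) of the text (H = harmonic number).
    def harmonic(n: int) -> float:
        return sum(1.0 / r for r in range(1, n + 1))

    N = 30_000   # assumed lexicon size
    k = 1_000    # most common words

    coverage = harmonic(k) / harmonic(N)
    print(f"top {k} words cover ~{coverage:.0%} of running text")  # about 69% under this idealized model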
I don't care to pay for access to GPT-4, but one could easily use one of the vocabulary estimation tests, which use some statistics plus knowledge of word appearance frequency, to estimate its vocabulary size. https://mikeinnes.io/2022/02/26/vocab is one such test, which explains the statistical ideas, and there are many others based on similar principles.
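For what it's worth, the core of those tests fits in a few lines. This is my own simplified sketch, not the linked test's actual method: `ranked_words` is assumed to be a frequency-ranked word list and `knows_word` a hypothetical yes/no oracle (a human test-taker, or a prompt sent to the model).

    import random

    def estimate_vocab(ranked_words, knows_word, band_size=1000, samples_per_band=20):
        """Stratified estimate: sample a few words from each frequency band,
        measure the fraction recognized, and sum band_size * fraction."""
        total = 0.0
        for start in range(0, len(ranked_words), band_size):
            band = ranked_words[start:start + band_size]
            sample = random.sample(band, min(samples_per_band, len(band)))
            known = sum(knows_word(w) for w in sample)
            total += len(band) * known / len(sample)
        return round(total)

The linked write-up goes further with the statistics, but the frequency-plus-sampling principle is the same.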
I think a fairer comparison would be Toki Pona, a micro-language with ~120 words. You can express lots of things if you have a great deal of patience, Up-Goer Five style.
In logic there's also SKI combinator calculus, which is Turing-complete with three symbols; unlike the Morse or A-Z examples, but like the Toki Pona example, each symbol has a semantic meaning.
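As a toy illustration (curried Python lambdas, obviously not how you'd do anything practical with it), the three combinators are just higher-order functions, and I is even derivable from S and K alone:

    # S, K, I as curried functions.
    S = lambda x: lambda y: lambda z: x(z)(y(z))
    K = lambda x: lambda y: x
    I = lambda x: x

    assert K("a")("b") == "a"   # K discards its second argument
    assert I("a") == "a"

    # S K K behaves like the identity, so two symbols already suffice.
    assert S(K)(K)("a") == "a"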
If you just want to describe ideas in an abstract realm like sequences of colors, as in this paper, it's not surprising that you don't need many words.
There was a post on here a few months ago about training with single characters as tokens instead of words, and it worked really well: the model was able to create new Shakespeare-like text despite not using human words as tokens. What a (human) word is can be learned by the model instead of being baked into the token set.
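The tokenization side of that is tiny. Here's the generic recipe (not the post's actual code), with `shakespeare.txt` standing in for whatever corpus you'd train on:

    text = open("shakespeare.txt").read()  # any training corpus

    # Character-level vocabulary: every distinct character gets an id.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)

    print(len(chars), "tokens")                 # typically well under 100
    assert decode(encode("To be")) == "To be"

The model then has to learn from scratch which character sequences form words, which is exactly the point.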