I don't know if I would call it a "language" (which to me implies grammar and some level of composability), but I suspect most people have these hidden subconscious languages too.
Think about the "bouba"/"kiki" effect [1], or how the spells in Harry Potter sound recognizably "spell-like" even though spells aren't a real thing. (In the latter case, it's because they're phonetically Latin-adjacent.)
For example, here are some nonsense terms:
* Swith'eil Aerveid
* Karflon B43
* Hooboo skramp
If I were to ask you all which is a dangerous gas, a sex act, and an Elven city, I suspect there would be fairly wide agreement. Inventing words that are evocative this way is a fundamental part of fiction writing, especially in fantasy, horror, and sci-fi. There's nothing profound going on here, it's just that language recognition is fuzzy and highly associative even at the level of individual phonemes.
Immediately after realizing which term is which, I can't help but try to use them all in a sentence together, like "Those damn kids in Swith'eil Aerveid just sit around all day hooboo skramping and smoking their Karflon B43", which I think makes it even more clear how different the words "feel" from each other.
> There was me. That is, Alex, and my three droogs. That is, Pete, Georgie and Dim. And we sat in the Korova Milk Bar trying to make up our rassoodocks what to do with the evening. The Korova Milk Bar sold milk plus. Milk plus vellocet or synthemesc or drencrom which is what we were drinking. This would sharpen you up and make you ready for a bit of the old ultra-violence.
Swith'eil Aerveid, which I took to be the Elven city, immediately made me think of Morrowind. I think it's a perfectly fitting name for a Dunmer or a Dunmer city. Maybe a dwarven ruin.
But the ancients released Karflon B43 into the woods beyond, mutating the monsters therein. Plenty of coin to be made in clearing the elven lands of these fell beasts, for those brave or foolish enough to try.
Also, there is evidence that there might be some hidden meta-language in humans as well, from studies on how bilingual people/polyglots speak, code-switch, translate, etc.
My native language doesn't have gendered personal pronouns.
If I want to use pronouns correctly in English, I have to make a mental pause before writing he or she to figure it out. The woman/man distinction does not come automatically.
When I use or think in English, my brain has weird rules of its own. Some people get their pronoun constantly wrong from me; for others it's random. Context and rhythm can also create a pattern.
It seems obvious to me that there exists something like this shared across all homo sapiens. What we see now as complex language allowing for the broadcast of abstract thoughts and memes, as well as human culture, must both have evolved alongside each other over a period of time that would appear long to us, but is meaningless in terms of genetic drift and "meat evolution". But in order for that process to begin, there must have been an extremely basic structure on which to begin it, which by definition must be in our DNA.
Hard disagree. Seems obvious to me those associations are cultural and extrapolations of previous experiences and stories. Ask these 3 prompts to isolated North Koreans or Papuans (probably after explaining elves are supernatural humanoids from Germanic folklore associated with woods and pointy ears) and I'd bet the house it's gonna be a random distribution.
Involving genetics and evolution in etymology makes no sense.
Yeah, this is pretty silly. Following and recognizing phonetic patterns from science / fantasy fiction naming tropes isn't a "secret language."
Ask a person off the street to come up with an "alien planet in a science fiction TV series" or "deadly gas/virus from a comic book" and you'll get similar answers.
Your average person on the street could probably successfully come up with trope-fitting "evil alien species" and "peaceful alien species" names, too.
And if you asked the average person off a street somewhere with minimal exposure to English and Anglo-American culture to come up with names, they'd probably come up with something completely different, derived from their own cultural tropes and neighbouring foreign languages.
With comparative linguistics, almost everything has an exception.
But when perhaps 90% or more of languages do it a certain way on every continent, one does wonder if our thinking tends a certain way. General word order is another example like that. The indirect object, or patient, or similar category usually is not the first word in a phrase or sentence. "The dog I see" for "I see the dog". "The money him the programmer gave." Like Yoda's species. Only 1 - 2% of languages do it that way consistently, and usually for reasons of recent grammatical development, not as a long-preserved feature.
Similar story in phonetics. Nearly all languages have at least three stop consonants, usually p, t, k: closing against the lips, the teeth and the back of the mouth, the (perhaps obvious?) positions in the mouth for making stop consonants. A few languages only have two of these positions. Some have four or even five, using the throat and the palate too. But >95% of them, all over the place, have at least the p, t, k stops. We aren't hardwired, I suspect, with that particular set of noises the way we are with crying as a baby. You have to learn them all. It is cultural. Yet almost every culture has converged on at least a somewhat similar subset. And a few have not.
There are multiple ways to interpret such results, but I lean to something like convergent and parallel cultural evolution.
> There are multiple ways to interpret such results, but I lean to something like convergent and parallel cultural evolution.
So far as I'm aware, that's a very fringe idea in linguistics, and pretty firmly rejected, despite many investigations.
Rather, what you're seeing is propagation from older language influences. Things like Proto-Indo-European have very, very widespread influence - but are still very much not universal.
The pattern I'm talking about shows up across language families. Japanese, Tamil, Quechua, Latin -- these are not known to be related but tend to Subject-Object-Verb. English, Chinese, Indonesian, Arabic, Igbo are all Subject-Verb-Object. None of these languages are in families known to be related (except English and Latin). Some 90% of languages in hundreds of unrelated language families are SVO or SOV in basic word order. Most of the remainder are VSO. Object-first is extremely rare everywhere. It's a conjecture that, perhaps, all or most of the world's language families are related if you go far enough back, and that proto-World was SOV or SVO, but that has not been demonstrated.
So there's no consensus on why these tendencies exist. The sample size is quite large: there are several hundred documented, not-known-to-be-related language families and language isolates that must have been separated historically from each other for at least a couple thousand years (or their relation would be pretty obvious). It does look like the tendencies mentioned are retained for a long time in separate populations despite language change, for some reason. That it might be because it relates to how we think is speculation, admittedly.
No, not at all striking. [0] Merely a reflection of the older influences. Many, many languages that are not grouped in the same families _are_ descended from PIE and its children. That is to say, they may not be traditionally grouped in the same family, but would probably still be considered part of the Indo-European family.
For an example of this "same family" thing, Japanese, Tamil, Quechua are all probably descended from Proto-Uralic [1], despite not traditionally being considered sister languages.
No. Quechua is spoken in Peru and Ecuador, in South America. Sami is spoken in northern Scandinavia. Tamil is spoken in southern India. Japanese is spoken in Japan. They would have had to have spread so far in the past that any systematic relationship becomes opaque to us, linguistically. A valid and compelling demonstration of their relationship would be mind-blowing. It would link the indigenous cultures and peoples of southern Asia, northern Europe and South America culturally, with a common origin likely within the last ~10,000 years (about the event horizon of the reconstructive method). I've never heard of that claim made seriously, have you? The only language family known to exist in both Asia and America is the Dene-Yeniseian family [1], the existence of which is still a bit controversial, and which is spoken in northern Siberia and North America (which is more like what you'd expect -- though notably, Navajo is in Arizona).
If we're just going to assume on very loose basis from typological comparison, we might as well assume proto-World, because that's what the sum evidence suggests. But that guess (an idea I do take seriously) is very different from demonstrating their relationship through solid comparative and historical linguistics.
> I've never heard of that claim made seriously, have you? The only language family known to exist in both Asia and America is the Dene-Yeniseian family
I assume, then, that you've never heard of the Borean languages [0], which link the people of southern Asia, northern Europe and South America culturally, with a common origin around 40-45,000 years ago. The Borean hypothesis is a claim that is made quite seriously, but does not presuppose a universal origin. It is wide-reaching, but there are people groups outside of it.
Yes, I have heard of it. Borean effectively is a superset of Altaic or the even more controversial Nostratic. The languages of Iberia 3000 years ago, Ancient Egyptian, classical Mayan, Chinese -- all probably related? This is not generally accepted.
It's an interesting idea, I cannot disprove it. But I don't think it has been proven, either. Even the possible relationship between Uralic, Afroasiatic and Indo-European is not widely accepted yet. Those language families are extensively documented and we can reconstruct their proto-languages convincingly back to somewhat around 10,000 years ago. They look kind of similar in some ways, maybe areal effects? I think to hope for another 10,000 years further back is too much. Borean is a claim about probably further back than even that. The Nostratic and Altaic subsets of the Borean hypothesis, presumably with its proto-language somewhere in Asia around 10,000 years ago, alone is controversial and is not generally accepted.
My gripe is not that there is some convergent etymology, it's attributing it to genetics.
I think a physicist would feel similarly if it were attributed to "human energy" or something like that.
Sure, almost all languages share consonants, but there are only so many sounds our mouths can make, so there's nothing surprising about that. Yet there are exceptions, like Dahalo, which uses click consonants.
Sure, there are similar words from different roots. Say, in the 3 languages I speak: mom/maman/mama, dad/papa/tata, no/non/ne. But there's no reason to involve genetics. Some sounds are easier for babies/toddlers to make; it makes sense they would converge on them for universal words like parent appellations.
Protein synthesis doesn't have to play a part in it no matter how magical genetics may seem.
Bit late to answer, so if you read this, sorry. I must've missed your last paragraph which makes it obvious we are thinking about it similarly. I latched onto genetics because it's why I got involved in the thread.
Agreed, without this shared basic structure from birth we would never be able to make sense of any of our sensory inputs of sight/sound/etc, let alone learn a language. Being born with a "clean slate" mind would make interacting with the universe akin to interacting with a black box.
Everybody even has a non-verbal language they mostly are unaware of. It uses feelings with specific meanings, not words. You'll see the next time you know exactly what you want to say but can't recall a proper word in any of the languages you speak.
Yeah, that reminded me of Rorschach test and other projective tests. That thread was an example that shows that people benefit from general education and erudition… and that people need a bit more humility to not put forward grandiose claims.
Yep! I’m sure Rowling did that intentionally. By the way, Haaretz says:
“Abracadabra belongs to Aramaic, a Semitic language that shares many of the same grammar rules as Hebrew, says Cohen in Win the Crowd.
‘Abra' is the Aramaic equivalent of the Hebrew 'avra,' meaning 'I will create,' while 'cadabra' is the Aramaic equivalent of the Hebrew 'kedoobar,' meaning 'as was spoken.'”
Most tellers aren't Jewish. The ability to apply similarities does not imply implication, and says more about the latent racism of the folks making the anti-Semitic accusations. As if being Jewish is all about having pointy ears and a nose as hooked as a parrot's beak? What the fuck.
Yeah, Harry Potter spells often sound like something else. Leviosa is very close to levitate, for instance. So it's not surprising that many of them seem correct.
The tweet is in response to a preliminary paper [1] [2] studying text found in images generated by, e.g., "Two whales talking about food, with subtitles." DALL-E doesn't generate meaningful text strings in the images, but if you feed the gibberish text it produces -- "Wa ch zod ahaakes rea." -- back into the system as a prompt, you would get semantically meaningful images, e.g., pictures of fish and shrimp.
I think the tweeter is being a bit too pedantic. Personally I spent some time thinking about embeddings, manifolds, the structure of language, scientific naming, and what the decoding of the points near the center of clusters in embedding spaces look like (archetypes), after seeing this paper. I think making networks and asking them to explain themselves using their own capabilities is a wonderful idea that will turn out to be a fruitful area of research in its own right.
If DALL-E had a choice to output "Command not understood", maybe we wouldn't be discussing this.
Like those AIs that guess what you draw, and recognize random doodling as "clouds", DALL-E is probably using the least unlikely route. That a gibberish word is drawn as a bird is maybe because it was "bird (2%), goat (1%), radish (1%)".
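A toy illustration of that "least unlikely route" point; the labels and logits below are made up, not DALL-E internals, but they show how an argmax with no reject class still commits to something even when nothing scores well:

```python
import numpy as np

# Hypothetical scores for a gibberish prompt: no concept scores well,
# and there is no "reject / not a valid prompt" class to fall back on.
labels = ["bird", "goat", "radish", "insect", "fish"]
logits = np.array([0.4, 0.1, 0.1, 0.05, 0.0])   # near-flat, low-confidence scores

probs = np.exp(logits) / np.exp(logits).sum()   # softmax always sums to 1
print(dict(zip(labels, probs.round(3))))
print("chosen:", labels[int(np.argmax(probs))]) # "bird" wins with only ~26% probability
```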
That's extremely optimistic. When faced with gibberish, the "confidences" are routinely 90%+, as with "meaningful" input.
It's almost as if it's an illusion designed to fool us, the users: by only providing inputs meaningful to us, we come to the foolish idea that it understands these inputs.
This is a good point. The fact that DALL-E will try to render something, no matter how meaningless the input, is a trait it has in common with many neural networks. If you want to use them for actual work, they should be able to fail rather than freestyle.
Especially since his results confirm most of what the original thread claimed. A couple of the inputs did not reliably replicate, but "for the most part, they're not true" seems straightforwardly false. He even seems to deliberately ignore this sometimes, such as when he says "I don't see any bugs" when there is very obviously a bug in the beak of all but two or three of the birds.
When I zoomed in, I felt only four in ten birds clearly had anything in their beaks, and in each case it looked like vegetable matter. In the original set, only one clearly has an insect in its beak.
Not really, he afterwards says that he was more trying to inject some humility. He really doesn't think this is measuring anything of interest. For the birds result in particular, see https://twitter.com/BarneyFlames/status/1531736708903051265.
If I read what that tweet says properly, the system ended up outputting things that were almost scientific nomenclature for the general class of items it was being asked to draw. There are probably many examples of "bird is an instance of class X" in the text but they are not consistent, and the resulting token vector is a point near the center of "birdspace".
Yes. Indeed, it seems to interpret a lot of nonsense tokens it doesn't recognize as though it's probably the Latin / scientific term for some sort of species it doesn't remember very well (keeping in mind that all these systems are attempting to compress a large corpus into a relatively small space). I think https://twitter.com/realmeatyhuman/status/153173904648934195... is best illustrative of this phenomenon.
So, it's certainly an "interesting" result in the sense that it shows how these kinds of systems work, but it's definitely not a language.
Why is it important if it's "a language" or not? What we're talking about are concept representations (nouns), not languages. But I think most people who read "DALL-E has a secret language" probably picked up on that because we're accustomed to the hype in machine learning naming things to sound like they are more profound and powerful than they really are.
It's important if it's a "language" because the original thread claimed that it was one (and indeed, a number of comments in responses to this article are still making that claim). You may argue that discovering how DALL-E tries to map nonsense words to nouns is independently interesting, and that's fine (I don't find it interesting personally though--considering it has to pick something, and the evidence that these spaces are not particularly robust when confronted with far out of sample input, I don't even think calling it a secret vocabulary would be accurate), but the authors should reasonably expect some pushback if they argue that this is linguistics.
> asking [neural networks] to explain themselves using their own capabilities
Exactly. This could be profound. I'm looking forward to further work here. Sure, the examples here are daft, but developing this approach could be like understanding a talking lion [0] only this time it's a lion of our making.
I think it’s more likely we can train two neural networks, one to make the decision and one to take the same inputs (or the same inputs plus the output from the first one) and generate plausible language to explain the first. This seems to correspond to what we deem consciousness, and frankly I would doubt one system can accurately explain its own mechanism. People surely can’t.
It’s a fruitful area of research for sure, but there is a huge gap between “it invented pig Latin” and “it invented Esperanto/Lojban”. Referring to the first as inventing a language is very misleading.
Given that DALL-E is a giant matrix multiplication that correlates fuzzy concepts in text to fuzzy concepts in images, wouldn't one expect that there will be hotspots of nonsensical (to us) correlations, eg between "apoploe vesrreaitais" and "bird"? Intuitively feels like an aspect of the no free lunch theorem.
Exactly this. At a high level, DALL-E is mapping text to a (continuous) matrix and then mapping that matrix to an image (another matrix). All text inputs will map to _something_. DALL-E doesn't care if that mapping makes sense; it has been trained to produce high-quality outputs, not to ensure the validity of mappings.
None of this makes DALL-E any less impressive to me. High quality image generation is a truly amazing result. Results from foundational models (GPT-3, PaLM, DALL-E, etc) are so impressive that they're forcing us to reconsider the nature of intelligence and raise the bar. That's a sign of a job well done to me.
> it has been trained to produce high-quality outputs, not to ensure the validity of mappings.
Just nitpicking here a bit. It has been trained to ensure the validity of mappings, but only for mappings of valid prompts, where "valid" is vaguely "things that appear in the training set". On the other hand, it wasn't trained to ensure the validity of mappings of invalid prompts to images. It's an open question what that would even mean - someone in another thread here suggested it should output "not a valid prompt" in this case.
But if it's just mapping text to image then it would be fair to assume that using the same text would result in the same image. But does that actually happen?
As much as people would like there to be, there really does not seem to be anything here. The original author doesn't think so, either (would need to refind the tweet).
I took the previous user's question as asking more like: Say there's an image in the training set with the description "Salvador Dali sitting at desk with pen and paper, 1957", and you put in the prompt "Salvador Dali sitting at desk with pen and paper, 1957", how close would the result be to that original training image?
It doesn’t have to be the same to achieve low loss. The date is not relevant, for example, for the image content. It can memorize a few images, but you can’t “compress” all of the internet’s images into text (if you can, make a startup with the most efficient compression algorithm ever made).
It makes sense that it would have weird connections, but the big claim here is that it outputs those connections as rendered text despite failing to output actual text it was trained on and prompted with. That sounds very unexpected to me and would require a lot of evidence (which would be easy to cherry-pick), though this debunking wasn’t convincing either.
Yeah. The problem here is that the network only has room for concepts, and hasn't been trained to see meaningless crap. Nor does it really have any way to respond with "This isn't a sentence I know", it just has to come up with an image that best matches whatever prompt it has been fed.
This just feels like one of those topics where you'd really want a linguist, someone who really understands the construction and evolution of language, to observe some of the underlying reasons for why language is constructed the way it is. Because I guess that's partly what DALL-E is: it's trying to approximate that, and the interesting thing would be where it differs from real language, rather than where it matches it. If I give it a made-up word that looks like a Latin phrase that looks like a species of bird, then it working as if I've given it a Latin phrase that is a species of bird is pretty reasonable. If you said "Homo heidelbergensis" to me I wouldn't know that was a species of pre-historic human, but I would feel pretty comfortable making that kind of leap.
I also think you could probably hire a team of linguists pretty cheap compared to a team of AI engineers.
I don't think that this is related to language at all. First, let's ask: is there a way for DALL-E to refuse an output (as in, "this makes no sense")? Then, what would we expect the output for gibberish to be like? Isn't this still subject to filtering for best "clarity" and best signals? While I don't think that these are collisions in the traditional sense of a hash collision, any input must produce a signal, as there is no null path, and what we see is sort of a result of "collisions" with "legitimate" paths. Still, this may tell us something about the inner structure.
Also, there is no way for vocabulary to exist on its own without grammar, as these are two sides of the phenomenon, we call language. Some signs of grammar had to emerge together with this, at once. However…
----
Edit: Let's imagine a typical movie scene. Our nondescript individual points at himself and utters "Atuk" (yes, Ringo Starr!) and then points at his counterpart in this conversation, who utters "Carl Benjamin von Richterslohe". This involves quite an elaborate system of grammar, where we already know that we're asking for a designator, that this is not the designator for the act of pointing, and that by decidedly pointing at a specific object, we'd ask for a specific designator not a general one. Then, C.B. von Richterslohe, our fearless explorer, waves his hand over the backdrop of the jungle, asking for "blittiri" in an attempt to verify that this means "bird", for which Atuk readily points out a monkey. – While only nouns have been exchanged, there's a ton of grammar in this.
And we haven't even arrived at things like, "a monkey sitting at the foot of a tree". Which is mostly about the horizontal and vertical axes of grammar, along which we align things and where we can substitute one thing for another in a specific position, which ultimately provides them with meaning (by what combinations and substitutions are legitimate ones and which are not).
Now, in light of this, that specific compounds are changing their alleged "meaning" radically, when aligned (composed), doesn't allow for high hopes for this to be language.
> Now, in light of this, that specific compounds are changing their alleged "meaning" radically, when aligned (composed), doesn't allow for high hopes for this to be language.
To clarify: that these triggers do produce (radically) different results when provided in varying compositional contexts, like "as cartoon", "as painting", etc. (i.e., "a as b"), suggests that these are just random alliterations that serve merely by accident as synonyms, rather than having a specific value in that position. The latter is a requirement for language. (And it wouldn't be too far-fetched to suppose that these are just some "residual" values showing up in pseudo-textual compositions. As there had to be some trigger for that.)
I was thinking about a system for pulling data from verbal nonsense the other day, speaking in tongues or something similar. I can create a bunch of noises that lack obvious meaning for me, but obviously they have some meaning that can be learned since humans are terrible at being truly random (lol XD).
I wonder what level I would be able to share ideas I lack the words for, my perceived bitrate at creating "random" noise is certainly higher than when verbally communicating an idea to another human. Will we even share a common language in the future? Or will we have our own language that is translated to other people?
Well, I can only answer with kind of a pun. With Wittgenstein, language is a constant conversation about the extent of the world, about what is and what is not. As such, it is necessarily shared. In the tractatus we find,
> 5.62 (…) For what the solipsist means is quite correct; only it cannot be said, but makes itself manifest.
The world is my world: this is manifest in the fact that the limits of language (of that language which alone I understand) mean the limits of my world. [1]
So, something could become apparent, but you still wouldn't have said anything (as it's not part of that conversation). ;-)
I believe the original claim has substance behind it, and it was a very interesting, non-trivial observation.
Also, it proposes a very exciting idea for an emergent phenomenon that, if understood, could have deep consequences for our understanding of knowledge, language and a lot of other related topics.
So his argument is that the text clearly maps to concepts in the latent space, but when composing them the results are unexpected, so it isn't language? Why isn't this better described as 'the rules of composition are unknown'?
But don't we already know that composition exists in DALL-E? Don't the points shown in the tweet indicate that some form of composition exists? The 3D renders are clearly render-like, the painting and cartoons are clearly in the appropriate style.
"That there exist rules of composition of the hypothesized secret DALL-E language" is a much stronger claim than that it "understands" composition of text in the real languages it was trained on.
Though I'll also point out that even evidence for that weaker claim is tenuous. It definitely knows how to move an image closer to "3D render" in concept-space, but it doesn't seem to understand the linguistic composition of your request. For example, you'd have an extremely hard time getting it to generate an image of a person using 3D rendering software, or a "person in any style that isn't 3D render"; it would probably just make 3D renders of persons.
I haven't played around with it myself, I'm going off the experiences of others. For example:
These conversations so routinely devolve into crowdsourced attempts to define notoriously tricky words like “language”, and “intelligence”.
These absurdly big, semi-supervised transformers are predicting what the next pixel or word or Atari move is. They’re strikingly good at it. To accomplish this they build up a latent space where all the pictures of sunglasses and the word “shades” are cosine similar, and quite different to “dog” or a picture of a dog, and have an operator (in word2vec, addition, in DALL-E, something nonlinear) that can put sunglasses on a dog.
Is that latent space and all the embeddings into it a “language”? Who cares? It works and it’s fucking cool.
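To make the "cosine similar" and "operator" ideas above concrete, here's a toy numpy sketch; the vectors are random stand-ins rather than real CLIP or word2vec embeddings, and the naive addition is only meant to illustrate the word2vec-style composition mentioned:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend embeddings: "shades" (text) and a sunglasses image land near each other,
# while "dog" points in an unrelated direction. In a trained model this is learned.
shades_text = rng.normal(size=dim)
sunglasses_image = shades_text + 0.1 * rng.normal(size=dim)  # nearby point
dog = rng.normal(size=dim)                                   # unrelated direction

print(cosine(shades_text, sunglasses_image))  # close to 1.0
print(cosine(shades_text, dog))               # near 0.0

# word2vec-style composition as naive vector addition: "dog" + "shades"
dog_with_shades = dog + shades_text
print(cosine(dog_with_shades, dog), cosine(dog_with_shades, shades_text))  # similar to both
```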
It acts like a reverse Rorschach test, where they give you a nonsensical picture and ask for a forced caption from the subject. If you set the task to generate something no matter what, you get something no matter what.
It is trivial to make it reject gibberish prompts. Just use a generative model to estimate the probability of the input, it's what language models do by definition.
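A rough sketch of that filtering idea, using an off-the-shelf causal language model to score the prompt. GPT-2 via Hugging Face is just a convenient stand-in scorer here, and the perplexity threshold is an arbitrary assumption that would need tuning:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(text: str) -> float:
    """Average per-token perplexity of the prompt under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
    return float(torch.exp(loss))

PPL_THRESHOLD = 500.0  # arbitrary cutoff; real prompts tend to score far lower than gibberish

for prompt in ["Two whales talking about food", "Wa ch zod ahaakes rea"]:
    ppl = prompt_perplexity(prompt)
    verdict = "ok" if ppl < PPL_THRESHOLD else "reject: likely gibberish"
    print(f"{ppl:8.1f}  {verdict}  {prompt!r}")
```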
Are this and the previous tweet an ML-guys discussion? My layman understanding of neural networks is that the core operation is that you basically kick a figure down a hill and see where it ends up, but both the figure and the hill are N-dimensional objects, where N is too huge to comprehend. Of course some nonsensical figures end up at valid locations, but can you really expect some stable inner structure of the hill-figure interaction? I think it’s unlikely that there is a place in a learning method to produce one. NNs can give interesting results, but they don’t magically rewrite their own design yet.
Would still be interesting to see how the output changes with little changes to these inputs. If my vague understanding is at all close, this will reveal the “faces” that are more “noisy” than the others. Not sure what that gives though.
The tweet is wrong and this is important. There’s a difference between DALL-E and DALL-E 2. This phenomenon is intuitive if you know how diffusion models work (DALL-E 2)
1. Tokenized text embeddings can map to similar points in latent space. The text encoder is autoregressive, so this won’t work for all random sequences of tokens, but it can work for the right ones. I wonder if anyone has tried reverse-decoding the embeddings of interest to see if they cluster around known words that are relevant (a toy sketch of that idea follows below).
2. Diffusion models are trained by pushing off manifold points onto the manifold so to speak. It is not surprising that off manifold points map onto known concepts during the reversal process.
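A toy sketch of the reverse-decoding idea from point 1. With a real model you would use its learned token-embedding table and the tokenizer's vocabulary; here both the vocabulary and the "embedding of interest" are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["bird", "avis", "passer", "office", "whale", "shrimp"]  # toy vocabulary
E = rng.normal(size=(len(vocab), 32))                            # stand-in embedding table
E /= np.linalg.norm(E, axis=1, keepdims=True)

def nearest_tokens(query: np.ndarray, k: int = 3):
    """Return the k vocabulary tokens whose embeddings are most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    sims = E @ q
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]

# A hypothetical "embedding of interest", e.g. the encoding of a gibberish prompt.
mystery = rng.normal(size=32)
print(nearest_tokens(mystery))
```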
Too pedantic - still obviously something interesting going on here and I don't find myself convinced otherwise just because the original claim isn't as clean as initially presented.
IMO these words were part of some training images (e.g. taken from nature atlases) and DALL-E learned to associate them with birds, although in gibberish form.
There's some form of language here... the correlations are evidence enough. The grammar, I believe, is complex and likely not human grammar; thus certain words, when paired with other words, can negate the meaning of a word altogether or even completely change it.
For example "hedge" combined with "hog" is neither a "hedge" nor is it a "hog" nor is it some sort of horrific hybrid mixture of hedges and hogs. A hedgehog is tiny rodent. Most likely this is what's going on here.
The domain is almost infinite, and the range is even greater. Thus it's actually realistic to say that there must be hundreds of input and output sets that form alternative languages.
I don't think there is any evidence of a language here unless you stress the definition to the point of absurdity. It will not even reliably produce the same kinds of images that had the text that it output, which was the original premise of the claim. Obviously, probing some overconstrained high dimensional space where it's never rewarded for uncertainty has to produce something; that doesn't mean that something is a language.
> Obviously, probing some overconstrained high dimensional space where it's never rewarded for uncertainty has to produce something; that doesn't mean that something is a language.
But if one of those probes towards unrewarded input produces a correlation then SOME side effect is influencing it. It means there can be side effects ALL over the unrewarded space.
That being said, the rewarded space is tiny compared to the unrewarded space: 1 over infinity for all intents and purposes. It's basically all possible grammatically correct English combinations vs. all combinations of letters.
The unrewarded space is massively huge. Within that unrewarded space, given how massive it is, there is actually a very high probability that there are at least several sets of inputs in there that form a grammar and a consistent language with consistent outputs.
But these sets are hard to find; you can't just pick anything. If you pick one secret word that has a correlation with birds, then mash it up with English expecting it to stay coherent... well, that's simply an invalid set.
This could actually be an interesting project: some algorithm that explores this space attempting to map out connections. It would have to be another ML algorithm, and likely that search is still never-ending; but like the Library of Babel, in terms of probability something must be out there that works.
> The unrewarded space is massively huge. Within that unrewarded space, given how massive it is, there is actually a very high probability that there are at least several sets of inputs in there that form a grammar and a consistent language with consistent outputs.
I think this is a large logical leap. The unrewarded space may be huge, but it is not large enough that it's almost guaranteed (or even close) that we'll find something that looks like language in there. If we did find something that looked like language, even in the unrewarded space, it would be very surprising, which is why the initial post that inspired this response was so talked about! But we have not.
Given the size of the space and the fact that we already found one word it indicates that the probability is high enough that tons of other things like languages exist in that space.
It's like if I gave you a lottery ticket and you won. The lottery ticket is more likely to be rigged than you actually winning. Or in other words, the fact that we even found a secret word is indicative that there's a lot going on in the unrewarded space.
The word wasn't selected randomly, though, so it doesn't tell us much about the broader unrewarded space. And there's a fair amount of evidence that (1) the tokenization of the word includes a bunch of syllables that correspond to the Latin names of birds, and (2) this same technique doesn't work with other reproduced nonsense text (i.e. using the nonsense text produced when asking about workers working in offices gives you more nature scenes, not offices). This makes it difficult to argue that what is going on here is even a secret vocabulary, let alone a secret language.
That Twitter thread made me MORE of a believer that DALL-E has a language of its own. As others said, it seems like the argument is more about defining "language".
it's not about defining language, it's about refuting the original claim, which was that piping these symbols _back in_ could trigger similar semantic categories - which it clearly can't.
A bunch of people didn't read the original study and just saw the pictures and assumed the gibberish is the only result being discussed.
I think this guy is being a bit pedantic. It is returning semi-consistent results for gibberish, which is interesting. That's all the original poster meant.
I do not believe AI claims whenever I read them, and this is happening at the risk of me being a cynic and a disbeliever in the field. And I am not sure if that's bad for society or bad just for me.
I am more likely to believe celebrity gossip than AI news articles.
This is one of my favorite topics in all of AI. It was the most surprising and mysterious discovery for me.
The answer is that the training process literally has to make the results smooth. That’s how training works.
Imagine you have 100 photos. Your job is to classify them by color. You can place them however you want, but similar colors should be physically closer together.
You can imagine the result would look a lot like a photoshop RGB picker, which is smooth.
The surprise is, this works for any kind of input. Even text paired with images.
The key is the loss function (a horrible name). In the color picker example, the loss function would be how similar two colors are. In the text to image example, it’s how dissimilar the input examples are from each other (Contrastive Loss). The brilliance of that is, pushing dissimilar pairs apart is the same thing as pulling similar pairs together, when you train for a long time on millions of examples. Electrons are all trying to push each other apart, but your body is still smooth.
The reason it’s brilliant is because it’s far easier to measure dissimilar pairs than to come up with a good way of judging “does this text describe this image?” — you definitely know that it isn’t a bicycle, but you might not know whether a car is a corvette or a Tesla. But both the corvette and the Tesla will be pushed away from text that says it’s a bicycle, and toward text that says it’s a car.
That means for a well-trained model, the input by definition is smooth with respect to the output, the same way that a small change in {latitude,longitude} in real life has a small change in the cultural difference of a given region of the world.
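For anyone who wants to see the shape of that objective in code, here's a compact sketch of a symmetric contrastive (CLIP-style) loss. The embeddings are random stand-ins, and the 512-dim size and 0.07 temperature are illustrative assumptions, not DALL-E's actual numbers:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matched (image, text) pairs are pulled together,
    every mismatched pair in the batch is pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # [batch, batch] similarity matrix
    targets = torch.arange(len(image_emb))               # the diagonal holds the true pairs
    loss_i = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return (loss_i + loss_t) / 2

# Toy batch: 4 images and their 4 captions, as random stand-in embeddings.
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
print(clip_style_contrastive_loss(img, txt))
```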
It doesn’t exist. The above explanation is the result of me spending almost all of my time immersing myself in ML for the last three years.
gwern helped too. He has an intuition for ML that I’m still jealous of.
Your best bet is to just start building things and worry about explanations later. It’s not far from the truth to say that even the most detailed explanation is still a longform way of saying “we don’t really know.” Some people get upset and refuse to believe that fundamental truth, but I’ve always been along for the ride more than the destination.
It’s never been easier to dive in. I’ve always wanted to write detailed guides on how to start, and how to navigate the AI space, but somehow I wound up writing an ML fanfic instead: https://blog.gpt4.org/jaxtpu
(Fun fact: my blog runs on a TPU.)
I’m increasingly of the belief that all you need is a strong desire to create things, and some resources to play with. If you have both of those, it’s just a matter of time — especially putting in the time.
That link explains how to get the resources. But I can’t help with how to get a desire to create things with ML. Mine was just a fascination with how strange computers can be when you wire them up with a small dose of calculus that I didn’t bother trying to understand until two years after I started.
(If you mean contrastive loss specifically, https://openai.com/blog/clip/ is decent. But it’s just a droplet in the pond of all the wonderful things there are to learn about ML.)
IMO the term "cost function" is much more intuitive than "loss function" - it tells you the cost, which it attempts to minimize by some iterative process (in this case training)
I actually completely lost interest once I found this out. Simply taking some ML course, like the old Andrew Ng courses online, is enough for you to get the general idea.
ML is simply curve fitting. It's an applied math problem that's quite common. In fact I lost a lot of interest in intelligence in general once I realized this was all that was going on. The implication is that all of intelligence is really some form of curve fitting.
The simplest form of this is linear regression which is used to derive an equation for a line from a set of 2D points. All ML is basically a 10,000 (or much more) dimensional extension of that. The magic is lost.
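To make that concrete, here's the simplest version of the curve-fitting story: ordinary least squares on noisy 2D points. The slope, intercept and noise level below are arbitrary made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.shape)  # noisy samples of y = 3x + 2

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit of a degree-1 polynomial
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
# Deep nets do the same thing in spirit, just with far more parameters and dimensions.
```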
Most ML research is just about finding the most efficient way to find the best-fitting curve given the least amount of data points. An ML guy's knowledge is centered around a bunch of tricks and techniques to achieve that goal with some N-D template equation. And the general template equation is always the same: a neural network. The answer to what intelligence is seems to be quite simple and not that profound at all... which makes sense given that we're able to create things like DALL-E in such a short time frame.
One of the big mysteries of the universe (intelligence) and the thing I always wondered about was essentially answered within the last 2 decades which is pretty cool.. but it's like knowing the secret behind an amazing magic trick.
But for me, that means it’s far more interesting than AGI. Everyone has their eye on AGI, and no one seems to be taking ML at face value. That means the first companies to do it will stand to make a fortune.
Why do people use analogies to prove a point? It doesn't prove anything.
What was your point here? ML is like a guitar? What you said doesn't seem to contradict anything I said, other than that you find curve fitting interesting and I don't.
Not trying to be offensive here, don't take it the wrong way.
On a side note your music example also basically destroys the question of what is music? Well the answer to that question is that music is the set of all points on some sort of N-dimensional curve. The profoundness of the question is completely gone.
This is largely a pointless semantic debate, but to risk wasting my time: transformers specifically in LLMs are doing more than curve fitting as there can never be enough training data to naively interpolate between training examples to construct the space of semantically valid text strings. To find meaningful sentences between training examples, the intrinsic regularity of the underlying distribution must be modeled. This is different from merely curve fitting. To drive this point home, some examinations of transformer behavior in LLMs show emergent structures that capture arbitrary patterns in the input and utilize them in constructing the output. This is not merely curve fitting.
Bro. The basic fundamental idea of computer logic has always been trivial, even without an understanding of binary. There are tons of mechanisms outside of binary that can mimic logic; there was never anything mysterious here. Understanding boolean logic and architecture is not a far leap from an intuitive understanding of how computers work.
Human thought and human intelligence, on the other hand, were a great and epic concept on the scale of the origin of the universe. It was truly this mysterious epic thing that seemed like something we would never crack. ML brought it down and completely reduced this concept, simplifying it by a massive scale. The entire field is now an extension of this curve fitting concept. And the disappointing thing is that the field is correct. That's all intelligence is in the end.
This is all I mean. Not saying ML is less interesting or easier than any other STEM field. All I'm saying is the reduction was massive. The progress is amazing but 99% of the wonder was lost. The scale at which we lacked understanding was covered in a single step, and now the average person can understand the basics more easily than they can understand something like quantum mechanics. There's still a lot going on in terms of things to discover and things to engineer, but the fundamentals of what's going on are clearer than ever before.
So I think what happened here is that you mistook what I wrote and took offense as if I was attacking the field, I'm not. I'm writing this to explain to you that you're mistaken.
So dial your aggressive shit back. Is everyone from Romania like you? I certainly hope not.
Usually when people say "ML is just curve fitting" they mean to continue with something like "so it will never be able to compete with humans."
The interesting thing to me about the secret language is that it seems to imply that when DALL-E fit words to concepts, it created extrapolations in its curve fit that are more extreme than the actual training samples, ie. its fit has out-of-domain extrema. So there are letter sequences that are more "a whale on the moon" than the actual text "a whale on the moon." Linguistic superstimulus.
Yes, I can confirm that's how I read the "just curve fitting" bit.
Regarding the gibberish word to image issue - CLIP uses a text transformer trained by contrastive matching to images. That means it's different from GPT, which is trained to predict the probability of the next word. GPT would easily tell apart gibberish words from real words, or incorrect syntax, because they would be low-probability sequences. The CLIP text transformer doesn't do that because of the task formulation, not because of an intrinsic limitation. It's not so mysterious after realising they could have used a different approach to get both the text embedding and gibberish filtering, if they wanted.
A good analogy would be a Rorschach test - show an OOD image to a human asking him to caption it. They will still say something about the image, just like DALL-E will draw a fake word. It's because the human is expected to generate a phrase no matter if the image makes sense or not, and DALL-E has a similar demand. The task formulation explains the result.
The mapping from nonsense word to image is explained by the continuous embedding space of the prompt and the ability to generate images from noise of the diffusion model. Any point in the embedding space, even random ones, fall closer to some concepts and further from other concepts. The lucky concept most similar to the random embedding would trigger the image generation.
Usually, except I went on to elaborate that curve fitting was essentially what intelligence was. If Mr. Genius here had read my post more carefully, he wouldn't have had to reveal how immature he was with those comments.
It's OBVIOUS what's going on here. When you combine TWO different languages you get stuff that appears as NONSENSE. You have to stay in the same language!
There is for sure a set of consistent words that produce output that makes sense to us. He just picked the wrong set!
DALL-E 2 has a secret language - https://news.ycombinator.com/item?id=31573282 - May 2022 (109 comments)