This is also just eliding the work that humans still have to do.
I give you an audio file (let's say just regular old PCM wav format). You cannot do anything with this without making some decisions about what happens next to the data. For audio, at a minimum you're faced with the question of whether to do a transform into the frequency domain. If you don't do that, there's a ton of feature classification that can never be done. No audio-to-vector model can make that sort of decision for itself - humans have to make that possible.
Raw inputs are suitable for some things, but E5 is essentially just a model that already has a large number of assumptions built into it, assumptions that happen to give pretty good results. Nevertheless, if you were interested, for some weird reason, in a very strange metric of text similarity, nothing prebuilt, not even E5, is going to give that to you. Let's look at what E5 does:
> The primary purpose of embedding models is to convert discrete symbols, such as words, into continuous-valued vectors. These vectors are designed in such a way that similar words or entities have vectors that are close to each other in the vector space, reflecting their semantic similarity.
This is great, but useful for only a particular type of textual similarity consideration.
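To make that concrete, here's a minimal sketch of the "close in vector space" notion the quote describes, using cosine similarity (the usual choice for embedding models; the toy vectors below are made up, standing in for real sentence embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """One particular notion of 'closeness': the cosine of the angle
    between two embedding vectors, ignoring their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of two similar sentences.
a = np.array([0.20, 0.90, 0.10])
b = np.array([0.25, 0.85, 0.05])

print(cosine_similarity(a, b))  # close to 1.0: "semantically similar"
```

Note that this metric bakes in a choice: two texts are "similar" exactly when the model happens to place them at a small angle. Any other notion of similarity you care about is invisible to it.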
And oh, what's this:
> However, the innovation doesn’t stop there. To further elevate the model’s performance, supervised fine-tuning was introduced. This involved training the E5 embeddings with labeled data, effectively incorporating human knowledge into the learning process. The outcome was a consistent improvement in performance, making E5 a promising approach for advancing the field of embeddings and natural language understanding.
Hmmm ....
Anyway, my point still stands: choosing how to transform raw data into "features" is a human activity, even if the actual transformation itself is automated.
I agree with your point at the highest (pretrained model architect) level, but tokenization/encoding things into the frequency domain are decisions that typically aren’t made (or thought of) by the model consumer. They’re also not strictly theoretically necessary and are artifacts of current compute limitations.
Btw E5 != E5 Mistral, the latter achieves SOTA performance without any labeled data - all you need is a prompt to generate synthetic data for your particular similarity metric.
> Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets… We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages.
It’s true that ultimately there’s a judgement call (what does “distance” mean?), but I think the original post far overcomplicates what’s standard practice today.
Sorry, I just don't believe this generalizes in any meaningful sense to arbitrary data.
You cannot determine frequencies from audio PCM data directly. If you want to build a vector database of audio, with frequency (or frequencies) as one of the features, at the very least you will have to arrange for a transform to the frequency domain. Unless you claim that a system is somehow capable of discovering Fourier's theorem and implementing the transform for itself, this is a necessary precursor to any system being able to embed using a vector that includes frequency considerations.
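Here's a sketch of the kind of human-chosen transform I mean, using a synthetic 440 Hz tone in place of real PCM samples:

```python
import numpy as np

# Synthetic stand-in for PCM samples: one second of a 440 Hz tone
# at a 16 kHz sample rate.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
pcm = np.sin(2 * np.pi * 440 * t)

# The step a human decided to apply: a real FFT into the frequency domain.
spectrum = np.abs(np.fft.rfft(pcm))
freqs = np.fft.rfftfreq(len(pcm), d=1 / sample_rate)

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # dominant frequency: 440.0
```

Nothing about the raw sample array says "apply an FFT here"; that's a prior judgment that frequency content is the feature worth exposing.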
But ... that's a human decision, because humans think that frequencies are important to their experience of music. A person who is totally deaf, however, and thus has extremely limited frequency perception, can (often) still detect rhythmic structure via bone conduction. Such a person who was interested in similarity analysis of audio would have no reason to perform a domain transform, and would be more interested in timing correlations. Those could probably be fully automated into various models, as long as someone remembers to ensure that the system is time-aware - which is, again, just another particular human judgement about which aspects of the audio have significance.
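A minimal sketch of that alternative judgement - extracting rhythm with no frequency-domain transform at all, just a short-time energy envelope and its autocorrelation (the click-track signal here is synthetic, standing in for real PCM):

```python
import numpy as np

# Synthetic stand-in for PCM: a click every 0.5 s, i.e. pure rhythm.
sample_rate = 8_000
n = sample_rate * 4  # four seconds
pcm = np.zeros(n)
pcm[::sample_rate // 2] = 1.0

# Short-time energy envelope in 10 ms frames. No FFT anywhere.
frame = sample_rate // 100
envelope = pcm[: n - n % frame].reshape(-1, frame).sum(axis=1)

# Autocorrelation of the envelope: its first non-zero-lag peak
# is the beat period.
ac = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
lag = np.argmax(ac[1:]) + 1  # skip the trivial zero-lag peak
beat_period_s = lag * frame / sample_rate

print(beat_period_s)  # ~0.5 s between beats
```

Same raw bytes, completely different feature - because a different human decided that timing, not frequency, was the thing that mattered.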
I just read the E5 Mistral paper. I don't see anything that contradicts my point, which wasn't about the need for human labelling, but about the need for human identification of significant features. In the case of text, because of the way languages evolve, we know that a semantic-free character-based analysis will likely bump into lots of interesting syntactic and semantic features. Doing that for arbitrary data (images, sound, air pressure, temperature) lacks any such pre-existing reason to treat the data in any particular way.
Consider, for example, if the "true meaning" of text were encoded in a somewhat Kabbalah-esque scheme, in which far-distant words and even phonemes create tangled loops of reference and meaning. Even a system like E5 Mistral isn't going to find that, because that's absolutely not how we consider language to work, and thus that's not part of the fundamentals of how even E5 Mistral operates.
Understanding audio with inputs in the frequency domain isn’t required for understanding frequencies in audio.
A large enough system with sufficient training data would definitely be able to come up with a Fourier transform (or something resembling one), if encoding it helped the loss go down.
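There's a concrete reason to believe this: the DFT is a linear transform, so its coefficients fit exactly into the weight matrix of a single dense layer with no activation. A sketch (the weights here are written down analytically, standing in for what gradient descent could in principle converge to):

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
signal = rng.standard_normal(n)

# DFT matrix entries e^{-2*pi*i*j*k/n}, split into the real and imaginary
# parts a real-valued dense layer could hold as weights.
k = np.arange(n)
angles = -2 * np.pi * np.outer(k, k) / n
dft_real = np.cos(angles)
dft_imag = np.sin(angles)

# "Forward pass" through the two weight matrices vs. an actual FFT.
via_matrix = dft_real @ signal + 1j * (dft_imag @ signal)
via_fft = np.fft.fft(signal)

print(np.allclose(via_matrix, via_fft))  # True
```

So a frequency-domain representation isn't architecturally out of reach; the open question is only whether training pressure would actually discover it, not whether the network can express it.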
> In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
Today’s diffusion models learn representations from raw pixels, without even the concept of convolutions.
Ditto for language - as long as the architecture is 1) capable of modeling long-range dependencies and 2) can be scaled reasonably, whether you pass in tokens, individual characters, or raw ASCII bytes is irrelevant. Character-based models perform just as well as (or better than) token/word-level models at a given parameter count and training corpus size - the main reason they aren't common (yet) is memory limitations, not anything fundamental.
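For what it's worth, the byte-level alternative really is that simple - the "tokenizer" is just UTF-8 encoding, with a fixed vocabulary of at most 256 symbols and a lossless round trip:

```python
text = "Tokens aren't fundamental."

# Byte-level "tokenization": model input is just the raw UTF-8 bytes.
byte_ids = list(text.encode("utf-8"))

print(byte_ids[:5])       # [84, 111, 107, 101, 110]  ('T','o','k','e','n')
print(max(byte_ids) < 256)  # True: vocabulary never exceeds 256 symbols

# Round trip: nothing was lost going to bytes and back.
assert bytes(byte_ids).decode("utf-8") == text
```

The price, as noted, is sequence length: byte sequences are several times longer than token sequences for the same text, which is exactly the memory pressure keeping these models uncommon.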