This is also just eliding the work that humans still have to do.
I give you an audio file (let's say just regular old PCM wav format). You cannot do anything with this without making some decisions about what happens next to the data. For audio, at a minimum you're faced with the question of whether to do a transform into the frequency domain. If you don't do that, there's a ton of feature classification that can never be done. No audio-to-vector model can make that sort of decision for itself - humans have to make that possible.
Raw inputs are suitable for some things, but E5 is essentially just a model that already has a large number of assumptions built into it, assumptions that happen to give pretty good results. Nevertheless, if you were interested, for some weird reason, in a very strange metric of text similarity, nothing prebuilt, not even E5, is going to give that to you. Let's look at what E5 does:
> The primary purpose of embedding models is to convert discrete symbols, such as words, into continuous-valued vectors. These vectors are designed in such a way that similar words or entities have vectors that are close to each other in the vector space, reflecting their semantic similarity.
This is great, but useful for only a particular type of textual similarity consideration.
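To make that concrete, here's a minimal sketch of the "close in vector space" notion the quote describes, using cosine similarity (the usual choice for embedding models; the toy vectors below are made up, standing in for real sentence embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """One particular notion of 'closeness': the cosine of the angle
    between two embedding vectors, ignoring their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of two similar sentences.
a = np.array([0.20, 0.90, 0.10])
b = np.array([0.25, 0.85, 0.05])

print(cosine_similarity(a, b))  # close to 1.0: "semantically similar"
```

Note that this metric bakes in a choice: two texts are "similar" exactly when the model happens to place them at a small angle. Any other notion of similarity you care about is invisible to it.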
And oh, what's this:
> However, the innovation doesn’t stop there. To further elevate the model’s performance, supervised fine-tuning was introduced. This involved training the E5 embeddings with labeled data, effectively incorporating human knowledge into the learning process. The outcome was a consistent improvement in performance, making E5 a promising approach for advancing the field of embeddings and natural language understanding.
Hmmm ....
Anyway, my point still stands: choosing how to transform raw data into "features" is a human activity, even if the actual transformation itself is automated.
I agree with your point at the highest (pretrained model architect) level, but tokenization/encoding things into the frequency domain are decisions that typically aren’t made (or thought of) by the model consumer. They’re also not strictly theoretically necessary and are artifacts of current compute limitations.
Btw E5 != E5 Mistral, the latter achieves SOTA performance without any labeled data - all you need is a prompt to generate synthetic data for your particular similarity metric.
> Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets… We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages.
It’s true that ultimately there’s a judgement call (what does “distance” mean?), but I think the original post far overcomplicates what’s standard practice today.
Sorry, I just don't believe this generalizes in any meaningful sense to arbitrary data.
You cannot determine frequencies from audio PCM data directly. If you want to build a vector database of audio, with frequency (or frequencies) as one of the features, at the very least you will have to arrange for a transform to the frequency domain. Unless you claim that a system is somehow capable of discovering Fourier's theorem and implementing the transform for itself, this is a necessary precursor to any system being able to embed using a vector that includes frequency considerations.
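Here's a sketch of the kind of human-chosen transform I mean, using a synthetic 440 Hz tone in place of real PCM samples:

```python
import numpy as np

# Synthetic stand-in for PCM samples: one second of a 440 Hz tone
# at a 16 kHz sample rate.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
pcm = np.sin(2 * np.pi * 440 * t)

# The step a human decided to apply: a real FFT into the frequency domain.
spectrum = np.abs(np.fft.rfft(pcm))
freqs = np.fft.rfftfreq(len(pcm), d=1 / sample_rate)

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # dominant frequency: 440.0
```

Nothing about the raw sample array says "apply an FFT here"; that's a prior judgment that frequency content is the feature worth exposing.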
But ... that's a human decision, because humans think that frequencies are important to their experience of music. A person who is totally deaf, however, and thus has extremely limited frequency perception, can (often) still detect rhythmic structure via bone conduction. Such a person who was interested in similarity analysis of audio would have no reason to perform a domain transform, and would be more interested in timing correlations. Those could probably be fully automated into various models, as long as someone remembers to ensure that the system is time-aware - which is, again, just another particular human judgement about which aspects of the audio have significance.
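A minimal sketch of that alternative judgement - extracting rhythm with no frequency-domain transform at all, just a short-time energy envelope and its autocorrelation (the click-track signal here is synthetic, standing in for real PCM):

```python
import numpy as np

# Synthetic stand-in for PCM: a click every 0.5 s, i.e. pure rhythm.
sample_rate = 8_000
n = sample_rate * 4  # four seconds
pcm = np.zeros(n)
pcm[::sample_rate // 2] = 1.0

# Short-time energy envelope in 10 ms frames. No FFT anywhere.
frame = sample_rate // 100
envelope = pcm[: n - n % frame].reshape(-1, frame).sum(axis=1)

# Autocorrelation of the envelope: its first non-zero-lag peak
# is the beat period.
ac = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
lag = np.argmax(ac[1:]) + 1  # skip the trivial zero-lag peak
beat_period_s = lag * frame / sample_rate

print(beat_period_s)  # ~0.5 s between beats
```

Same raw bytes, completely different feature - because a different human decided that timing, not frequency, was the thing that mattered.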
I just read the E5 Mistral paper. I don't see anything that contradicts my point, which wasn't about the need for human labelling, but about the need for human identification of significant features. In the case of text, because of the way languages evolve, we know that a semantic-free character-based analysis will likely bump into lots of interesting syntactic and semantic features. Doing that for arbitrary data (images, sound, air pressure, temperature) lacks any such pre-existing reason to treat the data in any particular way.
Consider, for example, if the "true meaning" of text were encoded in a somewhat Kabbalah-esque scheme, in which far-distant words and even phonemes create tangled loops of reference and meaning. Even a system like E5 Mistral isn't going to find that, because that's absolutely not how we consider language to work, and thus that's not part of the fundamentals of how even E5 Mistral operates.
Understanding audio with inputs in the frequency domain isn’t required for understanding frequencies in audio.
A large enough system with sufficient training data would definitely be able to come up with a Fourier transform (or something resembling one), if encoding it helped the loss go down.
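There's a concrete reason to believe this: the DFT is a linear transform, so its coefficients fit exactly into the weight matrix of a single dense layer with no activation. A sketch (the weights here are written down analytically, standing in for what gradient descent could in principle converge to):

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
signal = rng.standard_normal(n)

# DFT matrix entries e^{-2*pi*i*j*k/n}, split into the real and imaginary
# parts a real-valued dense layer could hold as weights.
k = np.arange(n)
angles = -2 * np.pi * np.outer(k, k) / n
dft_real = np.cos(angles)
dft_imag = np.sin(angles)

# "Forward pass" through the two weight matrices vs. an actual FFT.
via_matrix = dft_real @ signal + 1j * (dft_imag @ signal)
via_fft = np.fft.fft(signal)

print(np.allclose(via_matrix, via_fft))  # True
```

So a frequency-domain representation isn't architecturally out of reach; the open question is only whether training pressure would actually discover it, not whether the network can express it.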
> In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
Today’s diffusion models learn representations from raw pixels, without even the concept of convolutions.
Ditto for language - as long as the architecture is 1) capable of modeling long-range dependencies and 2) can be scaled reasonably, whether you pass in tokens, individual characters, or raw ASCII bytes is irrelevant. Character-based models perform just as well as (or better than) token/word-level models at a given parameter count and training corpus size - the main reason they aren't common (yet) is memory limitations, not anything fundamental.
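For what it's worth, the byte-level alternative really is that simple - the "tokenizer" is just UTF-8 encoding, with a fixed vocabulary of at most 256 symbols and a lossless round trip:

```python
text = "Tokens aren't fundamental."

# Byte-level "tokenization": model input is just the raw UTF-8 bytes.
byte_ids = list(text.encode("utf-8"))

print(byte_ids[:5])       # [84, 111, 107, 101, 110]  ('T','o','k','e','n')
print(max(byte_ids) < 256)  # True: vocabulary never exceeds 256 symbols

# Round trip: nothing was lost going to bytes and back.
assert bytes(byte_ids).decode("utf-8") == text
```

The price, as noted, is sequence length: byte sequences are several times longer than token sequences for the same text, which is exactly the memory pressure keeping these models uncommon.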