Can you point to anyone other than yourself who calls Indonesians black? Because I think otherwise it's not worthwhile discussing categorization and measurement in good faith with you.
Since you're here: have you considered moving to other, better generalist base models in the future? Particularly Deepseek or the Mixtrals. A strong natural-language foundation is important for reasoning. Codellama is very much a compromise; it has lost some NLP ability from continued pretraining on code.
Note that we have no reason to believe that the underlying LLM inference process has suffered any setbacks. Obviously it has generated some logits. But the question is how OpenAI's servers are configured and what inference-optimization tricks they're using.
In my imagination, the operation of this server is very uniform: it just emits chunks of strings. That the content of those strings can disrupt it and trigger an edge case is what I find puzzling.
This is not so surprising if you consider the fact that finetuning is extremely sparse and barely imparts any new knowledge to the model. The paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch"[1] made this clear:
> We initially demonstrate that SFT LM (either encoder- or decoder-based) always tends to acquire excessively redundant delta parameters. To be specific, we present DARE, which randomly resets some delta parameters to zeros based on a drop rate p and subsequently scales the remaining parameters by a factor of 1/(1 − p). Despite its simplicity, with the assistance of DARE, when the LM model parameters reach 70 billion, we can eliminate up to 99% delta parameters with minimal impact on model performance (see Figure 1(a)). The more parameters the LM has, the larger p it can tolerate. This discovery suggests that SFT LM indeed learns a multitude of low-rank structures akin to LoRA [25]
Insofar as those adaptations are mostly distinct, you can just preserve both sets, and I'd guess that's what explains the success of merging.
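For intuition, here is what the DARE step boils down to. This is a minimal sketch written from the quoted description, not the paper's actual code; the PyTorch state-dict interface and the names are my own assumptions:

```python
import torch

def dare_delta(base_state: dict, finetuned_state: dict, p: float = 0.9) -> dict:
    """Sketch of DARE as quoted above: randomly drop a fraction p of the
    delta parameters (finetuned - base) and rescale the survivors by 1/(1 - p).
    Interface and names are illustrative, not from the paper's code."""
    merged = {}
    for name, base_w in base_state.items():
        delta = finetuned_state[name] - base_w                # SFT delta parameters
        keep = (torch.rand_like(delta) >= p).to(delta.dtype)  # keep each entry with prob 1 - p
        merged[name] = base_w + delta * keep / (1.0 - p)      # rescale what survives
    return merged
```

The paper's claim is that with p as high as 0.99 on 70B-scale models the result barely changes, which is exactly the sparsity point above.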
Mistral-small explicitly has the inference cost of a 12.9B model, and more than that, it's probably run with a batch size of 32 or higher. They'll worry more about offsetting training costs than about this.
> It's not even close to a 45B model. They trained 8 different fine-tunes on the same base model. This means the 8 models differ only by a couple of layers and share the rest of their layers.
No, Mixture-of-Experts is not stacking finetunes of the same base model.
The original paper by Shazeer suffices. What you are describing is possible in theory and may even have been done in practice here, but in the general case an MoE is trained from scratch, and whatever layer specializations develop are emergent rather than the product of a design choice.
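For reference, a sparsely-gated MoE layer in the spirit of Shazeer et al. looks roughly like the toy sketch below (top-k routing only; no load-balancing loss, capacity limits, or efficiency tricks, and all the sizes are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparsely-gated mixture-of-experts layer. All experts are trained
    jointly from scratch; the router learns where to send each token, so any
    'specialization' of experts is emergent rather than designed in."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                           # x: (tokens, d_model)
        gate_logits = self.router(x)                # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e               # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

The experts are just parallel FFN blocks sharing nothing beyond the gradient signal they get through the router; nothing in this setup requires (or produces) finetunes of the same base model.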
> today we have no architecture or training methodology which would allow it to be possible.
We clearly see that Mistral-7B is in some important, representative respects (eg coding) superior to Falcon-180B, and superior across the board to stuff like OPT-175B or Bloom-175B.
"Well trained" is relative. Models are, overwhelmingly, functions of their data, not just scale and architecture. Better data allows for yet-unknown performance jumps, and data curation techniques are a closely-guarded secret. I have no doubt that a 7B beating our best 60-70Bs is possible already, eg using something like Phi methods for data and more powerful architectures like some variation of universal transformer.
I mean, I 100% agree size is not everything. You can have a model which is massive but not trained well so it actually performs worse than a smaller, better/more efficiently trained model. That's why we use Llama 2 70b over Falcon-180b, OPT-175b, and Bloom-175b.
I don't know how Mistral performs on codegen specifically, but models which are finetuned for a specific use case can definitely punch above their weight class. As I stated, I'm just talking about the general case.
But so far we don't know of a 7b model (there could be a private one we don't know about) which is able to beat a modern 70b model such as Llama 2 70b. Could one have been created which is able to do that but we simply don't know about it? Yes. Could we apply Phi's technique to 7b models and be able to reach Llama 2 70b levels of performance? Maybe, but I'll believe it when we have a 7b model based on it and a human evaluation study to confirm. It's been months now since the Phi study came out and I haven't heard about any new 7b model being built on it. If it really was such a breakthrough to allow 10x parameter reduction and 100x dataset reduction, it would be dumb for these companies to not pursue it.
Comments like this are incredibly grating. You condescend to the interlocutor for making a mistake which only exists in your own mistaken world model. Your confidence that neurons and ANN weights and «pulleys and gears» are all equivalent because there is, in theory, an intention to instantiate some computation, and to think otherwise is tantamount to belief in magic and broken causality, is just confused and born out of perusing popular-scientific materials instead of relying on scientific literature or hands-on experience.
> They fire because of input.
No they do not fire because of input, they modulate their firing probability based on input, and there are different modalities of input with different effects. Neurons are self-contained biological units (descended, let me remind you, from standalone unicellular organisms, just like the rest of our cells), which actually have an independently developing internal state and even metabolic needs; they are not merely a system of logic gates even if you can approximate their role with a system of equations or an ANN. This is very different, mechanistically and teleologically. Hell, even spiking ANNs would be substantially different from currently dominant models.
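To make the mechanistic difference concrete, here's a toy leaky integrate-and-fire neuron (a standard textbook simplification, with made-up parameters). The point is the persistent internal state and the noisy threshold, which a stateless feed-forward ANN unit simply doesn't have:

```python
import random

class LIFNeuron:
    """Toy leaky integrate-and-fire neuron with a noisy threshold.
    Unlike a unit in a feed-forward ANN, it carries internal state (membrane
    potential) that persists and decays between inputs, so the same input can
    produce different outputs depending on recent history. Parameters are
    illustrative, not physiological."""
    def __init__(self, leak=0.9, threshold=1.0, noise=0.1):
        self.v = 0.0              # membrane potential: persistent internal state
        self.leak = leak          # fraction of potential retained per step
        self.threshold = threshold
        self.noise = noise        # jitter standing in for stochastic firing

    def step(self, input_current: float) -> bool:
        self.v = self.v * self.leak + input_current
        fired = self.v + random.gauss(0.0, self.noise) > self.threshold
        if fired:
            self.v = 0.0          # reset after a spike
        return fired

# The same input stream yields different spike trains on different runs,
# and the response to any input depends on what arrived before it.
neuron = LIFNeuron()
print([neuron.step(0.4) for _ in range(10)])
```

And even this is still a cartoon: it has none of the metabolic or intracellular dynamics of a real neuron.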
> So what, magic ? a soul ? If the brain is computing then the substrate is entirely irrelevant
Stop dumbing down complex arguments to some low-status culture war opinion you find it easy to dunk on.
>Your confidence that neurons and ANN weights and «pulleys and gears» are all equivalent because there is, in theory, an intention to instantiate some computation, and to think otherwise is tantamount to belief in magic and broken causality, is just confused and born out of perusing popular-scientific materials instead of relying on scientific literature or hands-on experience.
Computation is substrate independent. I'm not saying neurons and ANN weights and «pulleys and gears» are the same. I'm saying it does not matter because what you perform computation with does not change the results of the computation. If the brain computes, then it doesn't matter what is doing the computation.
>No they do not fire because of input, they modulate their firing probability based on input, and there are different modalities of input with different effects. Neurons are self-contained biological units (descended, let me remind you, from standalone unicellular organisms, just like the rest of our cells), which actually have an independently developing internal state and even metabolic needs; they are not merely a system of logic gates even if you can approximate their role with a system of equations or an ANN. This is very different, mechanistically and teleologically. Hell, even spiking ANNs would be substantially different from currently dominant models.
Yes, a neuron fires because of input. To suggest otherwise is to suggest something beyond cause and effect directing the workings of the brain. If that is genuinely not the case then feel free to explain why, rather than launching an ad hominem attack on someone you don't even know.
> So what, magic ? a soul ? If the brain is computing then the substrate is entirely irrelevant
>Stop dumbing down complex arguments to some low-status culture war opinion you find it easy to dunk on.
I personally don't care if that's what anyone believes. The intention is not to attack anyone.
If you believe in a soul or the non religious equivalent, that's fine. We just have different axioms.
If you don't believe in a soul (or the equivalent) but somehow think the substrate matters, then you need to explain why, because otherwise it makes no sense.
I am not well versed in any of this, but from reading the counterarguments, I think two good points are being made:
* Analogies aside, neurons are quite different from NN nodes, because each neuron has an incredibly complex internal cellular state, whereas an NN node just has a single scalar for state.
* A brain is not a "function" in the way that a trained LLM is. Human life is not a series of input prompts and output completions. Rather, we experience a fluid stream of stimuli, which our brain multiplexes and reacts to in a variety of ways (speaking, moving, storing memories, moving our pupils, releasing hormones, etc.). That is NOT TO SAY a brain violates causality; it's saying that the brain is mechanically doing so much more than an LLM, even if the LLM is better at raw computation.
None of this IMO precludes AGI from happening in the medium-term future, but I do think we should be careful when making comparisons between AGI and the human brain.
Rather than comparing "apples to gorillas", I'd say it's like comparing a calculator to a tree. Yes, the calculator is SIGNIFICANTLY better at multiplication, but that doesn't make it "smarter" than a tree, whatever that means.
I do not even think any of this has much of an impact on AGI timelines. Human brain cells are not a superior substrate for computing "intelligence". They just are what they are; individual cells can somewhat meaningfully "want" stuff and be quasi-agents unto themselves, and they do much more than integrate and fire. Weights in an ANN are purely terms in an equation, without any inner process or content.
> Computation is substrate independent. I'm not saying neurons and ANN weights and «pulleys and gears» are the same. I'm saying it does not matter because what you perform computation with does not change the results of the computation. If the brain computes, then it doesn't matter what is doing the computation.
It's a tautology. If the substrate did change the computation, then it wouldn't be the computation.
Claims where it isn't possible for you to be incorrect may be less impressive than they seem.
It's not a tautology. That you can go out tomorrow and buy a deluge of computers with different hardware and run the same software without change is exactly a demonstration of substrate independence.
And if one of those computers failed, it wouldn't classify as a (proper) computer.
You can move your pointer anywhere you'd like; it is ultimately tautological. Infinite regress is a bitch lol
Say, have you taken into consideration the role consciousness and culture are playing here? Like this "reality" you are describing, do you know what the actual, biological/scientific source of it is? :) But now I'm kind of cheating, aren't I...I think we're not supposed to say that part out loud! ;)
There are views where there is an implicit substrate that exists in another layer of reality, ie. dualism. That layer is generally not counted among the substrate. So a computation can be substrate dependent by introducing a non-material cause. (Disclaimer: I don't personally know anyone who believes anything like this, so this may be a bad paraphrasing.)
It's not dumbing down. It's extracting the crux of the matter that the complexity of the arguments is hiding, perhaps unintentionally. Either the brain implements a function that can be approximated by a neural network thanks to the universal approximation theorem, or the function cannot be approximated (and you need arguments for why that is the case), or it's magic.
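For what it's worth, the approximation half of that trilemma is easy to demonstrate at toy scale. The sketch below (arbitrary widths and hyperparameters, nothing about brains) fits a one-hidden-layer MLP to sin(x):

```python
import torch
import torch.nn as nn

# Toy illustration of universal approximation: a one-hidden-layer MLP fit to
# sin(x) on [-pi, pi]. Width, learning rate, and step count are arbitrary.
torch.manual_seed(0)
x = torch.linspace(-torch.pi, torch.pi, 256).unsqueeze(1)
y = torch.sin(x)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.6f}")  # ends up very small on this toy task
```

Of course the theorem says nothing about how large the network has to be or how much data and compute the training takes.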
This is technically true but kind of misses the point, in my opinion. A neural network can approximate any function in theory, but that doesn't mean it can do so in a reasonable amount of time and with a reasonable amount of resources. For example, take the function that gives you the prime factors of an integer. It is theoretically possible for a neural network to approximate this over an arbitrarily large but fixed input window, yet it is believed to be infeasible to compute efficiently on current hardware. In theory, a quantum computer could compute this much faster.
This is not to say that the human brain leverages quantum effects. It's just a well known example where the hardware and a specific algorithm can be shown to matter.
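To be concrete about "the function" in question, here it is as ordinary classical code (trial division, purely illustrative). It's exact, but its cost blows up with the size of the input, which is the feasibility point:

```python
def prime_factors(n: int) -> list[int]:
    """Map an integer to its list of prime factors by trial division.
    Fine for small n; this naive version does ~sqrt(n) work, and even the best
    known classical algorithms scale super-polynomially in the number of digits,
    which is why factoring large integers is believed to be infeasible
    classically (and why Shor's algorithm matters)."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(prime_factors(2023))  # [7, 17, 17]
```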
I also think it's strange to describe the brain as implementing a function. Functions don't exist. We made them up to help us think about building useful circuits (among other things). In this scenario, we would be implementing functions to help us simulate what is going on in brains.
> is just confused and born out of perusing popular-scientific materials instead of relying on scientific literature or hands-on experience.
I suspect there's some protection of a fundamental metaphysical framework in play here; the sort of language being used is pretty common, and I believe it to be a learned cultural behavior (from consuming similar arguments).
…ETH Zurich is an illustrious research university that often cooperates with Deepmind and other hyped groups, they're right there at the frontier too, and have been for a very long time. They don't have massive training runs on their own but pound for pound I'd say they have better papers.
ETH Zurich is one of the top labs in the world. Disney Research also works with them a lot. Another "sleeper" is the University of Amsterdam, which has rockstars like Max Welling and his students Kingma, Salimans, van den Berg, and Hoogeboom.
It's easy to get hyped up on the big tech labs because they have the most compute, but the best papers come from smaller labs, which unfortunately have lately faced larger challenges in getting published. It's the smaller works that create the foundations that end up in these giant models. ML is in a really weird space right now.