This is cool, and timely (I wanted a neat repo like that).
I have also been working for the last 2 weeks on a GPT implementation in C. It eventually turned out to be really slow (without CUDA), but it taught me how much memory management and data management there is when implementing these systems. You are running a loop billions of times, so you need to preallocate the computational graph and so on. If anyone wants to check it out, it's ~1500 LOC in a single file: https://github.com/attentionmech/gpt.c/blob/main/gpt.c
The next level down is to do it directly in numpy.
And then from there, write a minimal numpy work-a-like to support the model above.
You start with a working system using the most powerful abstractions. Then you iteratively remove abstractions, lowering your solution; when you get low enough but are still riding on an external abstraction, you rewrite that abstraction, but ONLY to support the layers above you.
Following the above pattern, you can bootstrap yourself to full system understanding. This is not unlike the RL+distillation process humans go through to learn complex topics.
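To make "do it directly in numpy" concrete, here is a minimal sketch of a single causal self-attention head using nothing but explicit matmuls; the shapes and names are illustrative, not taken from the repo:

    import numpy as np

    # one causal self-attention head, NumPy only; x is (T, d)
    def attention_head(x, Wq, Wk, Wv):
        q, k, v = x @ Wq, x @ Wk, x @ Wv                       # linear projections
        scores = q @ k.T / np.sqrt(k.shape[-1])                # (T, T) similarities
        scores += np.triu(np.full(scores.shape, -1e9), k=1)    # causal mask: no peeking ahead
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
        return weights @ v                                     # weighted sum of values

    # usage: d = 16; x = np.random.randn(10, d)
    #        out = attention_head(x, *(np.random.randn(d, d) for _ in range(3)))

Replacing those @ calls with your own matmul is then the "minimal numpy work-a-like" step.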
NumPy can use the chipmaker's BLAS (Intel MKL or AMD's BLIS fork). Trying to replace it could be a good academic exercise, but I think most people wisely leave that to the vendors.
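For the curious, you can see which BLAS your NumPy build delegates to (the output varies by install):

    import numpy as np

    np.show_config()    # prints the BLAS/LAPACK libraries NumPy was built against

    # the heavy lifting in a matmul like this is handed off to that BLAS:
    a = np.random.rand(1024, 1024).astype(np.float32)
    b = np.random.rand(1024, 1024).astype(np.float32)
    c = a @ b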
Where do you get that? He is postulating the external abstraction you are using has more features than you use. He is saying implement only the parts you use.
Can someone help me understand what I’m looking at here? This repository allows me to train a specific model on a specific data set, and finally test the result? Is that correct?
I am interested in how large and small language models are trained, but as someone who has little knowledge in this world I find it hard to cut through the noise to find useful information.
Really I’m looking for an open source project that helps a person gain this knowledge. Something like a Docker container that encapsulates all the dependencies. When training, it would use any available GPU (or tell me why my GPU can’t be used and fall back to CPU). Then it would have a simple interface to test the training results. Finally, you could easily pull back the curtain to understand the process in better detail and maybe even adapt it to a different model to experiment.
https://course.fast.ai/ is the best. From their site:
"
A free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems.
"
If you are looking for something that actually explains most concepts behind it, Karpathy’s series will teach you what you want to know. (Mentioned elsewhere)
If you are looking for command line tools to fine tune and evaluate models on known datasets, this article is a good take!
As opposed to inference (like generating text and images), training involves heavier math (typically fp16 or bf16 mixed precision), and a single CPU generally won't cut it.
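For a taste of what the fp16/bf16 part looks like in PyTorch, here is a toy sketch with a stand-in linear "model" (not the repo's actual training loop):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # toy stand-ins so the snippet runs; a real run would use the repo's GPT model
    model = nn.Linear(64, 100).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    x = torch.randn(8, 64, device="cuda")
    y = torch.randint(0, 100, (8,), device="cuda")

    # one bf16 mixed-precision training step
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                  # forward matmuls run in bf16
        loss = F.cross_entropy(logits, y)
    loss.backward()                        # weights and grads stay in fp32
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)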
The prepare/train/generate instructions in the GitHub repo linked are pretty much it for the 'how' of training a model. You give it a task, it grinds through a billion trillion epochs, and it saves checkpoints incrementally (or not).
Training a LoRA for an image model may be more approachable; there are more blog posts etc. on this, and the process is largely similar, except you're doing it for a small low-rank slice of the weights instead of the whole network.
[edit] I'm also learning so correct me if I'm off, hn!
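For what it's worth, the core LoRA idea fits in a few lines: freeze the pretrained weight and learn a low-rank correction next to it. A sketch, not any particular library's implementation:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                  # frozen pretrained weights
            self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
            self.B = nn.Parameter(torch.zeros(r, base.out_features))
            self.scale = alpha / r

        def forward(self, x):
            # original output plus a trainable low-rank update
            return self.base(x) + (x @ self.A) @ self.B * self.scale

    # usage: wrapped = LoRALinear(nn.Linear(512, 512))

Only A and B get gradients, which is why the fine-tune is so much lighter than touching the whole network.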
Do you have a good theoretical foundation in ML? You will also need some linear algebra.
If not, I would invest the time in a decent course; there are plenty online, even offline if you are close enough to where one is offered. I took one from Andrew Ng on Coursera years ago, which used MATLAB. There are probably much better, more up-to-date options now, especially with LLMs so in vogue. The fundamentals, such as gradient descent, ANNs, and back-propagation, are still relevant, however, and haven't changed much.
Trying to understand what code is doing without that foundation will be an exercise in futility.
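If "gradient descent" is the sticking point, the whole idea fits in a toy loop: follow the negative slope downhill.

    # minimize f(w) = (w - 3)^2 by repeatedly stepping against the gradient
    w, lr = 0.0, 0.1
    for _ in range(100):
        grad = 2 * (w - 3)    # df/dw
        w -= lr * grad        # take a small step downhill
    print(w)                  # converges to ~3.0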
nanoGPT is awesome (and I highly recommend his videos on it), but it’s closer to a direct reproduction of GPT-2, so it’s cool to have a really clean implementation of some newer ideas.
I have made my own implementation from scratch with my own multi-channel tokeniser: each channel gets its own embedding table (sizes 32768, 256, 256, 64, and 4), and these are summed along with the position encoding.
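A sketch of that summing scheme (the channel sizes are the ones mentioned above; the module name and the n_embd/block_size dimensions are made up for illustration):

    import torch
    import torch.nn as nn

    class MultiChannelEmbedding(nn.Module):
        def __init__(self, channel_sizes=(32768, 256, 256, 64, 4), n_embd=512, block_size=256):
            super().__init__()
            self.channels = nn.ModuleList(nn.Embedding(v, n_embd) for v in channel_sizes)
            self.pos = nn.Embedding(block_size, n_embd)

        def forward(self, idx):               # idx: (batch, seq, n_channels) of token ids
            b, t, c = idx.shape
            x = sum(emb(idx[..., i]) for i, emb in enumerate(self.channels))
            return x + self.pos(torch.arange(t, device=idx.device))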
Yet with all of those differences, my stories have Lily as a protagonist often enough that I thought I had a bug somewhere.
Might have to check tinystories for name distribution.
Most questionable output from mine so far:
"one day, a naughty man and a little boy went to the park place to find some new things."
Llama2 inference can be implemented in 900-ish lines of dependency-free C89, with no code golfing[1]. More modern architectures (at least the dense, non-MoE models) aren't that much more complicated.
That code is CPU only, uses float32 everywhere and doesn't do any optimizations, so it's not realistically usable for models beyond 100m params, but that's how much it takes to run the core algorithm.
A minimal hardcoded definition of the structure: probably a few hundred lines.
The actual definition, including reusable components, optional features, and flexibility for experimentation: probably a few thousand.
The code needed to train the model, including all the data pipelines and management, training framework, optimization tricks, etc.: tens of thousands.
The whole codebase, including experiments, training/inference monitoring, modules that didn't make it into the final architecture, unit tests, and all custom code written to support everything mentioned so far: hundreds of thousands.
I noticed several people mentioned Karpathy already, but I wanted to add that his tiny "micrograd" project (see the YouTube video and the GitHub repo) is a great introduction to neural nets (the multilayer perceptron), which is at the core of [most] machine learning, of course.
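The flavor of micrograd, for anyone deciding whether to watch: a scalar value that remembers how it was produced, so gradients can flow back through + and *. This is a paraphrase of the idea, not Karpathy's actual code:

    class Value:
        def __init__(self, data, _parents=()):
            self.data, self.grad = data, 0.0
            self._parents, self._backward = _parents, lambda: None

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def _backward():
                self.grad += out.grad
                other.grad += out.grad
            out._backward = _backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def _backward():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = _backward
            return out

        def backward(self):
            # topological sort, then apply the chain rule from the output backwards
            topo, seen = [], set()
            def build(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        build(p)
                    topo.append(v)
            build(self)
            self.grad = 1.0
            for v in reversed(topo):
                v._backward()

    a, b = Value(2.0), Value(-3.0)
    loss = a * b + a
    loss.backward()
    print(a.grad, b.grad)   # -2.0, 2.0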
So, this has nothing to do with "SmolLM" - a set of models (with data, training recipes, etc) released by HuggingFace? https://huggingface.co/blog/smollm
The best part is that you can debug and step through it in the browser dev tools: https://youtube.com/watch?v=cXKJJEzIGy4 (100 second demo). Every single step is in plain vanilla client-side JavaScript (even the matrix multiplications). You don't need Python, etc. Heck, you don't even have to leave your browser.
I recently did an updated version of my talk with it for JavaScript developers here: https://youtube.com/watch?v=siGKUyTk9M0 (52 min). That should give you a basic grounding on what's happening inside a Transformer.
This is a faithful reproduction of the original Transformer paper [1], except that these days we use trainable parameters for the positional embedding; the paper used a static sine/cosine calculation for it.
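The two options side by side, roughly (the dimensions are example values):

    import math
    import torch

    block_size, n_embd = 256, 512   # example dimensions

    # the paper's static sine/cosine table: computed once, never trained
    def sinusoidal_positions(block_size, n_embd):
        pos = torch.arange(block_size, dtype=torch.float32).unsqueeze(1)     # (T, 1)
        div = torch.exp(torch.arange(0, n_embd, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / n_embd))                     # (n_embd/2,)
        pe = torch.zeros(block_size, n_embd)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    # the GPT-style alternative: just another embedding table, learned like any other weight
    learned_positions = torch.nn.Embedding(block_size, n_embd)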
Figure 1 in the paper can be seen implemented in the forward() method of the GPT class in model.py. Here are the rough steps (a condensed sketch follows the list):
1. Tokens are embedded using a nn.Embedding layer.
2. Positions (0..T-1) are embedded using a second nn.Embedding layer.
3. The two embedding values are added to make the input x.
4. A sequence of N transformer blocks is then executed. This is the grey box on the left of Figure 1, and it is where all the magic happens, chiefly in the self-attention calculation. You can see this in the forward() method of the CausalSelfAttention class.
5. A regular nn.Linear layer (the language-model head) projects to vocabulary-sized logits.
6. Finally, the loss over the output token predictions is computed with F.cross_entropy (the softmax shown in the figure is applied inside it).
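Put together, the steps above look roughly like this. It is a condensed sketch in the style of nanoGPT-like code; the exact attribute names are my assumptions, not the repo's literal source:

    import torch
    import torch.nn.functional as F

    def forward(self, idx, targets=None):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)

        tok_emb = self.transformer.wte(idx)     # 1. token embedding, (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)     # 2. positional embedding, (t, n_embd)
        x = tok_emb + pos_emb                   # 3. summed input

        for block in self.transformer.h:        # 4. N transformer blocks (attention + MLP)
            x = block(x)
        x = self.transformer.ln_f(x)

        logits = self.lm_head(x)                # 5. linear head -> vocab-sized logits
        loss = None
        if targets is not None:                 # 6. softmax happens inside cross_entropy
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss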
I hope this helps a little. Please feel free to suggest improvements and additions.