This is cool, and timely (I wanted a neat repo like that).
I have also been working for the last 2 weeks on a GPT implementation in C. It eventually turned out to be really slow (without CUDA), but it taught me how much memory management and data management there is when implementing these systems. You are running a loop billions of times, so you need to preallocate the computational graph and so on. If anyone wants to check it out, it's ~1500 LOC in a single file: https://github.com/attentionmech/gpt.c/blob/main/gpt.c
The next level down is to do it directly in numpy.
And then from there, write a minimal numpy work-a-like to support the model above.
You start with a working system using the most powerful abstractions. Then you iteratively remove abstractions, lowering your solution; when you get low enough but are still riding on an external abstraction, you rewrite that abstraction, but ONLY to support the layers above you.
Following the above pattern, you can bootstrap yourself to full system understanding. This is not unlike the RL+distillation process humans go through to learn complex topics.
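To make "do it directly in numpy" concrete, here is a minimal sketch of a single causal self-attention head using nothing but explicit matmuls; the shapes and names are illustrative, not taken from the repo:

    import numpy as np

    # one causal self-attention head, NumPy only; x is (T, d)
    def attention_head(x, Wq, Wk, Wv):
        q, k, v = x @ Wq, x @ Wk, x @ Wv                       # linear projections
        scores = q @ k.T / np.sqrt(k.shape[-1])                # (T, T) similarities
        scores += np.triu(np.full(scores.shape, -1e9), k=1)    # causal mask: no peeking ahead
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
        return weights @ v                                     # weighted sum of values

    # usage: d = 16; x = np.random.randn(10, d)
    #        out = attention_head(x, *(np.random.randn(d, d) for _ in range(3)))

Replacing those @ calls with your own matmul is then the "minimal numpy work-a-like" step.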
NumPy can use the chipmaker's BLAS (Intel MKL or AMD's BLIS fork). Trying to replace it could be a good academic exercise, but I think most people wisely leave that to the vendors.
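For the curious, you can see which BLAS your NumPy build delegates to (the output varies by install):

    import numpy as np

    np.show_config()    # prints the BLAS/LAPACK libraries NumPy was built against

    # the heavy lifting in a matmul like this is handed off to that BLAS:
    a = np.random.rand(1024, 1024).astype(np.float32)
    b = np.random.rand(1024, 1024).astype(np.float32)
    c = a @ b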
Where do you get that? He is postulating the external abstraction you are using has more features than you use. He is saying implement only the parts you use.
Can someone help me understand what I’m looking at here? This repository allows me to train a specific model on a specific data set, and finally test the result? Is that correct?
I am interested in how large and small language models are trained, but as someone who has little knowledge in this world I find it hard to cut through the noise to find useful information.
Really I’m looking for an open source project that helps a person gain this knowledge. Something like a Docker container that encapsulates all the dependencies. When training, it would use any available GPU (or tell me why my GPU can’t be used and fall back to CPU). Then it would have a simple interface to test the training results. Finally, you could easily pull back the curtain to understand the process in better detail and maybe even adapt it to a different model to experiment.
https://course.fast.ai/ is the best. From their site:
"
A free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems.
"
If you are looking for something that actually explains most concepts behind it, Karpathy’s series will teach you what you want to know. (Mentioned elsewhere)
If you are looking for command line tools to fine tune and evaluate models on known datasets, this article is a good take!
As opposed to inference (like generating text and images), training involves heavier math (typically fp16 or bf16 mixed precision), and a single CPU generally won't cut it.
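For a taste of what the fp16/bf16 part looks like in PyTorch, here is a toy sketch with a stand-in linear "model" (not the repo's actual training loop):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # toy stand-ins so the snippet runs; a real run would use the repo's GPT model
    model = nn.Linear(64, 100).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    x = torch.randn(8, 64, device="cuda")
    y = torch.randint(0, 100, (8,), device="cuda")

    # one bf16 mixed-precision training step
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)                  # forward matmuls run in bf16
        loss = F.cross_entropy(logits, y)
    loss.backward()                        # weights and grads stay in fp32
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)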
The prepare/train/generate instructions in the GitHub repo linked are pretty much it for the 'how' of training a model. You give it a task, it grinds through a billion trillion epochs, and it saves checkpoints incrementally (or not).
Training a LoRA for an image model may be more approachable; there are more blog posts etc. on this, and the process is largely similar, except you're doing it for a small low-rank slice of the weights instead of the whole network.
[edit] I'm also learning so correct me if I'm off, hn!
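For what it's worth, the core LoRA idea fits in a few lines: freeze the pretrained weight and learn a low-rank correction next to it. A sketch, not any particular library's implementation:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                  # frozen pretrained weights
            self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
            self.B = nn.Parameter(torch.zeros(r, base.out_features))
            self.scale = alpha / r

        def forward(self, x):
            # original output plus a trainable low-rank update
            return self.base(x) + (x @ self.A) @ self.B * self.scale

    # usage: wrapped = LoRALinear(nn.Linear(512, 512))

Only A and B get gradients, which is why the fine-tune is so much lighter than touching the whole network.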
Do you have a good theoretical foundation in ML? You will also need some linear algebra.
If not, I would invest the time in a decent course; there are plenty online, even offline if you are close enough to where one is offered. I took one from Andrew Ng on Coursera years ago, which used MATLAB. There are probably much better, more up-to-date options now, especially with LLMs so in vogue. The fundamentals, such as gradient descent, ANNs, and back-propagation, are still relevant, however, and haven't changed much.
Trying to understand what code is doing without that foundation will be an exercise in futility.
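If "gradient descent" is the sticking point, the whole idea fits in a toy loop: follow the negative slope downhill.

    # minimize f(w) = (w - 3)^2 by repeatedly stepping against the gradient
    w, lr = 0.0, 0.1
    for _ in range(100):
        grad = 2 * (w - 3)    # df/dw
        w -= lr * grad        # take a small step downhill
    print(w)                  # converges to ~3.0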
nanoGPT is awesome (and I highly recommend his videos on it), but it’s closer to a direct reproduction of GPT-2, so it’s cool to have a really clean implementation of some newer ideas.
I have made my own implementation from scratch with my own multi-channel tokeniser: each channel gets its own embedding table (sizes 32768, 256, 256, 64, and 4), and these are summed along with the position encoding.
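A sketch of that summing scheme (the channel sizes are the ones mentioned above; the module name and the n_embd/block_size dimensions are made up for illustration):

    import torch
    import torch.nn as nn

    class MultiChannelEmbedding(nn.Module):
        def __init__(self, channel_sizes=(32768, 256, 256, 64, 4), n_embd=512, block_size=256):
            super().__init__()
            self.channels = nn.ModuleList(nn.Embedding(v, n_embd) for v in channel_sizes)
            self.pos = nn.Embedding(block_size, n_embd)

        def forward(self, idx):               # idx: (batch, seq, n_channels) of token ids
            b, t, c = idx.shape
            x = sum(emb(idx[..., i]) for i, emb in enumerate(self.channels))
            return x + self.pos(torch.arange(t, device=idx.device))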
Yet with all of those differences, my stories have Lily as a protagonist often enough that I thought I had a bug somewhere.
Might have to check tinystories for name distribution.
Most questionable output from mine so far:
"one day, a naughty man and a little boy went to the park place to find some new things."
Llama2 inference can be implemented in 900-ish lines of dependency-free C89, with no code golfing[1]. More modern architectures (at least the dense, non-MoE models) aren't that much more complicated.
That code is CPU only, uses float32 everywhere and doesn't do any optimizations, so it's not realistically usable for models beyond 100m params, but that's how much it takes to run the core algorithm.
A minimal hardcoded definition of the structure: probably a few hundred lines.
The actual definition, including reusable components, optional features, and flexibility for experimentation: probably a few thousand.
The code needed to train the model, including all the data pipelines and management, training framework, optimization tricks, etc.: tens of thousands.
The whole codebase, including experiments, training/inference monitoring, modules that didn't make it into the final architecture, unit tests, and all custom code written to support everything mentioned so far: hundreds of thousands.
I noticed several people mentioned Karpathy already, but I wanted to add that his tiny "micrograd" project (see the YouTube video and the GitHub repo) is a great introduction to neural nets (the multilayer perceptron), which is at the core of [most] machine learning, of course.
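The flavor of micrograd, for anyone deciding whether to watch: a scalar value that remembers how it was produced, so gradients can flow back through + and *. This is a paraphrase of the idea, not Karpathy's actual code:

    class Value:
        def __init__(self, data, _parents=()):
            self.data, self.grad = data, 0.0
            self._parents, self._backward = _parents, lambda: None

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def _backward():
                self.grad += out.grad
                other.grad += out.grad
            out._backward = _backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def _backward():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = _backward
            return out

        def backward(self):
            # topological sort, then apply the chain rule from the output backwards
            topo, seen = [], set()
            def build(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        build(p)
                    topo.append(v)
            build(self)
            self.grad = 1.0
            for v in reversed(topo):
                v._backward()

    a, b = Value(2.0), Value(-3.0)
    loss = a * b + a
    loss.backward()
    print(a.grad, b.grad)   # -2.0, 2.0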
So, this has nothing to do with "SmolLM" - a set of models (with data, training recipes, etc) released by HuggingFace? https://huggingface.co/blog/smollm
The best part is that you can debug and step through it in the browser dev tools: https://youtube.com/watch?v=cXKJJEzIGy4 (100 second demo). Every single step is in plain vanilla client-side JavaScript (even the matrix multiplications). You don't need Python, etc. Heck, you don't even have to leave your browser.
I recently did an updated version of my talk with it for JavaScript developers here: https://youtube.com/watch?v=siGKUyTk9M0 (52 min). That should give you a basic grounding on what's happening inside a Transformer.
This is a faithful reproduction of the original Transformer paper [1], except that these days we use trainable parameters for the positional embedding; the paper used a static sine/cosine calculation for it.
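The two options side by side, roughly (the dimensions are example values):

    import math
    import torch

    block_size, n_embd = 256, 512   # example dimensions

    # the paper's static sine/cosine table: computed once, never trained
    def sinusoidal_positions(block_size, n_embd):
        pos = torch.arange(block_size, dtype=torch.float32).unsqueeze(1)     # (T, 1)
        div = torch.exp(torch.arange(0, n_embd, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / n_embd))                     # (n_embd/2,)
        pe = torch.zeros(block_size, n_embd)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    # the GPT-style alternative: just another embedding table, learned like any other weight
    learned_positions = torch.nn.Embedding(block_size, n_embd)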
Figure 1 in the paper can be seen implemented in the forward() method of the GPT class in model.py. Here are the rough steps (a condensed sketch follows the list):
1. Tokens are embedded using a nn.Embedding layer.
2. Positions (0..T-1) are embedded using a second nn.Embedding layer.
3. The two embedding values are added to make the input x.
4. A sequence of N transformer blocks is then executed. This is the grey box on the left of Figure 1, and it is where all the magic happens, chiefly in the self-attention calculation. You can see this in the forward() method of the CausalSelfAttention class.
5. A regular nn.Linear layer (the language-model head) projects to vocabulary-sized logits.
6. Finally, the loss over the output token predictions is computed with F.cross_entropy (the softmax shown in the figure is applied inside it).
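Put together, the steps above look roughly like this. It is a condensed sketch in the style of nanoGPT-like code; the exact attribute names are my assumptions, not the repo's literal source:

    import torch
    import torch.nn.functional as F

    def forward(self, idx, targets=None):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)

        tok_emb = self.transformer.wte(idx)     # 1. token embedding, (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)     # 2. positional embedding, (t, n_embd)
        x = tok_emb + pos_emb                   # 3. summed input

        for block in self.transformer.h:        # 4. N transformer blocks (attention + MLP)
            x = block(x)
        x = self.transformer.ln_f(x)

        logits = self.lm_head(x)                # 5. linear head -> vocab-sized logits
        loss = None
        if targets is not None:                 # 6. softmax happens inside cross_entropy
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss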
I hope this helps a little. Please feel free to suggest improvements and additions.