This looks fantastic. Will try replacing our current fine-tuned FLAN-UL2 model with this.
I wonder how the devtooling around this will evolve. Seems like a matter of days until someone creates a GUI wrapper around this, and obviates the need to use programmer time for fine-tuning
I'm curious, what are the differences between T5, Flan-T5, and Flan-UL2 for fine-tuning? Does the instruction tuning matter at all, once you're fine-tuning?
Low-rank adaptation (LoRA) ... has some advantages over previous methods:
- It is faster and uses less memory, which means it can run on consumer hardware.
- The output is much smaller (megabytes, not gigabytes).
- You can combine multiple fine-tuned models together at runtime.
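To make the first two points concrete, here's a minimal sketch of what the setup looks like with the Hugging Face peft library; the model path and hyperparameters are placeholders, not the exact alpaca-lora script:

```python
# Minimal LoRA setup sketch (placeholder model path and hyperparameters).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder

config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # which weight matrices get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()
# Typically reports that only a tiny fraction (on the order of 0.1%) of the
# parameters are trainable, which is where the small output files and lower
# memory pressure come from.
```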
This is great news for my dream of building a fine-tuned interactive messenger that can deliver a message on my behalf, trained on my personality and the information I want to convey.
And then you hook it up to hundreds of dating apps and it just does the boring job of making introductory chat, presenting you at the end with only the women who are interested in a real date.
My friend was doing this with mIRC scripts on our city's DALnet room in 2001. His bot hit up random users, asked a/s/l, and if they were female, it would announce via text-to-speech: "BabyGurl18 is female".
So we'd be watching a movie, he'd have his computer speakers turned up in his room, his bot would make an announcement, and he'd get up and leave, already in conversation.
When you consider what will happen when everyone on dating apps is an LLM, you see the issue: in the end, everyone is "interested" in everyone who swiped right.
Maybe software running on your PC captures everything you type, a voice transcriber isolates your voice specifically and records that, and you've got a dataset that covers a lot of who you are.
Fine-tune a model on that and boom, you're "immortal", and as LLMs get better and better, the fidelity of "you" gets better and better.
No downsides or fun for yourself; the main use I could see for it would be something like being able to "talk" to your great-great-great-grandpa one day.
I suppose that is the one remaining upside for the other person when it comes to immortality. But for the individual, none of the immortality fun remains. You aren't meeting or talking to them as you would in "normal" immortality; you are still dead and don't know if they even exist, and you don't even get to ensure the right things are passed down.
LoRA has actually been around for a little while! I first saw it when it became popular for fine-tuning models quantized down to about 8 bits or so. I'm sure it's doing stuff in the 4-bit range now! :D
I believe it's a core piece of the toolbox required to really push the limits of LLMs, either in original training or in inference, somewhat like batch norm was for convolutional neural networks. I look forward to seeing how this will be applied in the future.
The easiest way to run Alpaca-LoRA locally is with this little-known fork [1] that uses Docker. You'll be up and running in about 20 minutes with pretty much any modern consumer Nvidia GPU.
Hi All, I have a noob question.
I have been reading about Alpaca and Alpaca-LoRA. I have a use case in which I want to fine-tune/train Alpaca-LoRA on a large corpus of books in txt format. I know that for Alpaca the data was in an "Instruction: Prompt" format; however, my text is huge and is not in that format. It's simply a library of books and journal articles. I want to be able to ask a question and have the model answer based on the books I trained it on. I also want to be able to ask general questions, for example which books discussed topic x or y.
I have tried OpenAI's API to create embeddings, but I want to use Alpaca.
Has anybody made a LLaMA/Alpaca Erebus model? I read about them in the oobabooga docs, and a locally-run language model fine-tuned on Literotica could be the funniest thing I've ever seen.
NVIDIA stated recently that GPT bots will become one million times more powerful in ten years. Many people doubted that.
With LoRA, I see a much higher improvement. These guys claim a 10,000× reduction in the number of trainable parameters. A different way to look at it is that with current hardware you can train a model that has 10,000× more parameters. If you add a 100× improvement in hardware over 10 years (not at all unrealistic), that's the million. But we will have significant improvements in training methods too.
You don't need 10,000× more data to train a model with 10,000× more parameters. You can use the same data and the model will perform better. Much better. The problem is that currently it is nearly impossible to run and train a model with that many more parameters.
ChatGPT has nearly 200 billion parameters. For GPT-4 we don't know; there are rumors that it has 100 trillion parameters, but they are probably unfounded. In any case, we've seen how much more powerful GPT-4 is. Imagine a GPT-5 with 1 quadrillion parameters.
And then imagine that after you've trained it to some reasonable level, you "downsample" the parameters, using the SVD approach described in LoRA, and get a GPT-5 with the same 200 billion parameters as ChatGPT, but with many, many times more power, even than GPT-4.
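For what it's worth, here is a toy sketch of what rank-truncating a single weight matrix with SVD actually looks like. This is only an assumption about what "downsample" means here, and it is not what LoRA itself does: LoRA trains a low-rank update on top of frozen weights rather than compressing the pre-trained weights.

```python
import torch

# Toy illustration of low-rank truncation of one weight matrix via SVD.
# NOT LoRA's procedure; it just shows the general "low intrinsic rank" idea.
W = torch.randn(4096, 4096)                 # a dense layer's weights
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

r = 16                                      # keep only the top-r singular directions
W_lowrank = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

full_params = W.numel()                                     # 4096 * 4096 ≈ 16.8M
lowrank_params = U[:, :r].numel() + r + Vh[:r, :].numel()   # ≈ 131K
print(full_params, lowrank_params)
```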
They have already ingested most of the internet and all popular and semi-popular books. If your plan doesn't involve every person on Earth wearing a GoPro and uploading the footage to OpenAI, you will have difficulty finding 9,999 more internets and Libraries of Congress.
The training process is modifying the network weights. These are usually written to new checkpoint files instead of overwriting the original (because what if the loss is actually worse after an epoch of training?).
But there's nothing stopping inference from occurring on a model that is being trained.
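A toy sketch of that pattern in plain PyTorch (the tiny linear model just stands in for the real network):

```python
import copy
import torch
import torch.nn as nn

# Checkpoint every epoch to a new file instead of overwriting, and run
# inference on a frozen snapshot while training continues.
model = nn.Linear(16, 1)                      # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data, targets = torch.randn(64, 16), torch.randn(64, 1)

best_loss = float("inf")
for epoch in range(3):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(data), targets)
    loss.backward()
    optimizer.step()

    # New checkpoint file per epoch; old ones stay around in case loss regresses.
    torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")
    if loss.item() < best_loss:
        best_loss = loss.item()
        torch.save(model.state_dict(), "checkpoint_best.pt")

    # Nothing stops inference on a frozen copy of the in-training weights:
    snapshot = copy.deepcopy(model).eval()
    with torch.no_grad():
        _ = snapshot(torch.randn(1, 16))
```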
LoRA is an alternative to traditional fine tuning (which is usually done on specific layers as you mentioned).
To quote the LoRA paper[1]:
> We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen
It's truly revolutionary: it basically lets you create a very small "diff" which you apply to an existing model and it is suddenly fine-tuned. These diff models are very small (5M for example).
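With the Hugging Face peft library, for instance, that "diff" is just a small adapter folder you load on top of the frozen base model. A rough sketch (the paths are placeholders):

```python
# Sketch of applying a LoRA "diff" to a base model with the peft library.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")      # placeholder
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")      # the tiny "diff"

# Optionally fold the low-rank update into the base weights for plain inference:
merged = model.merge_and_unload()
```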
Or “multiplex” fine-tuning with inference, i.e. do fine-tuning for 100 ms, inference for 100 ms, then tuning again… etc.
Btw, is there a way to combine two or more models?
So for example, if I create 5 copies of a model and fine-tune each copy on a different dataset, can the 5 fine-tuned copies be merged together somehow to create a model that has the learning of all 5?
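With LoRA this is at least mechanically possible, because each fine-tune is just a low-rank delta on top of the same frozen base, so you can fold several deltas (optionally weighted) into one copy of the weights; whether the merged model really keeps the learning of all 5 is an empirical question. A conceptual sketch in plain PyTorch, not any particular library's API:

```python
import torch

# Each LoRA fine-tune i contributes a low-rank update delta_i = B_i @ A_i
# on top of the same frozen base weight W. Merging "models" then amounts
# to summing (optionally weighting) the deltas into one weight matrix.
N, r = 4096, 16
W = torch.randn(N, N)                                  # shared frozen base weight

adapters = [(torch.randn(N, r) * 0.01, torch.randn(r, N) * 0.01)
            for _ in range(5)]                         # 5 fine-tuned (B, A) pairs
weights = [0.2] * 5                                    # how much of each to blend in

W_merged = W.clone()
for w, (B, A) in zip(weights, adapters):
    W_merged += w * (B @ A)                            # fold each low-rank delta in
```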
So they use cog before installing it? Apparently this wasn’t proofread.
Also, is it just me, or are there currently more ways to run LLMs on a CPU than on a GPU springing up on GitHub? I have hacked together my own, but my chat UI is awful, so what is the nicest, pre-packaged, CUDA-friendly way to run this now?
How does LoRA save more than 50% of the memory usage? I see that the weight updates have a much lower memory footprint by virtue of being low rank. But you still need the dense weights for the forward pass, don't you?
I'm not an expert, but I believe it only saves memory in the final model, after training is done, by merging the low-rank LoRA adapter matrices with the original weight matrices.
For example, if an original layer has N inputs and N outputs (an NxN weight matrix), LoRA adds a 16xN matrix and an Nx16 matrix alongside it (for rank 16), trains only those new matrices, and finally multiplies the two of them together into an NxN update that gets added back into the original weights, so you end up with a single NxN matrix again.
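That said, there is a training-time saving too: the frozen dense weights still have to be there for the forward pass, but gradients and optimizer state are only kept for the tiny LoRA matrices. Back-of-the-envelope for one layer, ignoring activations (rank 16, N = 4096, Adam-style optimizer state assumed):

```python
# Parameter/tensor counts for one NxN layer with a rank-16 LoRA.
# Assumes an Adam-style optimizer keeping 2 extra tensors per trainable parameter.
N, r = 4096, 16

dense_params = N * N            # 16,777,216 — frozen, needed for the forward pass
lora_params = N * r + r * N     # 131,072 — the only trainable weights (~0.8%)

# Full fine-tuning: weights + gradients + 2 optimizer tensors, all dense.
full_ft_floats = dense_params * (1 + 1 + 2)
# LoRA: dense weights (forward only) + grads/optimizer state for LoRA matrices only.
lora_floats = dense_params + lora_params * (1 + 1 + 2)

print(lora_params / dense_params)    # ~0.0078
print(lora_floats / full_ft_floats)  # ~0.26 — roughly a quarter of the footprint
```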
It feels like I'm living in a cartoon with all these terms:
> In this blog post, we’ll show you how to use LoRA to fine-tune LLaMA using Alpaca training data.