LoRA: finetuning giant models on small machines

Adaptation, Improvisation

It seems that LLM hype might be cooling off slightly. Anecdotally, I hear more reports these days that ChatGPT is not very useful beyond coding and certain creative tasks. There is a good reason for this (besides short attention spans) - we can’t fine-tune these giant models for specific purposes.

In ages past, the big ML breakthroughs were demonstrations of methods. People would then use those methods to train their own models on their own data. However, the new foundation models were trained on trillions of data points, millions of dollars of compute, and months of dedicated supercomputer time. We can’t easily build new models for specific use cases, so we are stuck with the large general-purpose ones.

But using these giant models is not easy. Even GPT3/GPT4, the greatest AI foundation models in history, are mostly unusable by the average consumer without the additional finetuning from RLHF. And while finetuning takes less data and time than full training, loading and training LLMs still requires large compute clusters.

Meta, who has really been impressing with their open source contributions recently, released a whole series of LLMs of different sizes so we can study their compute requirements more precisely. LLaMa, as the models are called, comes in four versions specified by the number of parameters in billions - 7, 13, 33, and 65. LLaMa 7B, the smallest model, takes 30GB of RAM just to load into memory, while the 65B model requires 780GB. And this is just for inference; training requires even more. The best NVidia GPU today, the A100, costs $15,000 and has only 80GB of memory. So for all intents and purposes, fine-tuning large LLMs can only be done by big corporations that can spend a lot of money, GPUs and time. Until…

LoRA is for Low Rank Adaptation

The LoRA paper was published in Sep 2021, i.e. before ChatGPT and the recent AI hype cycle. The authors finetuned GPT3, the 175B-parameter monster that requires 1.2TB of memory to finetune. Using their new method, they were able to reduce that memory usage from 1.2TB to 350GB. This is still too large for most average users, and they also saw a drop in model performance, so the paper went fairly unnoticed at the time.

In May 2023, the QLoRA (Quantized LoRA) paper improved upon this method, and this time the authors were able to benchmark it on the zoo of LLMs of different sizes that we have today. While the memory advantage for GPT3 was about the same, QLoRA reduced the memory requirement of the 65B LLaMa model from 780GB down to 41GB, putting it within the capability of a single GPU. The 7B model was reduced to just 5GB, making the LLaMa models accessible to an average Mac Pro!! And they reportedly saw no measurable drop in performance.

So how did they do that?

Background - Matrix Multiplication Magic

All deep learning is, in essence, matrix multiplication. A neural network is made of multiple layers, each of which is just a matrix. Input and output data can be described as vectors, so a network with n layers transforms the input vector into an output vector simply by multiplying it by each layer matrix in sequence.

V_out = W1 x W2 x W3 x … Wn x V_in

During training, we use known input-output pairs of data and labels and make changes to the layer matrices such that the network satisfies the above relation as closely as possible for as much of the data as possible. At the end of training we have a learned set of matrices whose values are frozen and applied to new inputs to produce new outputs.
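To make this concrete, here is a minimal NumPy sketch of the idea (the layer count and sizes are made up, and the non-linearities that real networks apply between layers are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# A "network" of n layers is just a list of weight matrices W1 ... Wn
# (3 layers of shape 16x16 here, purely for illustration).
layers = [rng.standard_normal((16, 16)) for _ in range(3)]

def forward(v_in, weights):
    # Apply the layers as in the formula above: V_out = W1 x W2 x ... x Wn x V_in,
    # so the last matrix in the list touches the input first.
    v = v_in
    for w in reversed(weights):
        v = w @ v
    return v

v_in = rng.standard_normal(16)
v_out = forward(v_in, layers)
print(v_out.shape)  # (16,)
```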

Intuition - Matrix Decomposition

Say a matrix W is of shape 10×10 and thus has 100 elements. Training the matrix involves changing each element to the desired values.

We can also create a 10×10 matrix as the product of two matrices A and B, where A has shape 10×2 and B has shape 2×10. In this case A and B only have 20 elements each, i.e. 40 elements total. Thus A x B is a low-rank decomposition of W, since we are representing a 100-element matrix using only 40 elements. Training this matrix only requires changing 40 variables instead of 100, which is much more efficient. So if you want to train and make changes to W, it is more efficient to keep W constant, train the elements of A and B, and simply update W => W + A x B.
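Here is a small NumPy sketch of that 10×10 example, just to show the bookkeeping; initializing one factor to zero is a common convention so that A x B starts at zero and the adapted matrix initially equals W:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((10, 10))  # frozen base matrix: 100 elements
A = np.zeros((10, 2))              # trainable factor: 20 elements
B = rng.standard_normal((2, 10))   # trainable factor: 20 elements

print(W.size, A.size + B.size)     # 100 frozen vs 40 trainable elements

# ... training would adjust only A and B here ...

# Fold the learned update back into the frozen matrix: W => W + A x B
W_adapted = W + A @ B
```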

As the size of W increases, the low-rank advantage also increases. In the LoRA paper, the authors decompose all 96 attention layers of GPT3 from rank 64 down to rank 2, thus getting a low-rank representation of 4.7 million elements instead of 175 billion. During fine-tuning, instead of loading the full 175 billion elements of the language model and finetuning them with additional training data, we hold the LLM constant and only update the 4.7-million-element low-rank representation, such that the LLM plus the LoRA update is a fine-tuned model.
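As a rough sketch of what this looks like in code, here is a single linear layer with its big weight matrix frozen and only a small pair of low-rank factors left trainable; the layer size, rank, and initialization are illustrative, not the GPT3 settings from the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=2):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)           # freeze the big matrix W
        self.A = nn.Parameter(torch.zeros(d_out, rank))   # small trainable factor
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # small trainable factor

    def forward(self, x):
        # Base output plus the low-rank update: equivalent to (W + A x B) applied to x
        return self.base(x) + x @ (self.A @ self.B).T

layer = LoRALinear(512, 512, rank=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2048 trainable values vs 512 * 512 = 262144 frozen ones
```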

No Free Lunch

Now you might be thinking this does not make any sense. If we can decompose a large matrix into smaller ones, then why not just make a smaller model in the first place?

We don’t fully understand exactly how complex non-linear interactions create rich representations of information in deep learning models, but we do know some things. We know that it is possible to reorganize 40 elements into 100 or even 100,000 elements. But that does not increase the information content of the matrix, which depends on the number of “independently” varying elements. Therefore the LoRA adapter holds less information than the full model. This is why the performance of LoRA finetuning will generally be lower than that of finetuning the full model, depending on the finetuning task and dataset. While the LoRA paper saw a significant performance decrease, in the QLoRA paper the authors stack other improvements and boost the performance quite a bit. The resulting Guanaco model, a QLoRA model finetuned on LLaMa using just a single GPU, can reach up to 99.3% of ChatGPT’s performance.
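A quick way to see this point: the product of two thin matrices can have as many entries as you like, but its rank, i.e. the number of independent directions it contains, never exceeds the inner dimension. A tiny NumPy check with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 2))
B = rng.standard_normal((2, 1000))

product = A @ B                        # 1,000,000 entries ...
print(np.linalg.matrix_rank(product))  # ... but the rank is still only 2
```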

One huge advantage of LoRA over full finetuning is that we can train many low-rank adapters for the same foundation model. Each adapter can be trained on a different task with a completely different dataset, but they all work with the same foundation model to produce different finetuned models. In the paper, they test 10 different adapters on GPT3 at the same time: the GPT3 checkpoint took 350GB, while each adapter took only 35MB. This is game-changing considering the constraints of mobile computing and data privacy. If many specialized low-rank adapters that fit on your phone can be trained on device for your tasks using your personal data, it puts the full power of the greatest AI advances right in your pocket.
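Here is a toy sketch of that adapter-swapping idea: one frozen base weight matrix shared by several small task-specific adapters, where the task names, sizes, and random values are all hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

W_base = rng.standard_normal((512, 512))  # shared, frozen foundation weights

# Each task gets its own tiny (A, B) pair; only these would be trained and stored.
adapters = {
    "summarize": (rng.standard_normal((512, 2)), rng.standard_normal((2, 512)) * 0.01),
    "translate": (rng.standard_normal((512, 2)), rng.standard_normal((2, 512)) * 0.01),
}

def adapted_weights(task):
    A, B = adapters[task]
    return W_base + A @ B  # same base model, different low-rank update

x = rng.standard_normal(512)
y_summarize = adapted_weights("summarize") @ x
y_translate = adapted_weights("translate") @ x
```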

———————————————————————————————————
There you have it: my intuitions on how LoRA works and how it brings fine-tuning of giant foundation models within reach of single GPUs and home computers. For more such intuitions on AI/ML, subscribe and follow me on Twitter. You can also check out my other projects on nirsd.com.