Note: The following is just my own interpretation. For an authoritative explanation, please refer to the paper. Of course, authoritative explanations can sometimes be difficult to understand, so if you have any questions, feel free to point them out in the comments 🥰
What is LoRA#
In simple terms, LoRA is a technique for fine-tuning the parameters of a model (a neural network). You can think of a model as a function with many parameters: you input something, and the model outputs a number, text, or an image. Taking ChatGPT as an example, the text you input is first converted into numbers (Word Embedding), and then these numbers are fed into the model function, which produces a series of output numbers, each of which is between $0$ and $1$. Each number corresponds to a word and represents the probability of that word being the next word in the sentence. ChatGPT "spits out" words one by one in this way.
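To make this concrete, here is a minimal sketch in plain NumPy of how a model's raw output scores could be turned into per-word probabilities; the tiny vocabulary and the scores are made up purely for illustration:

```python
import numpy as np

# A made-up vocabulary and made-up raw output scores (logits).
vocab = ["cat", "dog", "apple"]
logits = np.array([2.0, 1.0, 0.1])

# Softmax squashes the scores into probabilities between 0 and 1
# that sum to 1; each probability belongs to one word.
probs = np.exp(logits) / np.exp(logits).sum()

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.2f}")  # the most likely next word gets the highest value
```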
Since the model is a function, each parameter in the function can be changed. Take the linear function $y = kx + b$, where $k$ and $b$ are the parameters: if you modify either of them, the position of the line represented by this linear function will change. The same goes for the model; if you change a parameter in the model, the functionality of the model will also change.
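As a toy illustration (this example is mine, not from the paper), here is that linear function in Python, showing how changing a parameter changes what the function does:

```python
# A toy "model": a linear function y = k * x + b with two parameters.
def model(x, k, b):
    return k * x + b

print(model(2.0, k=1.0, b=0.0))  # 2.0: the original line
print(model(2.0, k=3.0, b=0.5))  # 6.5: changing k and b moves the line
```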
So, suppose you download a model online that classifies different fruits, but when you use it for classification tasks, you find that its performance is not great: for example, it mistakenly identifies a grape as a raisin. This indicates that the parameters in the model are not good, leading to the poor performance. Therefore, you want to adjust the parameters so that the model correctly identifies a grape as a grape. One way to adjust the model parameters is LoRA. In simple terms, if you feel that the current model's functionality does not meet your requirements, you can use LoRA to fine-tune it.
Note: To intuitively feel the capabilities of LoRA, I want to mention my previous experience with Stable Diffusion (AI drawing software), where the model affects whether the generated image is 2D or 3D, and you can add LoRA to this model to fine-tune the AI artist. I clearly remember there was one called Ink Wash Style LoRA. If you apply this LoRA to the model, the final image will have an ink wash style, but the essential difference in the image—whether it is 2D or 3D—remains unchanged.
Why fine-tune existing models instead of training a model from scratch#
Currently, there are more and more large models. Basically, every major internet company has released its own large model. Why don't we train a large model ourselves, instead of fine-tuning models that others have already trained? One major reason is size. The parameters of current models often reach 7B or 12B (billion). Modifying (training) so many parameters requires fast computing power (high GPU and CPU performance) and enough memory to hold these parameters during computation. For example, take a 7B model: if the parameter type is float16, with each parameter occupying 2 bytes, then 7B parameters occupy $7 \times 10^9 \times 2 = 1.4 \times 10^{10}$ bytes, which is about 13GB. This means that if you want to run this model on a GPU, the graphics card needs 13GB of memory. Moreover, if you want to adjust the parameters, the computer must also reserve space to store a gradient for each parameter, so to train this model you need at least $13 \times 2 = 26$GB. Such large memory is generally not available on consumer-grade graphics cards; for instance, the best-performing consumer GPU, the 4090, has at most 24GB of memory. Professional compute cards like the A100 can reach 80GB, but they are very expensive.
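The back-of-the-envelope arithmetic above can be written out explicitly. This is only the rough estimate from the text: it ignores optimizer states and activations, which in practice require even more memory:

```python
# Rough memory estimate for a 7B-parameter model stored in float16.
num_params = 7e9          # 7 billion parameters
bytes_per_param = 2       # float16 occupies 2 bytes per parameter

weights_gb = num_params * bytes_per_param / 1024**3
print(f"weights: ~{weights_gb:.0f} GB")       # ~13 GB just to load the model

# Training also needs one gradient per parameter (same dtype),
# so the minimum requirement roughly doubles.
print(f"training: ~{2 * weights_gb:.0f} GB")  # ~26 GB
```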
Since training a large model requires many professional-grade GPUs, only large internet companies can afford to train them. Some large companies will open-source their trained models, meaning they make their trained models public, allowing others to adjust the parameters based on them.
Therefore, to adjust a model's parameters on a regular consumer-grade GPU, that is, to change the model's behavior on an ordinary person's computer, we can only fine-tune some of the parameters; it is impossible to adjust all of them in the same way that large companies train models. You could say that if the model we need counts as 100%, the initial training by the large company accounts for 80%, and the fine-tuning on our own computer accounts for the remaining 20%.
LoRA#
As I mentioned above, ordinary people's computers can change a model's behavior only by fine-tuning some of its parameters. There are various ways to do this fine-tuning, and one of the most common methods is LoRA.
The paper mentions a formula that I think is quite important:

$$W = W_0 + \Delta W$$

Here, $W_0$ is the original parameters of the model, represented in matrix form. According to the paper, the numbers in $W_0$ remain unchanged, while $\Delta W$ is another matrix, and LoRA mainly changes this matrix. By adding $W_0$ and $\Delta W$, we obtain the new parameters $W$. When we apply these new parameters to the model, the model's parameters change, and the functionality of the model changes as well, achieving the purpose of fine-tuning.
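Here is a minimal NumPy sketch of this formula (the matrices are random placeholders, not real model weights):

```python
import numpy as np

W0 = np.random.randn(3, 3)    # frozen pre-trained weights
delta_W = np.zeros((3, 3))    # the update that fine-tuning will learn
delta_W[0, 0] = 0.5           # pretend fine-tuning changed this entry

W = W0 + delta_W              # the new parameters used by the model
print(W - W0)                 # only the learned change remains
```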
Now we encounter a problem. According to matrix addition, two matrices can only be added if they have the same size. For example, in $(1)$ below, the size of the matrix $W_0$ is $3 \times 3$:

$$W = W_0 + \Delta W =
\begin{bmatrix}
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23} \\
w_{31} & w_{32} & w_{33}
\end{bmatrix}
+ \Delta W \tag{1}$$

If $\Delta W$ is to be added to $W_0$, then $\Delta W$ must also be $3 \times 3$. But if we directly change this $\Delta W$, that is, change all nine of its parameters, why not just change the parameters in the original model directly, since the two matrices are the same size? Either way, our goal would not be achieved: given the computing power and memory of consumer-grade computers, it is impossible to change the enormous number of parameters in the original model.
To solve this problem, the paper proposes a new method:
LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen.
In this sentence, I think the key term is rank decomposition.
What is rank decomposition#
According to matrix multiplication, if matrix $A$ has a size of $m \times n$ and matrix $B$ has a size of $n \times p$, then their product $AB$ will have a size of $m \times p$.
I will use the previous equation as an example. From $(1)$, we can see that the size of $W_0$ is $3 \times 3$. According to matrix addition, $\Delta W$ must also be $3 \times 3$. The key point is that to obtain $\Delta W$, we can let $\Delta W = AB$, as shown in $(2)$ (taking $r = 1$ as an example):

$$\Delta W = AB =
\begin{bmatrix} a_{1} \\ a_{2} \\ a_{3} \end{bmatrix}
\begin{bmatrix} b_{1} & b_{2} & b_{3} \end{bmatrix} \tag{2}$$

This means that we transform $\Delta W$ into the product of two matrices. Here, matrix $A$ has a size of $3 \times r$, and matrix $B$ has a size of $r \times 3$, where $r$ is a variable that can be changed. This is also very important, and I will explain it later. According to matrix multiplication, the result of $AB$ will have a size of $3 \times 3$. This way, it can be added to $W_0$.
From $(2)$, we can see that a large matrix with 9 parameters is decomposed into the product of two smaller matrices with 3 parameters each. Rank decomposition can be simply understood as decomposing a large matrix into the product of smaller matrices. Thus, to obtain $\Delta W$, we only need to change the 6 parameters in the two small matrices, which is significantly fewer than the 9 parameters in $\Delta W$ itself.
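A quick NumPy check of this decomposition, with $r = 1$ as in $(2)$ (the actual numbers are arbitrary):

```python
import numpy as np

r = 1
A = np.random.randn(3, r)  # 3 trainable parameters
B = np.random.randn(r, 3)  # 3 trainable parameters

delta_W = A @ B            # the product has shape (3, 3)
print(delta_W.shape)       # (3, 3), so it can be added to W0
print(A.size + B.size)     # 6 trainable parameters instead of 9
```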
The reduction in trainable parameters brought about by LoRA is significant. The paper mentions:
When the pre-trained model is GPT-3 175B, the number of trainable parameters can be as small as 0.01% of $W_0$.
This means that when using LoRA to fine-tune GPT-3, the trainable parameters are only 0.01% of the model's parameters.
About the selection of the r parameter in LoRA#
From the previous discussion, we know that $\Delta W$ is decomposed into the product of two matrices, where matrix $A$ has a size of $3 \times r$ and matrix $B$ has a size of $r \times 3$. According to linear algebra, the rank of the product of two matrices will not exceed the minimum of the ranks of the two original matrices. This means that if we choose $r$ to be less than 3, then the rank of the matrix obtained by multiplying $A$ and $B$ will not exceed $r$, which is less than 3. In fact, for this example, we must choose a small $r$. If we chose $r$ to be 4, then the number of parameters to be changed (trained) would be $3 \times 4 + 4 \times 3 = 24$, which is more than the 9 parameters of the original matrix, contradicting our purpose of using LoRA, namely to reduce the number of trainable parameters.
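To see how the trainable-parameter count grows with $r$ in this $3 \times 3$ example (a toy calculation of mine, not from the paper):

```python
# Trainable parameters for a 3x3 delta_W decomposed as A (3 x r) times B (r x 3).
for r in range(1, 5):
    n_trainable = 3 * r + r * 3
    print(f"r={r}: {n_trainable} trainable parameters (the full matrix has 9)")
# r=1: 6, r=2: 12, r=3: 18, r=4: 24
# Only r=1 actually saves parameters in this tiny example; for real models,
# the matrices are much larger, so many values of r give big savings.
```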
Herein lies a problem: the original parameter matrix $W_0$ may be full rank, while the rank of the $\Delta W$ obtained through LoRA is at most $r$, which is less than the rank of $W_0$. This means that LoRA can only fine-tune a subspace of the model's parameter matrix. Still using $(1)$ as an example, suppose the rank of $W_0$ is 3. Simply put, this model needs a three-dimensional coordinate system to model the problem. However, if we use LoRA and set $r = 1$, we are only changing one dimension of the model, while the other two dimensions are not touched. In this case, no matter how we change the model's parameters, the effect will be relatively poor.
Since LoRA can only fine-tune a subspace of the model's parameter matrix, why should we use it at all? After all, there will always be one or two dimensions that are not covered.
Fortunately, as the paper notes (citing earlier research), "the learned over-parameterized models in fact reside on a low intrinsic dimension." This means that we give the model a three-dimensional coordinate system for modeling, but in the end the model may only make use of a line; it only uses one of the dimensions. This indicates that the column vectors of the model parameter matrix $W_0$ are linearly dependent, and the rank of the model parameters is actually 1, which is what the paper means by a low intrinsic dimension. In this case, if we set $r = 1$, the effect will be good: we only need to change the 6 parameters in the two small matrices, which is significantly fewer than the 9 parameters in $W_0$. We could also set $r = 2$, but that would increase the trainable parameters to $3 \times 2 + 2 \times 3 = 12$, which is a side effect.
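To illustrate linear dependence and rank (a contrived example; real weight matrices are only approximately low-rank):

```python
import numpy as np

# Every column is a multiple of the first column, so the columns are
# linearly dependent and the rank is 1 despite the 3x3 size.
W0 = np.array([[1.0, 2.0, 3.0],
               [2.0, 4.0, 6.0],
               [3.0, 6.0, 9.0]])
print(np.linalg.matrix_rank(W0))  # 1
```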
Since we do not know in advance how many dimensions the model actually uses for modeling, the parameter $r$ needs to be adjusted manually to find the optimal value. This type of parameter is called a hyperparameter. If $r$ is too large, the number of trainable parameters increases; if $r$ is too small, the training effect is poor.
Advantages of LoRA#
For comparison, I will first discuss other methods of fine-tuning model parameters.
- Adapter Layers: This approach effectively changes the model architecture by adding some layers to the model and training only the parameters of those layers. However, there is a problem: it increases the inference time, which is technically referred to as inference latency. To draw an analogy, think of the model as a concrete road connecting A and B. Your input is a car at point A, and this car must travel along the concrete road to reach B and output the result; the car takes time on the road. Adapter layers add a small extra section of road along the way and optimize only that section, but the journey becomes longer, so the car's travel time increases; that is the increased inference time. LoRA, on the other hand, is like upgrading part of the road to a highway without changing the road's length (this implies better performance, not faster speed; LoRA does not speed up inference, but it also adds no extra latency). A sketch of why LoRA adds no latency appears after this list.
- Prefix Tuning: Prefix tuning is essentially a form of prompt engineering and does not directly change the model. It splits the prompt into a prefix and a suffix: the user supplies the suffix, while the prefix is the part that gets fine-tuned. Because part of the input length is reserved for the prefix, the length available for the user's own input shrinks, which is its drawback.
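As mentioned in the adapter-layer comparison above, one reason LoRA adds no inference latency is that the learned update can be merged back into the original weights before deployment. Here is a minimal sketch of that idea (the shapes and values are illustrative only):

```python
import numpy as np

W0 = np.random.randn(3, 3)  # frozen pre-trained weights
A = np.random.randn(3, 1)   # learned LoRA factors (r = 1)
B = np.random.randn(1, 3)

# During training, only A and B are updated. At inference time they can
# be folded into a single matrix, so the model keeps exactly the same
# shape and layer count as before: no extra "road" for the car.
W_merged = W0 + A @ B

x = np.random.randn(3)
assert np.allclose(W_merged @ x, W0 @ x + A @ (B @ x))
```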