Gary Gan · 2428 words · 13 min read

Understand LoRA: Low-Rank Adaptation

Note: What follows is just my own understanding. For an authoritative explanation, see the paper. Of course, authoritative explanations are sometimes hard to follow. If you spot a mistake, please correct me in the comments 🥰

What is LoRA

In short, LoRA is a technique for fine-tuning the parameters of a model (a neural network). A model can be regarded as a function with many parameters: you feed it an input, and it outputs a number, some text, or an image. Take ChatGPT as an example. The text you type is first converted into numbers (word embeddings), and those numbers are fed into the model function. The model then outputs a series of numbers, each between $[0,1]$ and each corresponding to a word; the number represents the probability that that word comes next in the sentence. ChatGPT "spits out" words one by one this way.

Since the model is a function, each parameter shapes that function. Take $y = ax + b$: here $a$ and $b$ are the parameters of this linear function, and changing either one shifts the line it describes. The same holds for a model: change a parameter, and the function the model computes changes too.

So suppose you download a model from the Internet for classifying different fruits, but when you use it for a classification task, you find its performance is not very good; for example, it mistakenly recognizes a grape as a raisin. This means the parameters in the model are not good, leading to poor performance. So you want to adjust the model's parameters so that it correctly recognizes grapes as grapes. One way to adjust them is LoRA. In short, if the function the current model computes does not meet your needs, you can use LoRA to fine-tune it.

Note: To give an intuitive feel for what LoRA can do, let me mention Stable Diffusion (AI image-generation software), which I worked with before. The base model determines whether the generated image looks 2D or 3D, and you can attach a LoRA to that model to fine-tune the AI painter. I remember very clearly a LoRA called Ink-and-wash: apply it to the model, and the final image takes on an ink-and-wash style, but the essential character of the image, 2D or 3D, stays the same.

Why fine-tune an existing model instead of training one from scratch?

There are more and more large models these days; basically every major Internet company has released its own. Why not train a large model ourselves instead of fine-tuning one trained by someone else? One of the main reasons is the word "large". Current models have 7B or 12B (billion) parameters. To modify (train) that many parameters, the computer needs, on the one hand, fast compute (a capable GPU), and on the other hand, enough memory to hold all those parameters for the compute units. Take a 7B model as an example. If the parameters are stored as float16, each occupying 2 bytes, then 7B parameters occupy $2 \times 7{,}000{,}000{,}000 = 14{,}000{,}000{,}000$ bytes, which is about 13 GiB. That is, just to run this model on a GPU, the card needs roughly 13 GB of video memory; and to adjust the parameters, the computer must also reserve space to store a gradient for every parameter, so training needs at least $13 \times 2 = 26$ GB. Consumer-grade graphics cards generally don't have that much video memory: the best-performing consumer GPU today, the RTX 4090, tops out at 24 GB. Professional compute cards such as the A100 reach 80 GB, but they are expensive.
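The memory arithmetic above can be checked in a few lines of Python. This is only a back-of-envelope estimate: it counts weights and gradients and ignores optimizer states and activations, which make real training even more memory-hungry.

```python
# Rough memory estimate for a 7B-parameter model stored as float16.
# "GiB" here means 2**30 bytes.
n_params = 7_000_000_000
bytes_per_param = 2  # float16 occupies 2 bytes per parameter

weights_bytes = n_params * bytes_per_param
weights_gib = weights_bytes / 2**30
print(f"weights alone: {weights_gib:.1f} GiB")      # about 13 GiB

# Training also needs at least one gradient per parameter,
# so a crude lower bound is twice the weight memory.
train_gib = 2 * weights_gib
print(f"weights + gradients: {train_gib:.1f} GiB")  # about 26 GiB
```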

Since training a large model requires many professional-grade graphics cards, only large Internet companies can afford it. Some of them open-source their trained models, i.e. make them public, so that others can adjust the parameters starting from them.

Therefore, to adjust a model's parameters on an ordinary consumer-grade graphics card, i.e. to change the model's behavior on an ordinary person's computer, we can only fine-tune some of its parameters; we cannot adjust all of them the way large companies do when training from scratch. Roughly speaking, if getting the model we need counts as 100%, the initial training done by a large company accounts for 80%, and the fine-tuning we do on our own computer accounts for the remaining 20%.

LoRA

As mentioned above, to change a model's behavior on an ordinary computer we can only fine-tune some of its parameters. There are many ways to do that, and one of the most common is LoRA.

There is a formula in the paper that I think is particularly important:

$$ \begin{equation} h=W_0x+\Delta Wx=W_0x+BAx=(W_0+BA)x \tag{1} \end{equation} $$

Here, $W_0$ is the original parameter of the model, in matrix form. According to the paper, the numbers in $W_0$ stay frozen, while $\Delta W = BA$ is another matrix, and LoRA changes only this one. Adding $W_0$ and $\Delta W$ yields a new parameter $W_1$. Applying $W_1$ to the model changes its parameters, hence the function it computes, which achieves the fine-tuning.
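Equation (1) can be sketched in a few lines of NumPy. The shapes here are tiny and purely illustrative (real layers are far larger), and the zero initialization of $B$ follows the paper, so the model starts out unchanged:

```python
import numpy as np

# Minimal sketch of Equation (1): h = W0 x + B A x.
rng = np.random.default_rng(0)

d, r = 3, 1
W0 = rng.normal(size=(d, d))  # frozen pre-trained weight (never updated)
A = rng.normal(size=(r, d))   # trainable, random Gaussian init (as in the paper)
B = np.zeros((d, r))          # trainable, zero init (as in the paper),
                              # so B A = 0 and the output is unchanged at first
x = rng.normal(size=(d,))

h = W0 @ x + B @ (A @ x)      # LoRA forward pass
```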

Then we hit a problem. By the rules of matrix addition, two matrices can only be added if they have the same size. For example, in $Equation (2)$ below, $W_0$ is $3 \times 3$, so for $\Delta W$ to be added to $W_0$, $\Delta W$ must also be $3 \times 3$. But if we change $\Delta W$ directly, i.e. change all $3 \times 3 = 9$ parameters, why not just change the original parameter $W_0$ instead? The two matrices are the same size anyway. Then our goal is defeated: to change the model's behavior on an ordinary person's computer, we can only fine-tune some of the parameters, because consumer-grade compute and memory cannot handle changing the billions of parameters of the original model.

$$ \begin{equation} W_0 = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} \tag{2} \end{equation} $$

To solve this problem, the paper proposes a new method:

LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen

I think the key phrase in this sentence is rank decomposition.

What is rank decomposition

By the rules of matrix multiplication, if matrix $A$ is $4 \times 3$ and matrix $B$ is $3 \times 4$, then their product $AB$ is $4 \times 4$.

Let me use the equations above as an example. From $Equation (2)$ we know $W_0$ is $3 \times 3$, and from matrix addition, $\Delta W$ must also be $3 \times 3$. Here comes the key point: to obtain $\Delta W$, we let $\Delta W = BA$, as shown in $Equation (3)$. In other words, we turn $\Delta W$ into a product of two matrices. $B$ is $3 \times r$ and $A$ is $r \times 3$, where $r$ is a variable we can choose freely; this $r$ is very important, and I will come back to it later. By matrix multiplication, the product $BA$ is $3 \times 3$, so it can be added to $W_0$.

$$ \text{ when } r=1, \ B = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}, A = \begin{pmatrix} a_1 & a_2 & a_3 \end{pmatrix}, \\ \Delta W = B A = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} \begin{pmatrix} a_1 & a_2 & a_3 \end{pmatrix} = \begin{pmatrix} b_1 a_1 & b_1 a_2 & b_1 a_3 \\ b_2 a_1 & b_2 a_2 & b_2 a_3 \\ b_3 a_1 & b_3 a_2 & b_3 a_3 \end{pmatrix} \tag{3} $$

From $Equation (3)$ we can see that a large matrix with 9 parameters is decomposed into two small matrices with 3 parameters each. Rank decomposition can be loosely understood as decomposing a large matrix into a product of small ones. To obtain $\Delta W$, we only need to change the 6 parameters in the two small matrices, a clear reduction from the 9 parameters of $\Delta W$ itself.
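The $r = 1$ case of $Equation (3)$ is easy to verify numerically; the specific values of $B$ and $A$ below are arbitrary illustrations:

```python
import numpy as np

# Equation (3) with r = 1: a 3x3 update built from a 3x1 matrix B
# and a 1x3 matrix A.
B = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1): 3 parameters
A = np.array([[4.0, 5.0, 6.0]])      # shape (1, 3): 3 parameters

dW = B @ A                           # shape (3, 3): 9 entries, but only rank 1
print(dW.shape)                      # (3, 3), so it can be added to W0
print(B.size + A.size)               # 6 trainable parameters instead of 9
print(np.linalg.matrix_rank(dW))     # 1
```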

The reduction in trainable parameters that LoRA brings is significant. The paper mentions:

When the pre-trained model is GPT-3 175B, the number of trainable parameters can be as small as 0.01% of $W_0$

That is, when training GPT-3 with LoRA, the trainable parameters can be as few as 0.01% of the full model's parameters.

Choosing the $r$ (rank) parameter in LoRA

As mentioned above, $\Delta W$ is decomposed into a product of two matrices, where $B$ is $3 \times r$ and $A$ is $r \times 3$. From linear algebra, the rank of a product of two matrices cannot exceed the smaller of the two factors' ranks. In other words, if we choose $r$ less than 3, the rank of $BA$ will be at most $r$, hence less than 3. In fact, for this example $r$ must be small anyway: if we chose $r = 4$, the parameters to be changed (trained) would number $3 \times 4 + 4 \times 3 = 24$, more than in the original matrix, which defeats the purpose of LoRA, namely reducing the number of trainable parameters. (In this tiny $3 \times 3$ example only $r = 1$ actually saves parameters; in real models, where the weight matrices are huge, any $r$ far below the matrix dimensions yields large savings.)
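The trade-off above is just counting: for a $d \times k$ weight, LoRA trains $r(d + k)$ numbers versus $dk$ for the full update. A small helper (my own illustration, not from the paper) makes the comparison concrete:

```python
# Trainable-parameter count for a d x k weight updated via LoRA:
# B is d x r and A is r x k, so LoRA trains r * (d + k) numbers,
# versus d * k for a full update of Delta W.
def lora_params(d, k, r):
    return r * (d + k)

d = k = 3
print(lora_params(d, k, 1))  # 6, fewer than the 9 in the full 3x3 update
print(lora_params(d, k, 4))  # 24, more than 9: defeats the purpose

# For realistic layer sizes the savings are dramatic:
d = k = 4096
print(lora_params(d, k, 8))  # 65536 trainable vs 16777216 full parameters
```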

There is a problem here. Suppose the original parameter matrix $W_0$ of the model is full rank (rank 3 for a $3 \times 3$ matrix), but the rank of the $\Delta W$ we get from LoRA is lower than that of $W_0$. In other words, LoRA can only fine-tune a subspace of the model's parameter matrix. Put simply, the model needs all three dimensions of a three-dimensional coordinate system to solve its problem, but if we use LoRA with $r = 1$, we change only one dimension of the model and ignore the other two. In that case, no matter how we change the parameters, the result is relatively poor.

Since LoRA can only fine-tune a subspace of the model's parameter matrix, why use it at all? Either way, one or two dimensions are left out.

Fortunately, from the paper we know that "the learned over-parametrized models in fact reside on a low intrinsic dimension". In other words, we model the problem in a three-dimensional coordinate system, but in the end the model effectively uses only a line, a single dimension. This means the column vectors of the $3 \times 3$ parameter matrix $W_0$ are linearly dependent, and the effective rank of the parameters is actually 1; this is what the paper calls low intrinsic dimension. In that case, setting $r = 1$ works well: we change only the 6 parameters in the two small matrices, far fewer than the 9 in $\Delta W$. We could also set $r = 2$, but that raises the number of trainable parameters to 12, which has its own cost.
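The "low intrinsic dimension" idea can be illustrated with a toy matrix (my own example, not from the paper): a $3 \times 3$ matrix whose columns are all multiples of one vector has rank 1, so a rank-1 update already spans the directions that matter.

```python
import numpy as np

# A 3x3 matrix with linearly dependent columns: every column is a
# multiple of v, so the matrix has rank 1 despite its 3x3 size.
v = np.array([[1.0], [2.0], [3.0]])
W = v @ np.array([[1.0, 2.0, 3.0]])  # outer product

print(W.shape)                       # (3, 3)
print(np.linalg.matrix_rank(W))      # 1
```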

Since we don't know how many dimensions the model actually uses, we have to try values of $r$ by hand to find the best one; such a manually tuned value is called a hyperparameter. If $r$ is too large, the trainable parameters increase; if $r$ is too small, the training result suffers.

Advantages of LoRA

For comparison, let me first describe other methods that can fine-tune model parameters:

  • Adapter Layers: This amounts to changing the model's architecture: some extra layers are added to the model, and only their parameters are trained. The problem is that this lengthens the model's inference time, which the literature calls inference latency. By analogy, think of the model as a cement road from A to B. Your input is a car at A; the car drives the road to B, and the output is the result. The drive takes time. Adapter Layers add a short new stretch of road and optimize only that stretch, but the route gets longer, so the trip takes longer; that is the added inference latency. LoRA, by contrast, is like upgrading part of the existing road without changing its length (not that the car goes faster: LoRA does not speed up inference, it just doesn't slow it down).

  • Prefix tuning: Prefix tuning is essentially prompt engineering and does not change the model directly. The prompt is split into a prefix and a suffix: the user supplies the suffix, while the prefix tokens are reserved for fine-tuning. This eats into the sequence length available for the user's input, which is its drawback.
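The "no extra latency" point about LoRA in the first bullet follows from Equation (1): after training, $BA$ can be folded into the frozen weight once, so the deployed layer is a single matrix multiply, exactly like the original model. A toy check (illustrative shapes and random values):

```python
import numpy as np

# After training, merge the LoRA update into the frozen weight once,
# offline; inference then costs the same as the original model.
rng = np.random.default_rng(0)
d, r = 3, 1
W0 = rng.normal(size=(d, d))   # frozen pre-trained weight
B = rng.normal(size=(d, r))    # trained LoRA factors (random stand-ins here)
A = rng.normal(size=(r, d))

W_merged = W0 + B @ A          # one-time merge: same shape as W0

x = rng.normal(size=(d,))
# The merged single multiply equals the two-branch LoRA forward pass.
assert np.allclose(W_merged @ x, W0 @ x + B @ (A @ x))
```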
