Neural Network Derivation - Matrix Calculus
To get a clear picture of how the forward and backward passes of a neural network are derived, I went through a lot of material. The forward pass is straightforward; the crux is how to compute the gradients in the backward pass.
The approach recommended by the cs231n course is to first take derivatives element by element and then generalize to gradients over batched data; see
Derivatives, Backpropagation, and Vectorization - CS231n
Vector, Matrix, and Tensor Derivatives - CS231n
Backpropagation and Neural Networks - CS231n
Following these references, I also worked through the element-by-element derivation for a 2-layer network myself.
The cleanest approach, of course, is to differentiate directly in matrix form; I read many blog posts on this, and a few of them cover it well.
The core tools for matrix calculus in neural networks are the first-order differential of a real-valued scalar function of a matrix and the identification of the \(Jacobian\) matrix; see 《矩阵分析与应用》 (Matrix Analysis and Applications) for the background.
With matrix differentials, the backward pass of a neural network can be derived very conveniently: the key step is to identify the \(Jacobian\) matrix and then convert it into the gradient matrix.
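The identification rule used throughout this post can be stated in one line (added here as a reminder; it follows the convention of 《矩阵分析与应用》): for a real-valued scalar function \(f(X)\) of a matrix \(X\),
\[ df=tr(A\cdot dX)\Rightarrow D_{X}f(X)=A,\ \ \bigtriangledown_{X}f(X)=A^{T} \]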
Derivation 1
The article [矩阵求导]神经网络反向传播梯度计算数学原理 presents a very good way to do the derivation: first give the implementation code, then explain the code step by step with matrix calculus.
The PyTorch tutorial Learning PyTorch with Examples gives a numpy implementation of a 2-layer neural network.
```python
# -*- coding: utf-8 -*-
```
Step 1: define the network dimensions
- batch size \(N=64\)
- number of input-layer neurons \(D_{in}=1000\)
- number of hidden-layer neurons \(H=100\)
- number of output-layer neurons \(D_{out}=10\)
```python
# N is batch size; D_in is input dimension;
```
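Here and in the following steps, a short numpy sketch is given after each snippet; the sketches follow the formulas in the text and the surviving variable names, and are illustrations consistent with the derivation rather than the tutorial's exact listing:
```python
import numpy as np

# Dimensions from the list above
N, D_in, H, D_out = 64, 1000, 100, 10
```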
Step 2: initialize the data, the weights (this network has no bias vectors), and the learning rate
- input data \(x\in R^{N\times D_{in}}\)
- output data \(y\in R^{N\times D_{out}}\)
- hidden-layer weight matrix \(w1\in R^{D_{in}\times H}\)
- output-layer weight matrix \(w2\in R^{H\times D_{out}}\)
```python
# Create random input and output data
```
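A sketch of this initialization, continuing the snippet above (the learning-rate value is an assumption, not given in the text):
```python
# Random data and weights with the shapes listed above
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6  # assumed value, not given in the text
```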
Step 3: inside the training loop, feed the batch through the network (forward pass)
\[ h=x\cdot w1\\ h_{relu}=max(0, h)\\ y_{pred}=h_{relu}\cdot w2 \]
```python
# Forward pass: compute predicted y
```
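In numpy the forward pass above can be sketched as (continuing the earlier snippets):
```python
h = x.dot(w1)              # hidden pre-activation, shape (N, H)
h_relu = np.maximum(h, 0)  # ReLU
y_pred = h_relu.dot(w2)    # predictions, shape (N, D_out)
```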
Step 4: inside the training loop, compute the loss (sum of squared errors, i.e. the squared Frobenius norm)
\[ loss=\begin{Vmatrix} y_{pred}-y \end{Vmatrix}^{2} \]
```python
# Compute and print loss
```
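A one-line numpy version of this loss, consistent with the formula above:
```python
# Sum of squared errors, i.e. the squared Frobenius norm of (y_pred - y)
loss = np.square(y_pred - y).sum()
print(loss)
```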
Step 5: inside the training loop, backward pass: compute the gradient with respect to the network output \(y_{pred}\) (the input to the loss)
Let \(y_{pred}-y=X\), where \(X\) has size \(N\times D_{out}\); then
\[ loss=\begin{Vmatrix} X \end{Vmatrix}^{2}= (vec(X))^{T}\cdot vec(X) \]
Taking the differential of the loss \(loss(y_{pred})\) with respect to \(y_{pred}\):
\[ dloss=d(tr(loss))=tr(dloss)=tr(d((vec(X))^{T}\cdot vec(X) ))\\ =tr(d(vec(X)^{T})\cdot vec(X)+vec(X)^{T}\cdot dvec(X))\\ =tr(d(vec(X)^{T})\cdot vec(X))+tr(vec(X)^{T}\cdot dvec(X))\\ =tr((dvec(X))^{T}\cdot vec(X))+tr(vec(X)^{T}\cdot dvec(X))\\ =tr((vec(X))^{T}\cdot dvec(X))+tr(vec(X)^{T}\cdot dvec(X))\\ =tr(2(vec(X))^{T}\cdot dvec(X))\\ =tr(2X^{T}\cdot dX) \]
So the Jacobian matrix is \(D_{X}f(X)=2X^{T}\) and the gradient matrix is \(\bigtriangledown_{X}f(X)=2X=2(y_{pred}-y)\)
```python
grad_y_pred = 2.0 * (y_pred - y)
```
Step 6: inside the training loop, backward pass: compute the gradients of the output-layer weight matrix \(w2\) and of the hidden-layer output \(h_{relu}\)
\[ y_{pred}=h_{relu}\cdot w2 \Rightarrow dy_{pred}=dh_{relu}\cdot w2+h_{relu}\cdot dw2 \]
\[ dloss=tr(2X^{T}\cdot dX) =tr(2(y_{pred} - y)^{T}\cdot d((y_{pred} - y))) =tr(2(y_{pred} - y)^{T}\cdot dy_{pred})\\ =tr(2(y_{pred} - y)^{T}\cdot (dh_{relu}\cdot w2+h_{relu}\cdot dw2))\\ =tr(2(y_{pred} - y)^{T}\cdot dh_{relu}\cdot w2)+tr(2(y_{pred} - y)^{T}\cdot h_{relu}\cdot dw2)\\ =tr(w2\cdot 2(y_{pred} - y)^{T}\cdot dh_{relu})+tr(2(y_{pred} - y)^{T}\cdot h_{relu}\cdot dw2) \]
For the output-layer weights, the Jacobian matrix is \(2(y_{pred} - y)^{T}\cdot h_{relu}\) and the gradient matrix is \((h_{relu})^{T}\cdot 2(y_{pred} - y)\)
For the hidden-layer output, the Jacobian matrix is \(w2\cdot 2(y_{pred} - y)^{T}\) and the gradient matrix is \(2(y_{pred} - y)\cdot (w2)^{T}\)
```python
grad_w2 = h_relu.T.dot(grad_y_pred)
```
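This step also needs the gradient on the hidden-layer output; a sketch matching the identification above, assuming grad_y_pred from the previous snippet:
```python
grad_h_relu = grad_y_pred.dot(w2.T)  # shape (N, H), i.e. 2(y_pred - y) times w2 transposed
```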
Step 7: inside the training loop, backward pass: compute the gradient with respect to the hidden-layer pre-activation \(h\)
\[ h_{relu}=max(0, h) \Rightarrow dh_{relu}=\left\{\begin{matrix} dh & h\geq 0\\ 0 & h < 0 \end{matrix}\right. =1(h\geq 0)*dh \]
The activation function operates element-wise, so the Hadamard product is used:
\[ dloss=tr(w2\cdot 2(y_{pred} - y)^{T}\cdot dh_{relu})\\ =tr(w2\cdot 2(y_{pred} - y)^{T}\cdot 1(h\geq 0)* dh)\\ =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}\cdot 1(h\geq 0)* dh)\\ =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot dh) \]
So the Jacobian matrix is \((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\) and the gradient matrix is
\[ \bigtriangledown_{h}f(h)=1(h\geq 0)* (2(y_{pred} - y)\cdot (w2)^{T}) \]
```python
grad_h = grad_h_relu.copy()
```
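In numpy the Hadamard product with \(1(h\geq 0)\) becomes a masking operation applied to the copy made above (a sketch continuing the previous snippets):
```python
grad_h[h < 0] = 0  # keep the gradient only where h >= 0
```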
Step 8: inside the training loop, backward pass: compute the gradient of the hidden-layer weight matrix \(w1\)
\[ h=x\cdot w1 \Rightarrow dh=x\cdot dw1 \]
\[ dloss =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot dh)\\ =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot x\cdot dw1) \]
So the Jacobian matrix is \((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot x\) and the gradient matrix is
\[ \bigtriangledown_{w1}f(w1)=((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot x)^{T}\\ =x^{T}\cdot (1(h\geq 0)* (2(y_{pred} - y)\cdot (w2)^{T})) \]
```python
grad_w1 = x.T.dot(grad_h)
```
Step 9: inside the training loop, update the weight matrices
```python
# Update weights
```
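A plain gradient-descent update consistent with the steps above (a sketch; the step size was assumed earlier):
```python
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
```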
Derivation 2
The cs231n note Putting it together: Minimal Neural Network Case Study implements a 2-layer neural network.
```python
N = 100 # number of points per class
```
Step 1: set up the batched input and output data
- batch size \(N=100\)
- data dimensionality \(D=2\)
- number of classes \(K=3\)
- input data \(X\in R^{N\times D}\)
- output data \(y\in R^{N\times K}\)
```python
N = 100 # number of points per class
```
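A placeholder data setup with the shapes listed above (a sketch: the course notes build a toy spiral dataset, and in the course code N counts points per class, while the derivation simply treats N as the number of rows of X; random data is used here only to make the shapes concrete, with y holding integer labels as in the course code and Y the one-hot matrix used in the math):
```python
import numpy as np

N, D, K = 100, 2, 3               # number of examples, input dimension, classes
X = np.random.randn(N, D)         # input data, shape (N, D)
y = np.random.randint(K, size=N)  # integer class labels, shape (N,)
Y = np.eye(K)[y]                  # one-hot label matrix used in the derivation, shape (N, K)
```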
Step 2: initialize the weight parameters
- number of hidden-layer neurons \(h=100\)
- hidden-layer weight matrix \(W\in R^{D\times h}\)
- hidden-layer bias vector \(b\in R^{1\times h}\)
- output-layer weight matrix \(W2\in R^{h\times K}\)
- output-layer bias vector \(b2\in R^{1\times K}\)
```python
# initialize parameters randomly
```
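A sketch of the parameter initialization with the shapes above (the 0.01 scale is an assumption):
```python
h = 100                            # hidden layer size
W = 0.01 * np.random.randn(D, h)   # small random init (scale is an assumption)
b = np.zeros((1, h))
W2 = 0.01 * np.random.randn(h, K)
b2 = np.zeros((1, K))
```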
Step 3: set the learning rate and the regularization strength
```python
# some hyperparameters
```
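Hypothetical values for these two hyperparameters (assumptions, not taken from the text):
```python
step_size = 1e-0  # learning rate (assumed value)
reg = 1e-3        # regularization strength (assumed value)
```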
Step 4: inside the training loop, feed the batch through the network (forward pass)
\[ hiddenLayer = max(X\cdot W+b, 0)\\ scores = hiddenLayer\cdot W2+b2 \]
```python
# evaluate class scores, [N x K]
```
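A numpy sketch of this forward pass, continuing the snippets above:
```python
hidden_layer = np.maximum(0, X.dot(W) + b)  # ReLU hidden layer, shape (N, h)
scores = hidden_layer.dot(W2) + b2          # class scores, shape (N, K)
```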
Step 5: inside the training loop, compute the loss
\[ expScores = exp(scores)\\ probs = \frac {expScores}{expScores\cdot 1}\\ correctLogProbs = -\ln probs_{y}\in R^{N\times 1}\\ dataLoss=\frac {1}{N} 1^{T}\cdot correctLogProbs\\ regLoss=0.5\cdot reg\cdot ||W||^{2}+0.5\cdot reg\cdot ||W2||^{2}\\ loss = dataLoss+regLoss \]
\(1\) denotes the summation (all-ones) vector \([1,1,...]^{T}\)
\(probs_{y}\) denotes, for each row, the probability assigned to the correct class
```python
# compute the class probabilities
```
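A sketch of the softmax probabilities and the regularized loss, following the formulas above:
```python
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # row-wise softmax, (N, K)
correct_logprobs = -np.log(probs[np.arange(N), y])              # -ln of the true-class probabilities
data_loss = np.sum(correct_logprobs) / N
reg_loss = 0.5 * reg * np.sum(W * W) + 0.5 * reg * np.sum(W2 * W2)
loss = data_loss + reg_loss
```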
Step 6: inside the training loop, backward pass: compute the gradient with respect to the scores (the input to the softmax)
\[ scores_{y}=scores*Y\cdot 1\\ expscores_{y}=exp(scores*Y\cdot 1)\ \ \ \ expscores=exp(scores)\ \ \ \ expscores_{sum}=exp(scores)\cdot 1\\ probs_{y}=\frac {expscores_{y}}{expscores_{sum}}\ \ \ \ probs=\frac {expscores}{expscores_{sum}} \]
\[ dataloss=-\frac {1}{N} 1^{T}\cdot \ln (probs_{y}) =-\frac {1}{N} 1^{T}\cdot \ln \frac {expscores_{y}}{expscores_{sum}}\\ =-\frac {1}{N} 1^{T}\cdot (\ln expscores_{y} -\ln expscores_{sum})\\ =-\frac {1}{N} 1^{T}\cdot (scores*Y\cdot 1 -\ln expscores_{sum}) \]
\[ d(dataloss)=tr(d(-\frac {1}{N} (1^{T}\cdot scores*Y\cdot 1 -1^{T}\cdot \ln expscores_{sum})))\\ =tr(d(-\frac {1}{N} (1^{T}\cdot scores*Y\cdot 1))) - tr(d(-\frac {1}{N} (1^{T}\cdot \ln expscores_{sum}))) \]
\[ tr(d(-\frac {1}{N} (1^{T}\cdot scores*Y\cdot 1)))= tr(-\frac {1}{N} (1^{T}\cdot dscores*Y\cdot 1))\\ =tr(-\frac {1}{N} (dscores^{T}\cdot Y)) =tr(-\frac {1}{N} Y^{T}\cdot dscores) \]
\[ tr(d(-\frac {1}{N} (1^{T}\cdot \ln expscores_{sum}))) =tr(-\frac {1}{N} (1^{T}\cdot expscores_{sum}^{-1}\cdot dexpscores_{sum}))\\ =tr(-\frac {1}{N} \frac {(1^{T}\cdot dexpscores_{sum})}{expscores_{sum}}) =tr(-\frac {1}{N} \frac {(1^{T}\cdot exp(scores)* dscores\cdot 1)}{expscores_{sum}})\\ =tr(-\frac {1}{N} \frac {exp(scores)^{T}\cdot dscores}{expscores_{sum}}) =tr(-\frac {1}{N} (\frac {exp(scores)}{expscores_{sum}})^{T}\cdot dscores) =tr(-\frac {1}{N} probs^{T}\cdot dscores) \]
\[ \Rightarrow d(dataloss)= tr(-\frac {1}{N} Y^{T}\cdot dscores)-tr(-\frac {1}{N} probs^{T}\cdot dscores)\\ =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dscores) \]
So the \(Jacobian\) matrix is \(D_{scores}f(scores)=\frac {1}{N}(probs^{T} - Y^{T})\) and the gradient matrix is \(\bigtriangledown_{scores}f(scores)=\frac {1}{N}(probs - Y)\)
- \(Y\) has size \(N\times K\); in each row only the correct-class position is 1, all other entries are 0
- \(1\) is the summation (all-ones) vector \([1,1,...]^{T}\)
Working out the gradient of the softmax cross-entropy loss with respect to the scores took me a long time; the main difficulty was distinguishing matrix division from element-wise (scalar) division. It still feels easier to first compute the gradient for a single example and then generalize.
```python
# compute the gradient on scores
```
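In numpy this gradient is usually implemented without forming \(Y\) explicitly: subtract 1 at each row's true-class entry and divide by \(N\) (a sketch matching \(\frac{1}{N}(probs - Y)\)):
```python
dscores = probs.copy()
dscores[np.arange(N), y] -= 1  # probs - Y, using the integer labels
dscores /= N                   # the 1/N factor from the data loss
```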
Step 7: inside the training loop, backward pass: compute the gradients of the output-layer weight matrix, the output-layer bias vector, and the hidden-layer output
\[ scores = hiddenLayer\cdot W2+b2\\ dscores = dhiddenLayer\cdot W2 + hiddenLayer\cdot dW2 + db2 \]
\[ d(dataloss) =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dscores)\\ =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot (dhiddenLayer\cdot W2 + hiddenLayer\cdot dW2 + db2))\\ =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dhiddenLayer\cdot W2)\\ +tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot hiddenLayer\cdot dW2)+ tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot db2) \]
Gradient of the output-layer weight matrix
\[ d(dataloss)=tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot hiddenLayer\cdot dW2) \]
\[ D_{W2}f(W2)=\frac {1}{N} (probs^{T} - Y^{T})\cdot hiddenLayer\\ \bigtriangledown_{W2}f(W2)=\frac {1}{N} hiddenLayer^{T}\cdot (probs - Y) \]
Gradient of the output-layer bias vector
\[ d(dataloss)=tr(\frac {1}{N} \sum_{i=1}^{N}(probs_{i}^{T} - Y_{i}^{T})\cdot db2) \]
\[ D_{b2}f(b2)=\frac {1}{N} \sum_{i=1}^{N}(probs_{i}^{T} - Y_{i}^{T})\\ \bigtriangledown_{b2}f(b2)=\frac {1}{N} \sum_{i=1}^{N}(probs_{i} - Y_{i}) \]
Note the dimensions of the bias vector: the per-example bias gradients are summed over the batch.
Gradient of the hidden-layer output
\[ d(dataloss)=tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dhiddenLayer\cdot W2) =tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\cdot dhiddenLayer) \]
\[ D_{hiddenLayer}f(hiddenLayer)=\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\\ \bigtriangledown_{hiddenLayer}f(hiddenLayer)=\frac {1}{N} (probs - Y)\cdot (W2)^{T} \]
```python
# backpropate the gradient to the parameters
```
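A numpy sketch of these three gradients, matching the identifications above (dscores is \(\frac{1}{N}(probs - Y)\) from the previous snippet):
```python
dW2 = hidden_layer.T.dot(dscores)             # (h, K)
db2 = np.sum(dscores, axis=0, keepdims=True)  # (1, K), summed over the batch
dhidden = dscores.dot(W2.T)                   # (N, h), gradient on the hidden-layer output
```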
Step 8: inside the training loop, backward pass: compute the gradient with respect to the hidden-layer pre-activation
\[ hiddenLayer_{in}=X\cdot W+b\\ hiddenLayer = max(0, hiddenLayer_{in})\\ dhiddenLayer = 1(hiddenLayer_{in}\geq 0)* dhiddenLayer_{in} \]
\[ d(dataloss)=tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\cdot dhiddenLayer)\\ =tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\cdot (1(hiddenLayer_{in}\geq 0)* dhiddenLayer_{in}))\\ =tr(\frac {1}{N} (W2\cdot (probs^{T} - Y^{T})) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot dhiddenLayer_{in}) \]
\[ D_{hiddenLayer_{in}}f(hiddenLayer_{in})=\frac {1}{N} (W2\cdot (probs^{T} - Y^{T})) * 1(hiddenLayer_{in}\geq 0)^{T}\\ \bigtriangledown_{hiddenLayer_{in}}f(hiddenLayer_{in})=\frac {1}{N} ((probs - Y)\cdot (W2)^{T})* 1(hiddenLayer_{in}\geq 0) \]
```python
# backprop the ReLU non-linearity
```
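As in Derivation 1, the Hadamard product with the indicator becomes a masking operation in numpy (a sketch continuing the previous snippets; hidden_layer is zero exactly where the pre-activation was non-positive):
```python
dhidden[hidden_layer <= 0] = 0  # zero the gradient where the ReLU was inactive
```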
Step 9: inside the training loop, backward pass: compute the gradients of the hidden-layer weight matrix and bias vector
\[ hiddenLayer_{in}=X\cdot W+b\\ dhiddenLayer_{in}=X\cdot dW + db \]
\[ d(dataloss)=tr(\frac {1}{N} (W2\cdot (probs^{T} - Y^{T})) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot dhiddenLayer_{in})\\ =tr(\frac {1}{N} (W2\cdot (probs^{T} - Y^{T})) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot (X\cdot dW + db)) \]
Gradient of the hidden-layer weight matrix
\[ d(dataloss)=tr(\frac {1}{N} (W2\cdot (probs^{T} - Y^{T})) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot X\cdot dW) \]
\[ D_{W}f(W)=\frac {1}{N} (W2\cdot (probs^{T} - Y^{T})) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot X\\ \bigtriangledown_{W}f(W)=\frac {1}{N} X^{T}\cdot (((probs - Y)\cdot (W2)^{T})* 1(hiddenLayer_{in}\geq 0)) \]
Gradient of the hidden-layer bias vector
\[ d(dataloss)=tr(\frac {1}{N} \sum_{i=1}^{N}(((probs - Y)\cdot (W2)^{T})* 1(hiddenLayer_{in}\geq 0))_{i}^{T}\cdot db) \]
\[ D_{b}f(b)=\frac {1}{N} \sum_{i=1}^{N}(((probs - Y)\cdot (W2)^{T})* 1(hiddenLayer_{in}\geq 0))_{i}^{T}\\ \bigtriangledown_{b}f(b)=\frac {1}{N} \sum_{i=1}^{N}(((probs - Y)\cdot (W2)^{T})* 1(hiddenLayer_{in}\geq 0))_{i} \]
As with \(b2\), note the dimensions: \((\cdot)_{i}\) denotes the \(i\)-th row, and the per-example bias gradients are summed over the batch.
```python
# finally into W,b
```
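The corresponding numpy sketch, continuing the previous snippets:
```python
dW = X.T.dot(dhidden)                        # (D, h)
db = np.sum(dhidden, axis=0, keepdims=True)  # (1, h), summed over the batch
```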
Step 10: inside the training loop, backward pass: add the regularization gradients
\[ regLoss=0.5\cdot reg\cdot ||W||^{2}+0.5\cdot reg\cdot ||W2||^{2}\\ d(regLoss)=tr(reg\cdot W^{T}\cdot dW)+tr(reg\cdot W2^{T}\cdot dW2)\\ \Rightarrow \bigtriangledown_{W}regLoss=reg\cdot W,\ \ \bigtriangledown_{W2}regLoss=reg\cdot W2 \]
```python
# add regularization gradient contribution
```
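A sketch of adding these contributions to the data-loss gradients from the earlier snippets:
```python
dW2 += reg * W2
dW += reg * W
```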
Step 11: inside the training loop, update the weight matrices and bias vectors
```python
# perform a parameter update
```
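A vanilla gradient-descent update using the (assumed) step size from above:
```python
W += -step_size * dW
b += -step_size * db
W2 += -step_size * dW2
b2 += -step_size * db2
```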