The approach recommended in the cs231n course is to first differentiate with respect to individual elements, then generalize step by step to gradients over batched data. References:

Derivatives, Backpropagation, and Vectorization - CS231n

Vector, Matrix, and Tensor Derivatives - CS231n

Backpropagation and Neural Networks - CS231n

[Matrix Calculus] The Mathematical Principles of Gradient Computation in Neural Network Backpropagation

Jacobian Matrices and Gradient Matrices

## Derivation 1

The PyTorch tutorial Learning PyTorch with Examples gives a numpy implementation of a 2-layer neural network:

• Batch size $$N=64$$
• Number of input-layer neurons $$D_{in}=1000$$
• Number of hidden-layer neurons $$H=100$$
• Number of output-layer neurons $$D_{out}=10$$

• Input data $$x\in R^{N\times D_{in}}$$
• Target data $$y\in R^{N\times D_{out}}$$
• Hidden-layer weight matrix $$w1\in R^{D_{in}\times H}$$
• Output-layer weight matrix $$w2\in R^{H\times D_{out}}$$

$h=x\cdot w1\\ h_{relu}=max(0, h)\\ y_{pred}=h_{relu}\cdot w2$
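The forward pass above maps directly onto numpy. A minimal sketch, with random data standing in for the tutorial's and variable names following the equations:

```python
import numpy as np

# Shapes from the tutorial: N=64, D_in=1000, H=100, D_out=10
N, D_in, H, D_out = 64, 1000, 100, 10

rng = np.random.default_rng(0)
x = rng.standard_normal((N, D_in))    # input data x
y = rng.standard_normal((N, D_out))   # target data y
w1 = rng.standard_normal((D_in, H))   # hidden-layer weights
w2 = rng.standard_normal((H, D_out))  # output-layer weights

h = x.dot(w1)                # h = x · w1
h_relu = np.maximum(h, 0)    # h_relu = max(0, h)
y_pred = h_relu.dot(w2)      # y_pred = h_relu · w2
```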

$loss=\begin{Vmatrix} y_{pred}-y \end{Vmatrix}^{2}$

Let $$y_{pred}-y=X$$, where $$X$$ has size $$N\times D_{out}$$; then

$loss=\begin{Vmatrix} X \end{Vmatrix}^{2}= (vec(X))^{T}\cdot vec(X)$

$dloss=d(tr(loss))=tr(dloss)=tr(d((vec(X))^{T}\cdot vec(X) ))\\ =tr(d(vec(X)^{T})\cdot vec(X)+vec(X)^{T}\cdot dvec(X))\\ =tr(d(vec(X)^{T})\cdot vec(X))+tr(vec(X)^{T}\cdot dvec(X))\\ =tr((dvec(X))^{T}\cdot vec(X))+tr(vec(X)^{T}\cdot dvec(X))\\ =tr((vec(X))^{T}\cdot dvec(X))+tr(vec(X)^{T}\cdot dvec(X))\\ =tr(2(vec(X))^{T}\cdot dvec(X))\\ =tr(2X^{T}\cdot dX)$
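The result $$dloss=tr(2X^{T}\cdot dX)$$ means $$\bigtriangledown_{X}loss=2X$$. A quick finite-difference sketch (not part of the tutorial) confirms it numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))

def loss(M):
    # loss = ||M||^2 = vec(M)^T · vec(M)
    return (M ** 2).sum()

grad = 2 * X  # analytic gradient from the derivation: ∇_X loss = 2X

# Central finite differences, one entry at a time
eps = 1e-6
num = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        num[i, j] = (loss(X + E) - loss(X - E)) / (2 * eps)

max_err = np.abs(num - grad).max()
```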

$y_{pred}=h_{relu}\cdot w2 \Rightarrow dy_{pred}=dh_{relu}\cdot w2+h_{relu}\cdot dw2$

$dloss=tr(2X^{T}\cdot dX) =tr(2(y_{pred} - y)^{T}\cdot d((y_{pred} - y))) =tr(2(y_{pred} - y)^{T}\cdot dy_{pred})\\ =tr(2(y_{pred} - y)^{T}\cdot (dh_{relu}\cdot w2+h_{relu}\cdot dw2))\\ =tr(2(y_{pred} - y)^{T}\cdot dh_{relu}\cdot w2)+tr(2(y_{pred} - y)^{T}\cdot h_{relu}\cdot dw2)\\ =tr(w2\cdot 2(y_{pred} - y)^{T}\cdot dh_{relu})+tr(2(y_{pred} - y)^{T}\cdot h_{relu}\cdot dw2)$

The $$dw2$$ term directly gives

$\bigtriangledown_{w2}f(w2)=(2(y_{pred} - y)^{T}\cdot h_{relu})^{T}=(h_{relu})^{T}\cdot 2(y_{pred} - y)$

$h_{relu}=max(0, h) \Rightarrow dh_{relu}=\left\{\begin{matrix} dh & h\geq 0\\ 0 & h < 0 \end{matrix}\right. =1(h\geq 0)*dh$

Keeping only the term in $$dh_{relu}$$:

$dloss=tr(w2\cdot 2(y_{pred} - y)^{T}\cdot dh_{relu})\\ =tr(w2\cdot 2(y_{pred} - y)^{T}\cdot 1(h\geq 0)* dh)\\ =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}\cdot 1(h\geq 0)* dh)\\ =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot dh)$

$\bigtriangledown_{h}f(h)=1(h\geq 0)* (2(y_{pred} - y)\cdot (w2)^{T})$

$h=x\cdot w1 \Rightarrow dh=x\cdot dw1$

$dloss =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot dh)\\ =tr((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot x\cdot dw1)$

$\bigtriangledown_{w1}f(w1)=((2(y_{pred} - y)\cdot (w2)^{T})^{T}* 1(h\geq 0)^{T}\cdot x)^{T}\\ =x^{T}\cdot (1(h\geq 0)* 2(y_{pred} - y)\cdot (w2)^{T})$
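Together with the $$w2$$ gradient from the $$dw2$$ term, $$\bigtriangledown_{w2}=(h_{relu})^{T}\cdot 2(y_{pred}-y)$$, these results reproduce the backward pass of the tutorial's numpy code. A small sketch with reduced shapes, plus a finite-difference spot check on one entry of $$w1$$:

```python
import numpy as np

# Reduced shapes so the finite-difference check is cheap
N, D_in, H, D_out = 8, 10, 6, 4
rng = np.random.default_rng(2)
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

# Forward pass
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)

# Backward pass, following the derivation
grad_y_pred = 2.0 * (y_pred - y)        # 2(y_pred − y)
grad_w2 = h_relu.T.dot(grad_y_pred)     # ∇_w2 = h_relu^T · 2(y_pred − y)
grad_h_relu = grad_y_pred.dot(w2.T)     # 2(y_pred − y) · w2^T
grad_h = grad_h_relu * (h >= 0)         # ∇_h = 1(h ≥ 0) * (2(y_pred − y) · w2^T)
grad_w1 = x.T.dot(grad_h)               # ∇_w1 = x^T · ∇_h

# Spot check ∇_w1 on one entry with central differences
def loss_at(w1_):
    return np.square(np.maximum(x.dot(w1_), 0).dot(w2) - y).sum()

eps = 1e-6
E = np.zeros_like(w1)
E[0, 0] = eps
num = (loss_at(w1 + E) - loss_at(w1 - E)) / (2 * eps)
```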

## Derivation 2

The cs231n course note Putting it together: Minimal Neural Network Case Study implements a 2-layer neural network:

• Batch size $$N=100$$
• Data dimensionality $$D=2$$
• Number of classes $$K=3$$
• Input data $$X\in R^{N\times D}$$
• Output data $$y\in R^{N\times K}$$

• Number of hidden-layer neurons $$h=100$$
• Hidden-layer weight matrix $$W\in R^{D\times h}$$
• Hidden-layer bias vector $$b\in R^{1\times h}$$
• Output-layer weight matrix $$W2\in R^{h\times K}$$
• Output-layer bias vector $$b2\in R^{1\times K}$$

$hiddenLayer = max(X\cdot W+b, 0)\\ scores = hiddenLayer\cdot W2+b2$

$expScores = exp(scores)\\ probs = \frac {expScores}{expScores\cdot 1}\\ correctLogProbs = -\ln probs_{y}\in R^{N\times 1}\\ dataLoss=\frac {1}{N} 1^{T}\cdot correctLogProbs\\ regLoss=0.5\cdot reg\cdot ||W||^{2}+0.5\cdot reg\cdot ||W2||^{2}\\ loss = dataLoss+regLoss$
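In numpy the forward pass and loss look as follows. A sketch with random data; $$Y$$ is the one-hot label matrix used in the derivation below, so $$probs_{y}=(probs*Y)\cdot 1$$ becomes a row-wise sum:

```python
import numpy as np

# Shapes from the case study: N=100, D=2, K=3, h=100
N, D, K, h = 100, 2, 3, 100
reg = 1e-3
rng = np.random.default_rng(3)
X = rng.standard_normal((N, D))
labels = rng.integers(0, K, size=N)
Y = np.eye(K)[labels]                    # one-hot label matrix Y, N×K
W = 0.01 * rng.standard_normal((D, h))
b = np.zeros((1, h))
W2 = 0.01 * rng.standard_normal((h, K))
b2 = np.zeros((1, K))

hidden_layer = np.maximum(X.dot(W) + b, 0)                  # max(X·W + b, 0)
scores = hidden_layer.dot(W2) + b2                          # hiddenLayer·W2 + b2
exp_scores = np.exp(scores)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # expScores / (expScores·1)
correct_logprobs = -np.log((probs * Y).sum(axis=1))         # −ln probs_y
data_loss = correct_logprobs.mean()                         # (1/N)·1^T·correctLogProbs
reg_loss = 0.5 * reg * np.sum(W * W) + 0.5 * reg * np.sum(W2 * W2)
loss = data_loss + reg_loss
```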

$$1$$ denotes the summation vector $$[1,1,...]^{T}$$

$$probs_{y}$$ denotes the probability of the correct class in each row

$scores_{y}=scores*Y\cdot 1\\ expscores_{y}=exp(scores*Y\cdot 1)\ \ \ \ expscores=exp(scores)\ \ \ \ expscores_{sum}=exp(scores)\cdot 1\\ probs_{y}=\frac {expscores_{y}}{expscores_{sum}}\ \ \ \ probs=\frac {expscores}{expscores_{sum}}$

$dataloss=-\frac {1}{N} 1^{T}\cdot \ln (probs_{y}) =-\frac {1}{N} 1^{T}\cdot \ln \frac {expscores_{y}}{expscores_{sum}}\\ =-\frac {1}{N} 1^{T}\cdot (\ln expscores_{y} -\ln expscores_{sum})\\ =-\frac {1}{N} 1^{T}\cdot (scores*Y\cdot 1 -\ln expscores_{sum})$

$d(dataloss)=tr(d(-\frac {1}{N} (1^{T}\cdot scores*Y\cdot 1 -1^{T}\cdot \ln expscores_{sum})))\\ =tr(d(-\frac {1}{N} (1^{T}\cdot scores*Y\cdot 1))) - tr(d(-\frac {1}{N} (1^{T}\cdot \ln expscores_{sum})))$

$tr(d(-\frac {1}{N} (1^{T}\cdot scores*Y\cdot 1)))= tr(-\frac {1}{N} (1^{T}\cdot dscores*Y\cdot 1))\\ =tr(-\frac {1}{N} (dscores^{T}\cdot Y)) =tr(-\frac {1}{N} Y^{T}\cdot dscores)$

$tr(d(-\frac {1}{N} (1^{T}\cdot \ln expscores_{sum}))) =tr(-\frac {1}{N} (1^{T}\cdot expscores_{sum}^{-1}\cdot dexpscores_{sum}))\\ =tr(-\frac {1}{N} \frac {(1^{T}\cdot dexpscores_{sum})}{expscores_{sum}}) =tr(-\frac {1}{N} \frac {(1^{T}\cdot exp(scores)* dscores\cdot 1)}{expscores_{sum}})\\ =tr(-\frac {1}{N} \frac {exp(scores)^{T}\cdot dscores}{expscores_{sum}}) =tr(-\frac {1}{N} (\frac {exp(scores)}{expscores_{sum}})^{T}\cdot dscores) =tr(-\frac {1}{N} probs^{T}\cdot dscores)$

$\Rightarrow d(dataloss)= tr(-\frac {1}{N} Y^{T}\cdot dscores)-tr(-\frac {1}{N} probs^{T}\cdot dscores)\\ =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dscores)$
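The intermediate result $$\bigtriangledown_{scores}dataloss=\frac{1}{N}(probs-Y)$$ is the classic softmax cross-entropy gradient; a small finite-difference sketch verifies it:

```python
import numpy as np

N, K = 5, 3
rng = np.random.default_rng(4)
scores = rng.standard_normal((N, K))
Y = np.eye(K)[rng.integers(0, K, size=N)]  # one-hot labels

def data_loss(s):
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    return -np.log((p * Y).sum(axis=1)).mean()

probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
grad = (probs - Y) / N  # ∇_scores dataloss from the derivation

# Central finite differences over every entry of scores
eps = 1e-6
num = np.zeros_like(scores)
for i in range(N):
    for j in range(K):
        E = np.zeros_like(scores)
        E[i, j] = eps
        num[i, j] = (data_loss(scores + E) - data_loss(scores - E)) / (2 * eps)

max_err = np.abs(num - grad).max()
```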

• $$Y$$ has size $$N\times K$$; in each row, only the correct-class position is 1 and the rest are 0
• $$1$$ is the summation vector $$[1,1,...]^{T}$$

$scores = hiddenLayer\cdot W2+b2\\ dscores = dhiddenLayer\cdot W2 + hiddenLayer\cdot dW2 + db2$

$d(dataloss) =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dscores)\\ =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot (dhiddenLayer\cdot W2 + hiddenLayer\cdot dW2 + db2))\\ =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dhiddenLayer\cdot W2)\\ +tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot hiddenLayer\cdot dW2)+ tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot db2)$

Keeping only the term in $$dW2$$:

$d(dataloss)=tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot hiddenLayer\cdot dW2)$

$D_{W2}f(W2)=\frac {1}{N} (probs^{T} - Y^{T})\cdot hiddenLayer\\ \bigtriangledown_{W2}f(W2)=\frac {1}{N} hiddenLayer^{T}\cdot (probs - Y)$

Since $$b2$$ is broadcast to all $$N$$ rows of $$scores$$, $$db2$$ enters every row, so keeping only the term in $$db2$$ gives a sum over rows:

$d(dataloss)=tr(\frac {1}{N} \sum_{i=1}^{N}(probs_{i}^{T} - Y_{i}^{T})\cdot db2)$

$D_{b2}f(b2)=\frac {1}{N} \sum_{i=1}^{N}(probs_{i}^{T} - Y_{i}^{T})\\ \bigtriangledown_{b2}f(b2)=\frac {1}{N} \sum_{i=1}^{N}(probs_{i} - Y_{i})$

Keeping only the term in $$dhiddenLayer$$ and using the cyclic property of the trace:

$d(dataloss)=tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dhiddenLayer\cdot W2) =tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\cdot dhiddenLayer)$

$D_{hiddenLayer}f(hiddenLayer)=\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\\ \bigtriangledown_{hiddenLayer}f(hiddenLayer)=\frac {1}{N} (probs - Y)\cdot (W2)^{T}$

$hiddenLayer_{in}=X\cdot W+b\\ hiddenLayer = max(0, hiddenLayer_{in})\\ dhiddenLayer = 1(hiddenLayer_{in}\geq 0)* dhiddenLayer_{in}$

$d(dataloss)=tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\cdot dhiddenLayer)\\ =tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T})\cdot 1(hiddenLayer_{in}\geq 0)* dhiddenLayer_{in})\\ =tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T}) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot dhiddenLayer_{in})$

$D_{hiddenLayer_{in}}f(hiddenLayer_{in})=\frac {1}{N} W2\cdot (probs^{T} - Y^{T}) * 1(hiddenLayer_{in}\geq 0)^{T}\\ \bigtriangledown_{hiddenLayer_{in}}f(hiddenLayer_{in})=\frac {1}{N} ((probs - Y)\cdot (W2)^{T})* 1(hiddenLayer_{in}\geq 0)$

$hiddenLayer_{in}=X\cdot W+b\\ dhiddenLayer_{in}=X\cdot dW + db$

$d(dataloss)=tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T}) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot dhiddenLayer_{in})\\ =tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T}) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot (X\cdot dW + db))$

Keeping only the term in $$dW$$:

$d(dataloss)=tr(\frac {1}{N} W2\cdot (probs^{T} - Y^{T}) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot X\cdot dW)$

$D_{W}f(W)=\frac {1}{N} W2\cdot (probs^{T} - Y^{T}) * 1(hiddenLayer_{in}\geq 0)^{T}\cdot X\\ \bigtriangledown_{W}f(W)=\frac {1}{N} X^{T}\cdot (((probs - Y)\cdot (W2)^{T})* 1(hiddenLayer_{in}\geq 0))$

Like $$b2$$, the bias $$b$$ is broadcast to all $$N$$ rows, so keeping only the term in $$db$$ gives a sum over rows:

$d(dataloss)=tr(\frac {1}{N} \sum_{i=1}^{N} W2\cdot (probs_{i}^{T} - Y_{i}^{T}) * 1(hiddenLayer_{in,i}\geq 0)^{T}\cdot db)$

$D_{b}f(b)=\frac {1}{N} \sum_{i=1}^{N} W2\cdot (probs_{i}^{T} - Y_{i}^{T}) * 1(hiddenLayer_{in,i}\geq 0)^{T}\\ \bigtriangledown_{b}f(b)=\frac {1}{N} \sum_{i=1}^{N}((probs_{i} - Y_{i})\cdot (W2)^{T})* 1(hiddenLayer_{in,i}\geq 0)$

$regLoss=0.5\cdot reg\cdot ||W||^{2}+0.5\cdot reg\cdot ||W2||^{2}\\ d(regLoss)=tr(reg\cdot W^{T}\cdot dW)+tr(reg\cdot (W2)^{T}\cdot dW2)\\ \bigtriangledown_{W}regLoss=reg\cdot W\ \ \ \ \bigtriangledown_{W2}regLoss=reg\cdot W2$
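Collecting all the gradients above, including the regularization terms, gives the complete backward pass of the case-study code. A sketch with small random shapes and a finite-difference spot check on one entry of $$W$$:

```python
import numpy as np

# Small shapes so the finite-difference check is cheap
N, D, K, H = 20, 2, 3, 10
reg = 1e-3
rng = np.random.default_rng(5)
X = rng.standard_normal((N, D))
Y = np.eye(K)[rng.integers(0, K, size=N)]  # one-hot label matrix
W = 0.1 * rng.standard_normal((D, H))
b = np.zeros((1, H))
W2 = 0.1 * rng.standard_normal((H, K))
b2 = np.zeros((1, K))

# Forward pass
hidden_in = X.dot(W) + b
hidden = np.maximum(hidden_in, 0)
scores = hidden.dot(W2) + b2
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Backward pass, following the derivation
dscores = (probs - Y) / N                  # ∇_scores dataLoss
dW2 = hidden.T.dot(dscores) + reg * W2     # (1/N)·hiddenLayer^T·(probs − Y) + reg·W2
db2 = dscores.sum(axis=0, keepdims=True)   # (1/N)·Σ_i (probs_i − Y_i)
dhidden = dscores.dot(W2.T)                # (1/N)·(probs − Y)·W2^T
dhidden_in = dhidden * (hidden_in >= 0)    # ReLU mask 1(hiddenLayer_in ≥ 0)
dW = X.T.dot(dhidden_in) + reg * W         # X^T·(...) + reg·W
db = dhidden_in.sum(axis=0, keepdims=True)

# Spot check ∇_W on one entry with central differences
def loss_at(W_):
    hid = np.maximum(X.dot(W_) + b, 0)
    s = hid.dot(W2) + b2
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    return (-np.log((p * Y).sum(axis=1)).mean()
            + 0.5 * reg * np.sum(W_ * W_) + 0.5 * reg * np.sum(W2 * W2))

eps = 1e-6
E = np.zeros_like(W)
E[0, 0] = eps
num = (loss_at(W + E) - loss_at(W - E)) / (2 * eps)
```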