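Convolutional Neural Network Derivation: Batch Image Matrix Computation

An earlier post derived the forward and backward passes of LeNet-5 for a single input image. This post extends that derivation to a batch of images.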

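Notation

  • \(N\) denotes the batch size
  • \(C\) denotes the depth (number of channels)
  • \(H\) denotes the height
  • \(W\) denotes the width
  • \(output^{(x)}\) denotes the output data volume of layer \(x\)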

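Channel conversion

When an image is read with OpenCV or PIL and stored as a numpy ndarray, the three dimensions of the array are height, width, and depth: \([H, W, C]\)

Processing a batch of images amounts to stacking the images along one more dimension (the batch dimension \(N\)). To simplify the computation, each image is first reordered to depth, height, and width: \([C, H, W]\)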
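The conversion above can be sketched with NumPy as follows. This is a minimal illustration, not the original author's code; the helper names `to_chw` and `make_batch` are hypothetical, and grayscale images are assumed to arrive as 2-D arrays.

```python
import numpy as np

def to_chw(img):
    """Reorder one image from [H, W, C] to [C, H, W]."""
    if img.ndim == 2:
        # Assumption: a 2-D array is a grayscale image, so add a depth axis of size 1.
        img = img[:, :, np.newaxis]
    return np.transpose(img, (2, 0, 1))

def make_batch(images):
    """Stack same-sized images into one [N, C, H, W] array."""
    return np.stack([to_chw(im) for im in images], axis=0)
```

For example, two \(32\times 32\) RGB images passed to `make_batch` yield an array of shape `(2, 3, 32, 32)`.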

Network output

For the input layer

\[ X\in R^{N\times C_{0}\times H_{0}\times W_{0}} = R^{N\times 1\times 32\times 32} \]

For convolutional layer \(C1\)

\[ output^{(1)}\in R^{N\times C_{1}\times H_{1}\times W_{1}} = R^{N\times 6\times 28\times 28} \]

For pooling layer \(S2\)

\[ output^{(2)}\in R^{N\times C_{2}\times H_{2}\times W_{2}} = R^{N\times 6\times 14\times 14} \]

For convolutional layer \(C3\)

\[ output^{(3)}\in R^{N\times C_{3}\times H_{3}\times W_{3}} = R^{N\times 16\times 10\times 10} \]

For pooling layer \(S4\)

\[ output^{(4)}\in R^{N\times C_{4}\times H_{4}\times W_{4}} = R^{N\times 16\times 5\times 5} \]

For convolutional layer \(C5\)

\[ output^{(5)}\in R^{N\times C_{5}\times H_{5}\times W_{5}} = R^{N\times 120} \]

For fully connected layer \(F6\)

\[ output^{(6)}\in R^{N\times C_{6}\times H_{6}\times W_{6}} = R^{N\times 84} \]

For output layer \(F7\)

\[ output^{(7)}\in R^{N\times C_{7}\times H_{7}\times W_{7}} = R^{N\times 10} \]
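The shapes listed above can be reproduced with the usual output-size formula \((size - field + 2\cdot pad)/stride + 1\). The sketch below is only a bookkeeping aid; the function names `conv_out` and `lenet5_shapes` and the layer table are illustrative assumptions.

```python
def conv_out(size, field, stride=1, pad=0):
    """Output spatial size of a convolution or pooling window."""
    return (size - field + 2 * pad) // stride + 1

def lenet5_shapes(n):
    """Trace the output shape of every layer for a batch of n 32x32 grayscale images."""
    # (name, output depth, window size, stride) for the layers with spatial extent
    layers = [("C1", 6, 5, 1), ("S2", 6, 2, 2), ("C3", 16, 5, 1),
              ("S4", 16, 2, 2), ("C5", 120, 5, 1)]
    size = 32
    shapes = {"input": (n, 1, size, size)}
    for name, depth, field, stride in layers:
        size = conv_out(size, field, stride)
        shapes[name] = (n, depth, size, size)
    # C5 has 1x1 spatial extent, so it is written as a flat N x 120 matrix.
    shapes["C5"] = (n, 120)
    shapes["F6"] = (n, 84)
    shapes["F7"] = (n, 10)
    return shapes
```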

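Forward computation

Input layer

\[ X\in R^{N\times 1\times 32\times 32} \]

Convolutional layer \(C1\)

There are 6 filters, each with spatial size \(5\times 5\), stride \(1\), and zero padding \(0\)

The output spatial size is \((32-5+2\cdot 0)/1+1=28\)

So each convolution operation acts on a vector of size \(1\cdot 5\cdot 5=25\); a single filter has \(28\cdot 28=784\) local connections per image, and \(N\) images have \(N\cdot 784\) local connections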

\[ a^{(0)}\in R^{(N\cdot 784)\times 25}\\ W^{(1)}\in R^{25\times 6}\\ b^{(1)}\in R^{1\times 6}\\ \Rightarrow z^{(1)}=a^{(0)}\cdot W^{(1)}+b^{(1)} \in R^{(N\cdot 784)\times 6}\\ \Rightarrow y^{(1)}=relu(z^{(1)}) \in R^{(N\cdot 784)\times 6}\\ \]

Reshape \(y^{(1)}\) to obtain the output data volume \(output^{(1)}\in R^{N\times 6\times 28\times 28}\)
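A minimal NumPy sketch of the \(C1\) forward pass as a matrix product is given below. It assumes one particular layout that the text does not fix: rows of \(a^{(0)}\) are ordered by (image, output row, output column) and columns by (channel, kernel row, kernel column). The names `im2col` and `conv_forward` are illustrative.

```python
import numpy as np

def im2col(x, field, stride=1):
    """Unroll every receptive field of x [N, C, H, W] into one row."""
    n, c, h, w = x.shape
    oh = (h - field) // stride + 1
    ow = (w - field) // stride + 1
    cols = np.empty((n, oh, ow, c, field, field), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            hs, ws = i * stride, j * stride
            cols[:, i, j] = x[:, :, hs:hs + field, ws:ws + field]
    return cols.reshape(n * oh * ow, c * field * field)

def conv_forward(x, W, b, field=5, stride=1):
    """Return a (unrolled input), z (pre-activation) and the output volume."""
    n, _, h, w = x.shape
    oh = (h - field) // stride + 1
    ow = (w - field) // stride + 1
    a = im2col(x, field, stride)      # (N*oh*ow, C*field*field)
    z = a @ W + b                     # (N*oh*ow, filters)
    y = np.maximum(z, 0)              # relu
    # Rows are ordered (image, row, col), so reshape and move filters to axis 1.
    out = y.reshape(n, oh, ow, -1).transpose(0, 3, 1, 2)
    return a, z, out
```

With `x` of shape `(N, 1, 32, 32)`, `W` of shape `(25, 6)` and `b` of shape `(1, 6)`, the returned volume has shape `(N, 6, 28, 28)`.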

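Pooling layer \(S2\)

Apply the \(\max\) operation; each filter has spatial size \(2\times 2\) and stride \(2\)

The output spatial size is \((28-2)/2+1=14\)

So each \(\max\) operation acts on a vector of size \(2\cdot 2=4\); each image has \(6\cdot 14\cdot 14=1176\) local connections, and \(N\) images have \(N\cdot 1176\) local connections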

\[ a^{(1)}\in R^{(N\cdot 1176)\times 4}\\ z^{(2)}=\max (a^{(1)})\in R^{(N\cdot 1176)} \]

\(argz^{(2)} = argmax(a^{(1)})\in R^{(N\cdot 1176)}\); each value is the index of the maximum within one local connection

Reshape \(z^{(2)}\) to obtain the output data volume \(output^{(2)}\in R^{N\times 6\times 14\times 14}\)
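The pooling step can be sketched in NumPy as below. The sketch assumes non-overlapping windows (stride equal to the window size) and a height and width divisible by the window size, which holds for \(S2\) and \(S4\). The row order (image, channel, output row, output column) and the names `pool_cols` and `max_pool_forward` are assumptions for illustration.

```python
import numpy as np

def pool_cols(x, field=2):
    """Unroll non-overlapping field x field windows of x [N, C, H, W] into rows."""
    n, c, h, w = x.shape
    oh, ow = h // field, w // field
    cols = x.reshape(n, c, oh, field, ow, field).transpose(0, 1, 2, 4, 3, 5)
    return cols.reshape(n * c * oh * ow, field * field)

def max_pool_forward(x, field=2):
    """Return a (window rows), z (row maxima), arg (argmax per row) and the output volume."""
    n, c, h, w = x.shape
    a = pool_cols(x, field)
    z = a.max(axis=1)
    arg = a.argmax(axis=1)   # kept for the backward pass
    out = z.reshape(n, c, h // field, w // field)
    return a, z, arg, out
```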

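Convolutional layer \(C3\)

There are 16 filters, each with spatial size \(5\times 5\), stride \(1\), and zero padding \(0\)

The output spatial size is \((14-5+2\cdot 0)/1+1=10\)

So each convolution operation acts on a vector of size \(5\cdot 5\cdot 6=150\); a single filter has \(10\cdot 10=100\) local connections per image, and \(N\) images have \(N\cdot 100\) local connections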

\[ a^{(2)}\in R^{(N\cdot 100)\times 150}\\ W^{(3)}\in R^{150\times 16}\\ b^{(3)}\in R^{1\times 16}\\ \Rightarrow z^{(3)}=a^{(2)}\cdot W^{(3)}+b^{(3)} \in R^{(N\cdot 100)\times 16}\\ \Rightarrow y^{(3)}=relu(z^{(3)}) \in R^{(N\cdot 100)\times 16}\\ \]

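Reshape \(y^{(3)}\) to obtain the output data volume \(output^{(3)}\in R^{N\times 16\times 10\times 10}\)

Pooling layer \(S4\)

Apply the \(\max\) operation; each filter has spatial size \(2\times 2\) and stride \(2\)

The output spatial size is \((10-2)/2+1=5\)

So each \(\max\) operation acts on a vector of size \(2\cdot 2=4\); each image has \(16\cdot 5\cdot 5=400\) local connections, and \(N\) images have \(N\cdot 400\) local connections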

\[ a^{(3)}\in R^{(N\cdot 400)\times 4}\\ z^{(4)}=\max (a^{(3)})\in R^{(N\cdot 400)} \]

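\(argz^{(4)} = argmax(a^{(3)})\in R^{(N\cdot 400)}\); each value is the index of the maximum within one local connection

Reshape \(z^{(4)}\) to obtain the output data volume \(output^{(4)}\in R^{N\times 16\times 5\times 5}\)

Convolutional layer \(C5\)

There are 120 filters, each with spatial size \(5\times 5\), stride \(1\), and zero padding \(0\)

The output spatial size is \((5-5+2\cdot 0)/1+1=1\)

So each convolution operation acts on a vector of size \(5\cdot 5\cdot 16=400\); a single filter has \(1\cdot 1=1\) local connection per image, and \(N\) images have \(N\cdot 1\) local connections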

\[ a^{(4)}\in R^{N\times 400}\\ W^{(5)}\in R^{400\times 120}\\ b^{(5)}\in R^{1\times 120}\\ \Rightarrow z^{(5)}=a^{(4)}\cdot W^{(5)}+b^{(5)}\in R^{N\times 120}\\ \Rightarrow y^{(5)}=relu(z^{(5)})\in R^{N\times 120}\\ \]

Output data volume \(output^{(5)}\in R^{N\times 120}\)

Fully connected layer \(F6\)

The number of neurons is \(84\)

\[ a^{(5)}=y^{(5)}\in R^{N\times 120}\\ W^{(6)}\in R^{120\times 84}\\ b^{(6)}\in R^{1\times 84}\\ \Rightarrow z^{(6)}=a^{(5)}\cdot W^{(6)}+b^{(6)}\in R^{N\times 84}\\ \Rightarrow y^{(6)}=relu(z^{(6)})\in R^{N\times 84}\\ \]

Output data volume \(output^{(6)}\in R^{N\times 84}\)

Output layer \(F7\)

The number of neurons is \(10\)

\[ a^{(6)}=y^{(6)}\in R^{N\times 84}\\ W^{(7)}\in R^{84\times 10}\\ b^{(7)}\in R^{1\times 10}\\ \Rightarrow z^{(7)}=a^{(6)}\cdot W^{(7)}+b^{(7)}\in R^{N\times 10}\\ \]

Output data volume \(output^{(7)}\in R^{N\times 10}\)
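Because the \(C5\) filters cover the whole \(5\times 5\) activation map, layers \(C5\), \(F6\) and \(F7\) all reduce to plain matrix products. A minimal sketch under that reading follows; the names `dense` and `head_forward` are illustrative, and flattening \(output^{(4)}\) in (channel, row, column) order is an assumed layout.

```python
import numpy as np

def dense(a, W, b, relu=True):
    """z = a W + b, optionally followed by relu."""
    z = a @ W + b
    y = np.maximum(z, 0) if relu else z
    return z, y

def head_forward(output4, W5, b5, W6, b6, W7, b7):
    """Forward pass through C5, F6 and F7 for output4 of shape [N, 16, 5, 5]."""
    n = output4.shape[0]
    a4 = output4.reshape(n, -1)              # (N, 400)
    z5, y5 = dense(a4, W5, b5)               # (N, 120)
    z6, y6 = dense(y5, W6, b6)               # (N, 84)
    z7, _ = dense(y6, W7, b7, relu=False)    # (N, 10), no activation on the scores
    return a4, y5, y6, z7
```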

Classification probabilities

\[ probs=h(z^{(7)})=\frac {exp(z^{(7)})}{exp(z^{(7)})\cdot A\cdot B^{T}} \]

\(A\in R^{10\times 1}\) and \(B\in R^{10\times 1}\) are both all-ones vectors

Loss

\[ dataLoss = -\frac {1}{N} 1^{T}\cdot \ln \frac {exp(z^{(7)}* Y\cdot A)}{exp(z^{(7)})\cdot A} \]

\[ regLoss = 0.5\cdot reg\cdot (||W^{(1)}||^{2} + ||W^{(3)}||^{2} + ||W^{(5)}||^{2} + ||W^{(6)}||^{2} + ||W^{(7)}||^{2}) \]

\[ J(z^{(7)})=dataLoss + regLoss \]

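\(Y\in R^{N\times 10}\) is the one-hot label matrix: in each row only the entry of the correct class is \(1\), and all other entries are \(0\)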
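The probability and loss formulas above can be sketched as follows. Subtracting the row maximum before exponentiating is an added numerical-stability step that is not in the formulas; it leaves the probabilities unchanged. The function names are illustrative.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax of the scores z [N, classes]."""
    shifted = z - z.max(axis=1, keepdims=True)   # stability shift (assumption, not in the text)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def total_loss(z, Y, weights, reg):
    """Cross-entropy data loss plus L2 regularization over the weight matrices."""
    probs = softmax(z)
    # probs * Y keeps only the probability of the correct class in each row.
    data_loss = -np.mean(np.log(np.sum(probs * Y, axis=1)))
    reg_loss = 0.5 * reg * sum(np.sum(W * W) for W in weights)
    return data_loss + reg_loss, probs
```

With all-zero scores every class has probability \(1/10\), so the data loss equals \(\ln 10\).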

Backpropagation

Output layer \(F7\)

Gradient of the input vector \(z^{(7)}\)

\[ d(dataloss)=d(-\frac {1}{N} 1^{T}\cdot \ln \frac {exp(z^{(7)}* Y\cdot A)}{exp(z^{(7)})\cdot A}) =tr(\frac {1}{N} (probs^{T} - Y^{T})\cdot dz^{(7)}) \]

\[ \Rightarrow D_{z^{(7)}}f(z^{(7)})=\frac {1}{N} (probs^{T} - Y^{T})\in R^{10\times N}\\ \Rightarrow \bigtriangledown_{z^{(7)}}f(z^{(7)})=\frac {1}{N} (probs - Y)\in R^{N\times 10} \]
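The result \(\frac{1}{N}(probs - Y)\) can be checked against a central finite-difference estimate. The sketch below is only a sanity check under the softmax cross-entropy loss defined earlier; `numeric_grad` is a hypothetical helper, not part of any library.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def data_loss(z, Y):
    return -np.mean(np.log(np.sum(softmax(z) * Y, axis=1)))

def grad_z(z, Y):
    """Analytic gradient of the data loss with respect to the scores z."""
    return (softmax(z) - Y) / z.shape[0]

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of the scalar function f at x."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        fp = f(x)
        x[idx] = old - eps
        fm = f(x)
        x[idx] = old          # restore the original value
        g[idx] = (fp - fm) / (2 * eps)
    return g
```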

Other gradients

\[ z^{(7)}=a^{(6)}\cdot W^{(7)}+b^{(7)} \\ dz^{(7)}=da^{(6)}\cdot W^{(7)} + a^{(6)}\cdot dW^{(7)} + db^{(7)} \]

\[ d(dataloss)=tr(D_{z^{(7)}}f(z^{(7)})\cdot dz^{(7)}) =tr(D_{z^{(7)}}f(z^{(7)})\cdot (da^{(6)}\cdot W^{(7)} + a^{(6)}\cdot dW^{(7)} + db^{(7)}))\\ =tr(D_{z^{(7)}}f(z^{(7)})\cdot da^{(6)}\cdot W^{(7)}) +tr(D_{z^{(7)}}f(z^{(7)})\cdot a^{(6)}\cdot dW^{(7)}) +tr(D_{z^{(7)}}f(z^{(7)})\cdot db^{(7)}) \]

Gradient of the weight matrix \(W^{(7)}\)

\[ d(dataloss)=tr(D_{z^{(7)}}f(z^{(7)})\cdot a^{(6)}\cdot dW^{(7)}) \]

\[ \Rightarrow D_{W^{(7)}}f(W^{(7)})=D_{z^{(7)}}f(z^{(7)})\cdot a^{(6)}\\ \Rightarrow \bigtriangledown_{W^{(7)}}f(W^{(7)}) =(a^{(6)})^{T}\cdot \bigtriangledown_{z^{(7)}}f(z^{(7)}) =R^{84\times N}\cdot R^{N\times 10} =R^{84\times 10} \]

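Gradient of the bias vector \(b^{(7)}\)

The bias \(b^{(7)}\in R^{1\times 10}\) is broadcast over the \(M\) rows of \(z^{(7)}\), so \(db^{(7)}\) inside \(dz^{(7)}\) stands for \(1_{M}\cdot db^{(7)}\), where \(1_{M}\in R^{M\times 1}\) is the all-ones vector

\[ d(dataloss)=tr(D_{z^{(7)}}f(z^{(7)})\cdot 1_{M}\cdot db^{(7)}) \]

\[ \Rightarrow D_{b^{(7)}}f(b^{(7)})=D_{z^{(7)}}f(z^{(7)})\cdot 1_{M}\\ \Rightarrow \bigtriangledown_{b^{(7)}}f(b^{(7)})=1_{M}^{T}\cdot \bigtriangledown_{z^{(7)}}f(z^{(7)})=\sum_{i=1}^{M} \bigtriangledown_{z^{(7)}}f(z^{(7)})_{i}\in R^{1\times 10} \]

\(M=N\) is the number of rows of \(dz^{(7)}\), and \(\bigtriangledown_{z^{(7)}}f(z^{(7)})_{i}\) is the \(i\)-th row. No extra averaging is needed, because the factor \(\frac {1}{N}\) is already contained in \(\bigtriangledown_{z^{(7)}}f(z^{(7)})\)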

Gradient of the previous layer's output vector \(a^{(6)}\)

\[ d(dataloss)=tr(D_{z^{(7)}}f(z^{(7)})\cdot da^{(6)}\cdot W^{(7)}) =tr(W^{(7)}\cdot D_{z^{(7)}}f(z^{(7)})\cdot da^{(6)}) \]

\[ \Rightarrow D_{a^{(6)}}f(a^{(6)})=W^{(7)}\cdot D_{z^{(7)}}f(z^{(7)})\\ \Rightarrow \bigtriangledown_{a^{(6)}}f(a^{(6)})=\bigtriangledown_{z^{(7)}}f(z^{(7)})\cdot (W^{(7)})^{T}=R^{N\times 10}\cdot R^{10\times 84}=R^{N\times 84} \]
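The three gradients of a layer \(z=a\cdot W+b\) can be sketched as one small function. The bias gradient is taken as the column sum of \(\bigtriangledown_{z}\), which is what the chain rule gives for a bias broadcast over the rows. The name `fc_backward` is illustrative.

```python
import numpy as np

def fc_backward(dz, a, W):
    """Backward pass of z = a @ W + b.

    dz : gradient of the loss w.r.t. z, shape (M, out)
    a  : layer input, shape (M, in)
    W  : weight matrix, shape (in, out)
    """
    dW = a.T @ dz                        # (in, out)
    db = dz.sum(axis=0, keepdims=True)   # (1, out), sum over the M broadcast rows
    da = dz @ W.T                        # (M, in)
    return dW, db, da
```

The same function applies unchanged to \(F6\), \(C5\), \(C3\) and \(C1\), since each of them is written as a matrix product in this derivation.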

Fully connected layer \(F6\)

Gradient of the input vector \(z^{(6)}\)

\[ a^{(6)}=y^{(6)}=relu(z^{(6)})\\ da^{(6)}=1(z^{(6)}\geq 0)* dz^{(6)} \]

\[ d(dataloss)=tr(D_{a^{(6)}}f(a^{(6)}) da^{(6)})=tr(D_{a^{(6)}}f(a^{(6)})\cdot (1(z^{(6)}\geq 0)* dz^{(6)}))\\ =tr(D_{a^{(6)}}f(a^{(6)})* 1(z^{(6)}\geq 0)^{T}\cdot dz^{(6)}) \]

\[ \Rightarrow D_{z^{(6)}}f(z^{(6)})=D_{a^{(6)}}f(a^{(6)})* 1(z^{(6)}\geq 0)^{T}\\ \Rightarrow \bigtriangledown_{z^{(6)}}f(z^{(6)})=\bigtriangledown_{a^{(6)}}f(a^{(6)})* 1(z^{(6)}\geq 0)\in R^{N\times 84} \]

Other gradients

\[ z^{(6)}=a^{(5)}\cdot W^{(6)}+b^{(6)} \\ dz^{(6)}=da^{(5)}\cdot W^{(6)}+a^{(5)}\cdot dW^{(6)}+db^{(6)} \]

\[ d(dataloss)=tr(D_{z^{(6)}}f(z^{(6)})\cdot dz^{(6)})\\ =tr(D_{z^{(6)}}f(z^{(6)})\cdot (da^{(5)}\cdot W^{(6)} + a^{(5)}\cdot dW^{(6)} + db^{(6)}))\\ =tr(D_{z^{(6)}}f(z^{(6)})\cdot da^{(5)}\cdot W^{(6)}) +tr(D_{z^{(6)}}f(z^{(6)})\cdot a^{(5)}\cdot dW^{(6)}) +tr(D_{z^{(6)}}f(z^{(6)})\cdot db^{(6)}) \]

Gradient of the weight matrix \(W^{(6)}\)

\[ d(dataloss)=tr(D_{z^{(6)}}f(z^{(6)})\cdot a^{(5)}\cdot dW^{(6)}) \]

\[ \Rightarrow D_{W^{(6)}}f(W^{(6)})=D_{z^{(6)}}f(z^{(6)})\cdot a^{(5)}\\ \Rightarrow \bigtriangledown_{W^{(6)}}f(W^{(6)}) =(a^{(5)})^{T}\cdot \bigtriangledown_{z^{(6)}}f(z^{(6)}) =R^{120\times N}\cdot R^{N\times 84} =R^{120\times 84} \]

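Gradient of the bias vector \(b^{(6)}\)

The bias \(b^{(6)}\in R^{1\times 84}\) is broadcast over the \(M\) rows of \(z^{(6)}\), so \(db^{(6)}\) inside \(dz^{(6)}\) stands for \(1_{M}\cdot db^{(6)}\), where \(1_{M}\in R^{M\times 1}\) is the all-ones vector

\[ d(dataloss)=tr(D_{z^{(6)}}f(z^{(6)})\cdot 1_{M}\cdot db^{(6)}) \]

\[ \Rightarrow D_{b^{(6)}}f(b^{(6)}) =D_{z^{(6)}}f(z^{(6)})\cdot 1_{M}\\ \Rightarrow \bigtriangledown_{b^{(6)}}f(b^{(6)}) =1_{M}^{T}\cdot \bigtriangledown_{z^{(6)}}f(z^{(6)}) =\sum_{i=1}^{M} \bigtriangledown_{z^{(6)}}f(z^{(6)})_{i} \in R^{1\times 84} \]

\(M=N\) is the number of rows of \(dz^{(6)}\), and \(\bigtriangledown_{z^{(6)}}f(z^{(6)})_{i}\) is the \(i\)-th row. No extra averaging is needed, because the factor \(\frac {1}{N}\) is already carried by \(\bigtriangledown_{z^{(6)}}f(z^{(6)})\)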

Gradient of the previous layer's output vector \(a^{(5)}\)

\[ d(dataloss)=tr(D_{z^{(6)}}f(z^{(6)})\cdot da^{(5)}\cdot W^{(6)}) =tr(W^{(6)}\cdot D_{z^{(6)}}f(z^{(6)})\cdot da^{(5)}) \]

\[ \Rightarrow D_{a^{(5)}}f(a^{(5)}) =W^{(6)}\cdot D_{z^{(6)}}f(z^{(6)})\\ \Rightarrow \bigtriangledown_{a^{(5)}}f(a^{(5)}) =\bigtriangledown_{z^{(6)}}f(z^{(6)})\cdot (W^{(6)})^{T} =R^{N\times 84}\cdot R^{84\times 120} =R^{N\times 120} \]

Convolutional layer \(C5\)

Gradient of the input vector \(z^{(5)}\)

\[ a^{(5)}=y^{(5)}=relu(z^{(5)})\\ da^{(5)}=1(z^{(5)}\geq 0)* dz^{(5)}\\ \]

\[ d(dataloss)=tr(D_{a^{(5)}}f(a^{(5)}) da^{(5)})=tr(D_{a^{(5)}}f(a^{(5)})\cdot (1(z^{(5)}\geq 0)* dz^{(5)}))\\ =tr(D_{a^{(5)}}f(a^{(5)})* 1(z^{(5)}\geq 0)^{T}\cdot dz^{(5)}) \]

\[ \Rightarrow D_{z^{(5)}}f(z^{(5)}) =D_{a^{(5)}}f(a^{(5)})* 1(z^{(5)}\geq 0)^{T}\\ \Rightarrow \bigtriangledown_{z^{(5)}}f(z^{(5)}) =\bigtriangledown_{a^{(5)}}f(a^{(5)})* 1(z^{(5)}\geq 0) \in R^{N\times 120} \]

Other gradients

\[ z^{(5)}=a^{(4)}\cdot W^{(5)}+b^{(5)} \\ dz^{(5)}=da^{(4)}\cdot W^{(5)}+a^{(4)}\cdot dW^{(5)}+db^{(5)} \]

\[ d(dataloss)=tr(D_{z^{(5)}}f(z^{(5)})\cdot dz^{(5)}) =tr(D_{z^{(5)}}f(z^{(5)})\cdot (da^{(4)}\cdot W^{(5)} + a^{(4)}\cdot dW^{(5)} + db^{(5)}))\\ =tr(D_{z^{(5)}}f(z^{(5)})\cdot da^{(4)}\cdot W^{(5)}) +tr(D_{z^{(5)}}f(z^{(5)})\cdot a^{(4)}\cdot dW^{(5)}) +tr(D_{z^{(5)}}f(z^{(5)})\cdot db^{(5)}) \]

Gradient of the weight matrix \(W^{(5)}\)

\[ d(dataloss)=tr(D_{z^{(5)}}f(z^{(5)})\cdot a^{(4)}\cdot dW^{(5)}) \]

\[ \Rightarrow D_{W^{(5)}}f(W^{(5)}) =D_{z^{(5)}}f(z^{(5)})\cdot a^{(4)}\\ \Rightarrow \bigtriangledown_{W^{(5)}}f(W^{(5)}) =(a^{(4)})^{T}\cdot \bigtriangledown_{z^{(5)}}f(z^{(5)}) =R^{400\times N}\cdot R^{N\times 120} =R^{400\times 120} \]

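Gradient of the bias vector \(b^{(5)}\)

The bias \(b^{(5)}\in R^{1\times 120}\) is broadcast over the \(M\) rows of \(z^{(5)}\), so \(db^{(5)}\) inside \(dz^{(5)}\) stands for \(1_{M}\cdot db^{(5)}\), where \(1_{M}\in R^{M\times 1}\) is the all-ones vector

\[ d(dataloss)=tr(D_{z^{(5)}}f(z^{(5)})\cdot 1_{M}\cdot db^{(5)}) \]

\[ \Rightarrow D_{b^{(5)}}f(b^{(5)})=D_{z^{(5)}}f(z^{(5)})\cdot 1_{M}\\ \Rightarrow \bigtriangledown_{b^{(5)}}f(b^{(5)}) =1_{M}^{T}\cdot \bigtriangledown_{z^{(5)}}f(z^{(5)}) =\sum_{i=1}^{M} \bigtriangledown_{z^{(5)}}f(z^{(5)})_{i} \in R^{1\times 120} \]

\(M=N\) is the number of rows of \(dz^{(5)}\), and \(\bigtriangledown_{z^{(5)}}f(z^{(5)})_{i}\) is the \(i\)-th row. No extra averaging is needed, because the factor \(\frac {1}{N}\) is already carried by \(\bigtriangledown_{z^{(5)}}f(z^{(5)})\)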

Gradient of the previous layer's output vector \(a^{(4)}\)

\[ d(dataloss)=tr(D_{z^{(5)}}f(z^{(5)})\cdot da^{(4)}\cdot W^{(5)}) =tr(W^{(5)}\cdot D_{z^{(5)}}f(z^{(5)})\cdot da^{(4)}) \]

\[ \Rightarrow D_{a^{(4)}}f(a^{(4)}) =W^{(5)}\cdot D_{z^{(5)}}f(z^{(5)})\\ \Rightarrow \bigtriangledown_{a^{(4)}}f(a^{(4)}) =\bigtriangledown_{z^{(5)}}f(z^{(5)})\cdot (W^{(5)})^{T} =R^{N\times 120}\cdot R^{120\times 400} =R^{N\times 400} \]

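Pooling layer \(S4\)

Computing the gradient of \(z^{(4)}\)

Since \(a^{(4)}\in R^{N\times 400}\), \(output^{(4)}\in R^{N\times 16\times 5\times 5}\), and \(z^{(4)}\in R^{(N\cdot 400)}\), and the filters of convolutional layer \(C5\) have spatial size \(5\times 5\), the same as the activation map, the gradient of \(z^{(4)}\) is the vectorization of the gradient matrix of \(a^{(4)}\)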

\[ dz^{(4)} = dvec(a^{(4)}) \]

\[ \Rightarrow D_{z^{(4)}}f(z^{(4)})=vec(D_{a^{(4)}}f(a^{(4)}))\\ \Rightarrow \bigtriangledown_{z^{(4)}}f(z^{(4)})=vec(\bigtriangledown_{a^{(4)}}f(a^{(4)})) \]

Gradient of the previous layer's output vector \(a^{(3)}\)

\[ z^{(4)}=\max (a^{(3)})\\ dz^{(4)}=1(a^{(3)}\ is\ the\ max)* da^{(3)} \]

Using \(argz^{(4)}\), the position of the maximum in each row receives the corresponding gradient of \(z^{(4)}\), and all other positions receive gradient \(0\)

\[ d(dataloss)=tr(D_{z^{(4)}}f(z^{(4)}) dz^{(4)})\\ =tr(D_{z^{(4)}}f(z^{(4)})\cdot 1(a^{(3)}\ is\ the\ max)* da^{(3)}) =tr(D_{z^{(4)}}f(z^{(4)})* 1(a^{(3)}\ is\ the\ max)^{T}\cdot da^{(3)}) \]

\[ \Rightarrow D_{a^{(3)}}f(a^{(3)})=D_{z^{(4)}}f(z^{(4)})* 1(a^{(3)}\ is\ the\ max)^{T}\\ \Rightarrow \bigtriangledown_{a^{(3)}}f(a^{(3)}) =\bigtriangledown_{z^{(4)}}f(z^{(4)})* 1(a^{(3)}\ is\ the\ max) \in R^{(N\cdot 400)\times 4} \]
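Routing the pooled gradient back through the stored argmax indices can be sketched in a few lines. The function name `max_pool_backward` is illustrative; `arg` is assumed to be the per-row argmax saved during the forward pass.

```python
import numpy as np

def max_pool_backward(dz, arg, window=4):
    """Scatter dz (one value per pooling window) to the position of each maximum.

    dz  : gradient w.r.t. the pooled values, shape (M,)
    arg : index of the maximum inside each window, shape (M,)
    """
    m = dz.shape[0]
    da = np.zeros((m, window), dtype=dz.dtype)
    da[np.arange(m), arg] = dz   # every non-maximum position keeps gradient 0
    return da
```

For \(S4\) the result has shape \((N\cdot 400)\times 4\), and for \(S2\) it has shape \((N\cdot 1176)\times 4\).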

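Convolutional layer \(C3\)

Computing the gradient of \(y^{(3)}\)

Since \(a^{(3)}\in R^{(N\cdot 400)\times 4}\), \(output^{(3)}\in R^{N\times 16\times 10\times 10}\), and \(y^{(3)}\in R^{(N\cdot 100)\times 16}\), and pooling layer \(S4\) uses filters of spatial size \(2\times 2\) with stride \(2\), the gradient of \(a^{(3)}\) is rearranged, following the sampling order, back into the gradient of \(output^{(3)}\) and then reshaped into the gradient of \(y^{(3)}\)

Gradient of the input vector \(z^{(3)}\)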

\[ y^{(3)} = relu(z^{(3)})\\ dy^{(3)} = 1(z^{(3)} \geq 0)*dz^{(3)} \]

\[ d(dataloss) =tr(D_{y^{(3)}}f(y^{(3)})\cdot dy^{(3)})\\ =tr(D_{y^{(3)}}f(y^{(3)})\cdot (1(z^{(3)} \geq 0)*dz^{(3)}))\\ =tr(D_{y^{(3)}}f(y^{(3)})* 1(z^{(3)} \geq 0)^{T}\cdot dz^{(3)}) \]

\[ \Rightarrow D_{z^{(3)}}f(z^{(3)}) =D_{y^{(3)}}f(y^{(3)})* 1(z^{(3)} \geq 0)^{T}\\ \Rightarrow \bigtriangledown_{z^{(3)}}f(z^{(3)}) =\bigtriangledown_{y^{(3)}}f(y^{(3)})* 1(z^{(3)} \geq 0) \in R^{(N\cdot 100)\times 16} \]

Other gradients

\[ z^{(3)}=a^{(2)}\cdot W^{(3)}+b^{(3)} \\ dz^{(3)}=da^{(2)}\cdot W^{(3)}+a^{(2)}\cdot dW^{(3)}+db^{(3)} \]

\[ d(dataloss)=tr(D_{z^{(3)}}f(z^{(3)})\cdot dz^{(3)})\\ =tr(D_{z^{(3)}}f(z^{(3)})\cdot (da^{(2)}\cdot W^{(3)} + a^{(2)}\cdot dW^{(3)} + db^{(3)}))\\ =tr(D_{z^{(3)}}f(z^{(3)})\cdot da^{(2)}\cdot W^{(3)}) +tr(D_{z^{(3)}}f(z^{(3)})\cdot a^{(2)}\cdot dW^{(3)}) +tr(D_{z^{(3)}}f(z^{(3)})\cdot db^{(3)}) \]

Gradient of the weight matrix \(W^{(3)}\)

\[ d(dataloss)=tr(D_{z^{(3)}}f(z^{(3)})\cdot a^{(2)}\cdot dW^{(3)}) \]

\[ \Rightarrow D_{W^{(3)}}f(W^{(3)}) =D_{z^{(3)}}f(z^{(3)})\cdot a^{(2)}\\ \Rightarrow \bigtriangledown_{W^{(3)}}f(W^{(3)}) =(a^{(2)})^{T}\cdot \bigtriangledown_{z^{(3)}}f(z^{(3)}) =R^{150\times (N\cdot 100)}\cdot R^{(N\cdot 100)\times 16} =R^{150\times 16} \]

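Gradient of the bias vector \(b^{(3)}\)

The bias \(b^{(3)}\in R^{1\times 16}\) is broadcast over the \(M\) rows of \(z^{(3)}\), so \(db^{(3)}\) inside \(dz^{(3)}\) stands for \(1_{M}\cdot db^{(3)}\), where \(1_{M}\in R^{M\times 1}\) is the all-ones vector

\[ d(dataloss)=tr(D_{z^{(3)}}f(z^{(3)})\cdot 1_{M}\cdot db^{(3)}) \]

\[ \Rightarrow D_{b^{(3)}}f(b^{(3)}) =D_{z^{(3)}}f(z^{(3)})\cdot 1_{M}\\ \Rightarrow \bigtriangledown_{b^{(3)}}f(b^{(3)}) =1_{M}^{T}\cdot \bigtriangledown_{z^{(3)}}f(z^{(3)}) =\sum_{i=1}^{M} \bigtriangledown_{z^{(3)}}f(z^{(3)})_{i} \in R^{1\times 16} \]

\(M=N\cdot 100\) is the number of rows of \(dz^{(3)}\), and \(\bigtriangledown_{z^{(3)}}f(z^{(3)})_{i}\) is the \(i\)-th row. No extra averaging is needed, because the factor \(\frac {1}{N}\) is already carried by \(\bigtriangledown_{z^{(3)}}f(z^{(3)})\)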

Gradient of the previous layer's output vector \(a^{(2)}\)

\[ d(dataloss)=tr(D_{z^{(3)}}f(z^{(3)})\cdot da^{(2)}\cdot W^{(3)}) =tr(W^{(3)}\cdot D_{z^{(3)}}f(z^{(3)})\cdot da^{(2)}) \]

\[ \Rightarrow D_{a^{(2)}}f(a^{(2)}) =W^{(3)}\cdot D_{z^{(3)}}f(z^{(3)})\\ \Rightarrow \bigtriangledown_{a^{(2)}}f(a^{(2)}) =\bigtriangledown_{z^{(3)}}f(z^{(3)})\cdot (W^{(3)})^{T} =R^{(N\cdot 100)\times 16}\cdot R^{16\times 150} =R^{(N\cdot 100)\times 150} \]

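Pooling layer \(S2\)

Computing the gradient of \(z^{(2)}\)

Since \(a^{(2)}\in R^{(N\cdot 100)\times 150}\), \(output^{(2)}\in R^{N\times 6\times 14\times 14}\), and \(z^{(2)}\in R^{(N\cdot 1176)}\), and the filters of convolutional layer \(C3\) have spatial size \(5\times 5\) with stride \(1\), the receptive fields overlap. The gradient of \(a^{(2)}\) is therefore mapped back, following the sampling order, into the gradient of \(output^{(2)}\) (summing the contributions at positions covered by several receptive fields) and then reshaped into the gradient of \(z^{(2)}\)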
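Mapping the unrolled gradient back to the data volume is often called col2im. The sketch below assumes the same layout as a typical im2col: rows ordered by (image, output row, output column) and columns by (channel, kernel row, kernel column). That layout and the name `col2im` are assumptions, since the text does not fix them.

```python
import numpy as np

def col2im(dcols, x_shape, field, stride=1):
    """Accumulate unrolled gradients dcols back into a volume of shape x_shape."""
    n, c, h, w = x_shape
    oh = (h - field) // stride + 1
    ow = (w - field) // stride + 1
    d = dcols.reshape(n, oh, ow, c, field, field)
    dx = np.zeros(x_shape, dtype=dcols.dtype)
    for i in range(oh):
        for j in range(ow):
            hs, ws = i * stride, j * stride
            # += because overlapping receptive fields contribute to the same pixel
            dx[:, :, hs:hs + field, ws:ws + field] += d[:, i, j]
    return dx
```

Feeding an all-ones matrix shows how often each input position is reused: a corner pixel of the \(14\times 14\) map is covered by one receptive field, while a central pixel is covered by \(25\).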

Gradient of the previous layer's output vector \(a^{(1)}\)

\[ z^{(2)}=\max (a^{(1)})\\ dz^{(2)}=1(a^{(1)}\ is\ the\ max)* da^{(1)} \]

Using \(argz^{(2)}\), the position of the maximum in each row receives the corresponding gradient of \(z^{(2)}\), and all other positions receive gradient \(0\)

\[ d(dataloss)=tr(D_{z^{(2)}}f(z^{(2)}) dz^{(2)})\\ =tr(D_{z^{(2)}}f(z^{(2)})\cdot 1(a^{(1)}\ is\ the\ max)* da^{(1)}) =tr(D_{z^{(2)}}f(z^{(2)})* 1(a^{(1)}\ is\ the\ max)^{T}\cdot da^{(1)}) \]

\[ \Rightarrow D_{a^{(1)}}f(a^{(1)})=D_{z^{(2)}}f(z^{(2)})* 1(a^{(1)}\ is\ the\ max)^{T}\\ \Rightarrow \bigtriangledown_{a^{(1)}}f(a^{(1)}) =\bigtriangledown_{z^{(2)}}f(z^{(2)})* 1(a^{(1)}\ is\ the\ max) \in R^{(N\cdot 1176)\times 4} \]

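Convolutional layer \(C1\)

Computing the gradient of \(y^{(1)}\)

Since \(a^{(1)}\in R^{(N\cdot 1176)\times 4}\), \(output^{(1)}\in R^{N\times 6\times 28\times 28}\), and \(y^{(1)}\in R^{(N\cdot 784)\times 6}\), and pooling layer \(S2\) uses filters of spatial size \(2\times 2\) with stride \(2\), the gradient of \(a^{(1)}\) is rearranged, following the sampling order, back into the gradient of \(output^{(1)}\) and then reshaped into the gradient of \(y^{(1)}\)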
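Because the pooling windows do not overlap, both rearrangements are pure reshapes and transposes. The sketch below assumes pooling rows ordered by (image, channel, output row, output column) and convolution rows ordered by (image, row, column); these orderings and both function names are illustrative assumptions.

```python
import numpy as np

def pool_cols_to_volume(da, shape, field=2):
    """Rearrange window rows (M, field*field) back into a volume [N, C, H, W]."""
    n, c, h, w = shape
    oh, ow = h // field, w // field
    d = da.reshape(n, c, oh, ow, field, field).transpose(0, 1, 2, 4, 3, 5)
    return d.reshape(n, c, h, w)

def volume_to_conv_rows(dout):
    """Rearrange a volume [N, C, H, W] into the (N*H*W, C) layout of y."""
    n, c, h, w = dout.shape
    return dout.transpose(0, 2, 3, 1).reshape(n * h * w, c)
```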

Gradient of the input vector \(z^{(1)}\)

\[ y^{(1)} = relu(z^{(1)})\\ dy^{(1)} = 1(z^{(1)} \geq 0)*dz^{(1)} \]

\[ d(dataloss) =tr(D_{y^{(1)}}f(y^{(1)})\cdot dy^{(1)})\\ =tr(D_{y^{(1)}}f(y^{(1)})\cdot (1(z^{(1)} \geq 0)*dz^{(1)}))\\ =tr(D_{y^{(1)}}f(y^{(1)})* 1(z^{(1)} \geq 0)^{T}\cdot dz^{(1)}) \]

\[ \Rightarrow D_{z^{(1)}}f(z^{(1)})=D_{y^{(1)}}f(y^{(1)})* 1(z^{(1)} \geq 0)^{T}\\ \Rightarrow \bigtriangledown_{z^{(1)}}f(z^{(1)}) =\bigtriangledown_{y^{(1)}}f(y^{(1)})* 1(z^{(1)} \geq 0) \in R^{(N\cdot 784)\times 6} \]

Other gradients

\[ z^{(1)}=a^{(0)}\cdot W^{(1)}+b^{(1)} \\ dz^{(1)}=da^{(0)}\cdot W^{(1)}+a^{(0)}\cdot dW^{(1)}+db^{(1)} \]

\[ d(dataloss)=tr(D_{z^{(1)}}f(z^{(1)})\cdot dz^{(1)}) =tr(D_{z^{(1)}}f(z^{(1)})\cdot (da^{(0)}\cdot W^{(1)} + a^{(0)}\cdot dW^{(1)} + db^{(1)}))\\ =tr(D_{z^{(1)}}f(z^{(1)})\cdot da^{(0)}\cdot W^{(1)}) +tr(D_{z^{(1)}}f(z^{(1)})\cdot a^{(0)}\cdot dW^{(1)}) +tr(D_{z^{(1)}}f(z^{(1)})\cdot db^{(1)}) \]

Gradient of the weight matrix \(W^{(1)}\)

\[ d(dataloss)=tr(D_{z^{(1)}}f(z^{(1)})\cdot a^{(0)}\cdot dW^{(1)}) \]

\[ \Rightarrow D_{W^{(1)}}f(W^{(1)}) =D_{z^{(1)}}f(z^{(1)})\cdot a^{(0)}\\ \Rightarrow \bigtriangledown_{W^{(1)}}f(W^{(1)}) =(a^{(0)})^{T}\cdot \bigtriangledown_{z^{(1)}}f(z^{(1)}) =R^{25\times (N\cdot 784)}\cdot R^{(N\cdot 784)\times 6} =R^{25\times 6} \]

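Gradient of the bias vector \(b^{(1)}\)

The bias \(b^{(1)}\in R^{1\times 6}\) is broadcast over the \(M\) rows of \(z^{(1)}\), so \(db^{(1)}\) inside \(dz^{(1)}\) stands for \(1_{M}\cdot db^{(1)}\), where \(1_{M}\in R^{M\times 1}\) is the all-ones vector

\[ d(dataloss)=tr(D_{z^{(1)}}f(z^{(1)})\cdot 1_{M}\cdot db^{(1)}) \]

\[ \Rightarrow D_{b^{(1)}}f(b^{(1)})=D_{z^{(1)}}f(z^{(1)})\cdot 1_{M}\\ \Rightarrow \bigtriangledown_{b^{(1)}}f(b^{(1)})=1_{M}^{T}\cdot \bigtriangledown_{z^{(1)}}f(z^{(1)})=\sum_{i=1}^{M} \bigtriangledown_{z^{(1)}}f(z^{(1)})_{i} \in R^{1\times 6} \]

\(M=N\cdot 784\) is the number of rows of \(dz^{(1)}\), and \(\bigtriangledown_{z^{(1)}}f(z^{(1)})_{i}\) is the \(i\)-th row. No extra averaging is needed, because the factor \(\frac {1}{N}\) is already carried by \(\bigtriangledown_{z^{(1)}}f(z^{(1)})\)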

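Summary

Advantages and disadvantages of the matrix computation

  • Advantage: the logic is simple and easy to follow
  • Disadvantage: it uses extra memory (because each value of a layer's data volume is copied into several positions of the matrix during the computation)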
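The memory overhead can be estimated from the sizes used in this post. The short sketch below compares the number of elements in one image's input volume with the number of elements in its unrolled matrix; the function name `im2col_expansion` is illustrative.

```python
def im2col_expansion(channels, size, field, stride=1):
    """Return (original elements, unrolled elements, ratio) for one image."""
    out = (size - field) // stride + 1
    original = channels * size * size
    unrolled = out * out * channels * field * field
    return original, unrolled, unrolled / original
```

For \(C1\) one image grows from \(1\cdot 32\cdot 32=1024\) values to \(784\cdot 25=19600\) values, and for \(C3\) from \(6\cdot 14\cdot 14=1176\) values to \(100\cdot 150=15000\) values.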
