Neural Network Derivation: Matrix Calculus

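To get the forward and backward passes of a neural network straight, I went through a lot of material. The forward pass is simple; the hard part is computing the gradients in the backward pass.

The cs231n course recommends differentiating with respect to a single element first and then generalizing step by step to gradients over a batch. See: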

Derivatives, Backpropagation, and Vectorization - CS231n

Vector, Matrix, and Tensor Derivatives - CS231n

Backpropagation and Neural Networks - CS231n

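Following those references, I also worked through an element-by-element derivation for a 2-layer neural network:

神经网络推导-单个数据 (Neural Network Derivation: Single Sample)

神经网络推导-批量数据 (Neural Network Derivation: Batch Data)

The best approach, of course, is to differentiate with matrices directly. Of the many blog posts I read, the better ones are:

矩阵求导术(上) (Matrix Differentiation Techniques, Part 1)

[矩阵求导]神经网络反向传播梯度计算数学原理 (Matrix Differentiation: The Mathematics of Backpropagation Gradients in Neural Networks)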

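The core of the matrix approach is taking the first-order differential of a real-valued scalar function of a matrix and identifying its Jacobian matrix. Following the textbook 《矩阵分析与应用》 (Matrix Analysis and Applications), the prerequisites are:

Derivatives, differentials, and gradients

Matrix basics

Jacobian matrices and gradient matrices

First-order differentials of real-valued scalar functions and Jacobian identification

Matrix differentials make the backward pass of a neural network straightforward. The key step is to write the differential in the form $df = \mathrm{tr}(A\,dX)$, read off the Jacobian matrix $A$, and then transpose it to obtain the gradient matrix $A^T$.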
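
As a small illustration of that recipe (a sketch, not taken from the referenced material): for the scalar function $f(X) = a^T X b$, the differential is $df = a^T\,dX\,b = \mathrm{tr}(b a^T\,dX)$, so the Jacobian is $b a^T$ and the gradient is its transpose $a b^T$. The vectors `a`, `b` and the sizes below are arbitrary choices made for this example; the identified gradient is compared with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
X = rng.standard_normal((3, 4))

def f(X):
    # Scalar function f(X) = a^T X b
    return a @ X @ b

# From df = tr(b a^T dX): the Jacobian is b a^T (4x3),
# and the gradient is its transpose a b^T (3x4, same shape as X).
jacobian = np.outer(b, a)
gradient = jacobian.T

# Central finite differences, perturbing one entry of X at a time.
eps = 1e-6
numeric = np.zeros_like(X)
for idx in np.ndindex(*X.shape):
    E = np.zeros_like(X)
    E[idx] = eps
    numeric[idx] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(numeric - gradient)))  # should be close to zero
```

The gradient always has the shape of the variable it belongs to, which is a quick sanity check on every step of the derivations that follow.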

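Derivation 1

The article [矩阵求导]神经网络反向传播梯度计算数学原理 takes a nice approach: it presents the implementation first and then explains the code step by step with matrix calculus.

The PyTorch tutorial Learning PyTorch with Examples gives a numpy implementation of a 2-layer neural network: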

# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

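Step 1: define the network parameters

  • Batch size $N = 64$
  • Number of input-layer neurons $D_{in} = 1000$
  • Number of hidden-layer neurons $H = 100$
  • Number of output-layer neurons $D_{out} = 10$
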
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

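Step 2: initialize the data, the weights (this network has no bias vectors), and the learning rate

  • Input data $x \in \mathbb{R}^{N \times D_{in}}$
  • Output data $y \in \mathbb{R}^{N \times D_{out}}$
  • Hidden-layer weight matrix $w_1 \in \mathbb{R}^{D_{in} \times H}$
  • Output-layer weight matrix $w_2 \in \mathbb{R}^{H \times D_{out}}$
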
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6

Step 3 (inside the loop): feed the batch through the network (forward pass)

$$
h = x\,w_1, \qquad h_{relu} = \max(0, h), \qquad y_{pred} = h_{relu}\,w_2
$$

# Forward pass: compute predicted y
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)

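Step 4 (inside the loop): compute the loss. It is the sum of squared errors, i.e. the squared Frobenius norm of $y_{pred} - y$ (equivalently, the squared L2 norm of $\mathrm{vec}(y_{pred} - y)$, not an L1 norm).

$$
\mathit{loss} = \| y_{pred} - y \|_F^2
$$
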
# Compute and print loss
loss = np.square(y_pred - y).sum()
print(t, loss)
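
The equivalent ways of writing this loss can be checked numerically. The following is a minimal sketch with small, arbitrarily chosen shapes (they are not the sizes used in the listing): the element-wise sum of squares, the squared Frobenius norm, the inner product of the vectorized matrix, and the trace form $\mathrm{tr}(X^T X)$ should all agree.

```python
import numpy as np

rng = np.random.default_rng(0)
y_pred = rng.standard_normal((4, 3))
y = rng.standard_normal((4, 3))
X = y_pred - y

loss_sum = np.square(X).sum()                 # the form used in the listing
loss_fro = np.linalg.norm(X, 'fro') ** 2      # squared Frobenius norm
# vec(X)^T vec(X); the stacking order does not matter for an inner product
loss_vec = X.reshape(-1) @ X.reshape(-1)
loss_trace = np.trace(X.T @ X)                # tr(X^T X)

print(loss_sum, loss_fro, loss_vec, loss_trace)
```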

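Step 5 (inside the loop, backward pass): gradient with respect to the output layer's input (pre-activation) $y_{pred}$

Let $X = y_{pred} - y$ with $X \in \mathbb{R}^{N \times D_{out}}$. Then

$$
\mathit{loss} = \|X\|_F^2 = \mathrm{vec}(X)^T\,\mathrm{vec}(X)
$$

Take the differential of the loss. The loss is a scalar, so it equals its own trace:

$$
\begin{aligned}
d\,\mathit{loss} &= d(\mathrm{tr}(\mathit{loss})) = \mathrm{tr}(d\,\mathit{loss}) = \mathrm{tr}\big(d(\mathrm{vec}(X)^T\,\mathrm{vec}(X))\big) \\
&= \mathrm{tr}\big(d(\mathrm{vec}(X)^T)\,\mathrm{vec}(X) + \mathrm{vec}(X)^T\,d\,\mathrm{vec}(X)\big) \\
&= \mathrm{tr}\big((d\,\mathrm{vec}(X))^T\,\mathrm{vec}(X)\big) + \mathrm{tr}\big(\mathrm{vec}(X)^T\,d\,\mathrm{vec}(X)\big) \\
&= \mathrm{tr}\big(\mathrm{vec}(X)^T\,d\,\mathrm{vec}(X)\big) + \mathrm{tr}\big(\mathrm{vec}(X)^T\,d\,\mathrm{vec}(X)\big) \\
&= \mathrm{tr}\big(2\,\mathrm{vec}(X)^T\,d\,\mathrm{vec}(X)\big) \\
&= \mathrm{tr}(2 X^T\,dX)
\end{aligned}
$$

So the Jacobian matrix is $D_X f(X) = 2X^T$ and the gradient matrix is $\nabla_X f(X) = 2X = 2(y_{pred} - y)$. Because $y$ is constant, $dX = dy_{pred}$, so this is also the gradient with respect to $y_{pred}$.
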
grad_y_pred = 2.0 * (y_pred - y)
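
The meaning of the differential can be made concrete with a small numerical sketch (shapes and the perturbation scale are arbitrary choices for this example). For this quadratic loss, the change caused by a perturbation $dX$ is exactly $\mathrm{tr}(2X^T dX) + \|dX\|_F^2$: the first term is the differential derived above, the second is the higher-order remainder that vanishes faster as $dX \to 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
y_pred = rng.standard_normal((4, 3))
y = rng.standard_normal((4, 3))

def loss_fn(y_pred):
    return np.square(y_pred - y).sum()

X = y_pred - y
grad_y_pred = 2.0 * X                          # gradient matrix 2 (y_pred - y)

dX = 1e-3 * rng.standard_normal(X.shape)       # a small perturbation of y_pred
actual_change = loss_fn(y_pred + dX) - loss_fn(y_pred)
first_order = np.trace(grad_y_pred.T @ dX)     # tr(2 X^T dX)
second_order = np.square(dX).sum()             # ||dX||_F^2

print(actual_change, first_order, second_order)
```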

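Step 6 (inside the loop, backward pass): gradients with respect to the output-layer weight matrix and the hidden-layer output

$$
y_{pred} = h_{relu}\,w_2, \qquad dy_{pred} = dh_{relu}\,w_2 + h_{relu}\,dw_2
$$

$$
\begin{aligned}
d\,\mathit{loss} &= \mathrm{tr}(2X^T\,dX) = \mathrm{tr}\big(2(y_{pred}-y)^T\,d(y_{pred}-y)\big) = \mathrm{tr}\big(2(y_{pred}-y)^T\,dy_{pred}\big) \\
&= \mathrm{tr}\big(2(y_{pred}-y)^T (dh_{relu}\,w_2 + h_{relu}\,dw_2)\big) \\
&= \mathrm{tr}\big(2(y_{pred}-y)^T\,dh_{relu}\,w_2\big) + \mathrm{tr}\big(2(y_{pred}-y)^T\,h_{relu}\,dw_2\big) \\
&= \mathrm{tr}\big(w_2\,2(y_{pred}-y)^T\,dh_{relu}\big) + \mathrm{tr}\big(2(y_{pred}-y)^T\,h_{relu}\,dw_2\big)
\end{aligned}
$$

The last line uses the cyclic property of the trace, $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.

For the output-layer weights, the Jacobian matrix is $2(y_{pred}-y)^T h_{relu}$ and the gradient matrix is $h_{relu}^T \cdot 2(y_{pred}-y)$.

For the hidden-layer output, the Jacobian matrix is $w_2 \cdot 2(y_{pred}-y)^T$ and the gradient matrix is $2(y_{pred}-y)\,w_2^T$.
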
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
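
These two gradients can be checked against finite differences. The sketch below uses small, arbitrary shapes and a hypothetical helper `numeric_grad` written for this example; it treats `h_relu` and `w2` as the free variables of the loss, exactly as in this step of the derivation.

```python
import numpy as np

def numeric_grad(f, a, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to array a.

    f takes no arguments and reads a, which is perturbed in place.
    """
    g = np.zeros_like(a)
    for idx in np.ndindex(*a.shape):
        old = a[idx]
        a[idx] = old + eps
        f_plus = f()
        a[idx] = old - eps
        f_minus = f()
        a[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
N, H, D_out = 5, 4, 3
h_relu = np.maximum(rng.standard_normal((N, H)), 0)
w2 = rng.standard_normal((H, D_out))
y = rng.standard_normal((N, D_out))

def loss_fn():
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients, as in the listing
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)

num_w2 = numeric_grad(loss_fn, w2)
num_h_relu = numeric_grad(loss_fn, h_relu)
print(np.max(np.abs(num_w2 - grad_w2)), np.max(np.abs(num_h_relu - grad_h_relu)))
```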

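Step 7 (inside the loop, backward pass): gradient with respect to the hidden-layer input $h$

$$
h_{relu} = \max(0, h), \qquad
dh_{relu} = \begin{cases} dh, & h \ge 0 \\ 0, & h < 0 \end{cases} \;=\; 1(h \ge 0) \odot dh
$$

The activation function acts element by element, so the differential uses the Hadamard (element-wise) product $\odot$. Moving the indicator across the trace relies on the identity $\mathrm{tr}\big(A^T (B \odot C)\big) = \mathrm{tr}\big((A \odot B)^T C\big)$:

$$
\begin{aligned}
d\,\mathit{loss} &= \mathrm{tr}\big(w_2\,2(y_{pred}-y)^T\,dh_{relu}\big) \\
&= \mathrm{tr}\big(w_2\,2(y_{pred}-y)^T\,(1(h \ge 0) \odot dh)\big) \\
&= \mathrm{tr}\big((2(y_{pred}-y)\,w_2^T)^T\,(1(h \ge 0) \odot dh)\big) \\
&= \mathrm{tr}\big(\big((2(y_{pred}-y)\,w_2^T)^T \odot 1(h \ge 0)^T\big)\,dh\big)
\end{aligned}
$$

So the Jacobian matrix is $(2(y_{pred}-y)\,w_2^T)^T \odot 1(h \ge 0)^T$ and the gradient matrix is

$$
\nabla_h f(h) = 1(h \ge 0) \odot \big(2(y_{pred}-y)\,w_2^T\big)
$$
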
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0
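
The Hadamard step rests on the trace identity $\mathrm{tr}(A^T (B \odot C)) = \mathrm{tr}((A \odot B)^T C)$. A minimal sketch with random stand-in matrices (the names `mask` and `dh` and the shapes are assumptions of this example) checks the identity and shows that multiplying by the indicator gives the same result as the copy-and-zero code in the listing.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 5, 4
h = rng.standard_normal((N, H))
grad_h_relu = rng.standard_normal((N, H))   # stands in for 2 (y_pred - y) w2^T
dh = rng.standard_normal((N, H))            # an arbitrary perturbation of h

mask = (h >= 0).astype(h.dtype)             # the indicator 1(h >= 0)

# tr(A^T (B ⊙ C)) == tr((A ⊙ B)^T C) with A = grad_h_relu, B = mask, C = dh
lhs = np.trace(grad_h_relu.T @ (mask * dh))
rhs = np.trace((grad_h_relu * mask).T @ dh)

# Copy-and-zero, as written in the listing
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0

print(lhs, rhs, np.allclose(grad_h, grad_h_relu * mask))
```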

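Step 8 (inside the loop, backward pass): gradient with respect to the hidden-layer weight matrix

$$
h = x\,w_1, \qquad dh = x\,dw_1
$$

$$
\begin{aligned}
d\,\mathit{loss} &= \mathrm{tr}\big(\big((2(y_{pred}-y)\,w_2^T)^T \odot 1(h \ge 0)^T\big)\,dh\big) \\
&= \mathrm{tr}\big(\big((2(y_{pred}-y)\,w_2^T)^T \odot 1(h \ge 0)^T\big)\,x\,dw_1\big)
\end{aligned}
$$

So the Jacobian matrix is $\big((2(y_{pred}-y)\,w_2^T)^T \odot 1(h \ge 0)^T\big)\,x$ and the gradient matrix is

$$
\nabla_{w_1} f(w_1) = \Big(\big((2(y_{pred}-y)\,w_2^T)^T \odot 1(h \ge 0)^T\big)\,x\Big)^T = x^T \big(1(h \ge 0) \odot 2(y_{pred}-y)\,w_2^T\big)
$$
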
grad_w1 = x.T.dot(grad_h)
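
The whole chain down to `grad_w1` can be checked end to end with a directional derivative: for a random direction `d`, the slope of the loss along `d` should equal $\mathrm{tr}(\nabla_{w_1}^T d)$. This is a sketch under assumed small shapes; `loss_fn` and `d` are names introduced for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 6, 5, 4, 3
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1):
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradient, following the same steps as the listing
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

# Slope of the loss along a random direction d, by central differences
d = rng.standard_normal(w1.shape)
eps = 1e-6
numeric = (loss_fn(w1 + eps * d) - loss_fn(w1 - eps * d)) / (2 * eps)
analytic = np.sum(grad_w1 * d)               # tr(grad_w1^T d)

print(numeric, analytic)
```

A single directional check costs two loss evaluations regardless of the size of `w1`, which makes it cheaper than perturbing every entry.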

Step 9 (inside the loop): update the weight matrices with a gradient-descent step

# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2

Derivation 2

The cs231n note Putting it together: Minimal Neural Network Case Study implements a 2-layer neural network:

import numpy as np

N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j

# initialize parameters randomly
h = 100 # size of hidden layer
W = 0.01 * np.random.randn(D,h)
b = np.zeros((1,h))
W2 = 0.01 * np.random.randn(h,K)
b2 = np.zeros((1,K))

# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength

# gradient descent loop
num_examples = X.shape[0]
for i in range(10000):

    # evaluate class scores, [N x K]
    hidden_layer = np.maximum(0, np.dot(X, W) + b) # note, ReLU activation
    scores = np.dot(hidden_layer, W2) + b2

    # compute the class probabilities
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]

    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_examples),y])
    data_loss = np.sum(correct_logprobs)/num_examples
    reg_loss = 0.5*reg*np.sum(W*W) + 0.5*reg*np.sum(W2*W2)
    loss = data_loss + reg_loss
    if i % 1000 == 0:
        print("iteration %d: loss %f" % (i, loss))

    # compute the gradient on scores
    dscores = probs
    dscores[range(num_examples),y] -= 1
    dscores /= num_examples

    # backpropagate the gradient to the parameters
    # first backprop into parameters W2 and b2
    dW2 = np.dot(hidden_layer.T, dscores)
    db2 = np.sum(dscores, axis=0, keepdims=True)
    # next backprop into hidden layer
    dhidden = np.dot(dscores, W2.T)
    # backprop the ReLU non-linearity
    dhidden[hidden_layer <= 0] = 0
    # finally into W,b
    dW = np.dot(X.T, dhidden)
    db = np.sum(dhidden, axis=0, keepdims=True)

    # add regularization gradient contribution
    dW2 += reg * W2
    dW += reg * W

    # perform a parameter update
    W += -step_size * dW
    b += -step_size * db
    W2 += -step_size * dW2
    b2 += -step_size * db2

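Step 1: set up the batch of input and output data

  • Points per class $N = 100$, so the batch holds $N \cdot K = 300$ examples (num_examples in the code)
  • Data dimensionality $D = 2$
  • Number of classes $K = 3$
  • Input data $X \in \mathbb{R}^{NK \times D}$
  • Labels $y \in \{0, 1, 2\}^{NK}$; the derivation uses their one-hot form $Y \in \mathbb{R}^{NK \times K}$

In the formulas from Step 4 on, $N$ stands for the number of rows of $X$ (num_examples), as in the $\frac{1}{N}$ factor of the averaged loss.

N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j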

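Step 2: initialize the weight parameters

  • Number of hidden-layer neurons $h = 100$
  • Hidden-layer weight matrix $W \in \mathbb{R}^{D \times h}$
  • Hidden-layer bias vector $b \in \mathbb{R}^{1 \times h}$
  • Output-layer weight matrix $W_2 \in \mathbb{R}^{h \times K}$
  • Output-layer bias vector $b_2 \in \mathbb{R}^{1 \times K}$
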
# initialize parameters randomly
h = 100 # size of hidden layer
W = 0.01 * np.random.randn(D,h)
b = np.zeros((1,h))
W2 = 0.01 * np.random.randn(h,K)
b2 = np.zeros((1,K))

Step 3: set the learning rate and the regularization strength

# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength

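Step 4 (inside the loop): feed the batch through the network (forward pass)

$$
\mathit{hiddenLayer} = \max(XW + b,\, 0), \qquad \mathit{scores} = \mathit{hiddenLayer} \cdot W_2 + b_2
$$
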
# evaluate class scores, [N x K]
hidden_layer = np.maximum(0, np.dot(X, W) + b) # note, ReLU activation
scores = np.dot(hidden_layer, W2) + b2

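Step 4, continued (inside the loop): compute the loss

$$
\begin{aligned}
\mathit{expScores} &= \exp(\mathit{scores}) \\
\mathit{probs} &= \frac{\mathit{expScores}}{\mathit{expScores} \cdot \mathbf{1}} \\
\mathit{correctLogProbs} &= -\ln \mathit{probs}_y \in \mathbb{R}^{N \times 1} \\
\mathit{dataLoss} &= \frac{1}{N}\,\mathbf{1}^T \mathit{correctLogProbs} \\
\mathit{regLoss} &= 0.5\,\mathit{reg}\,\|W\|^2 + 0.5\,\mathit{reg}\,\|W_2\|^2 \\
\mathit{loss} &= \mathit{dataLoss} + \mathit{regLoss}
\end{aligned}
$$

$\mathbf{1}$ is the summing vector $[1, 1, \dots]^T$, and $N$ is the number of examples (num_examples in the code).

$\mathit{probs}_y$ holds, for each row, the probability of the correct class.

The division in $\mathit{probs}$ is element-wise: every row of $\mathit{expScores}$ is divided by its own row sum.

# compute the class probabilities
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]

# compute the loss: average cross-entropy loss and regularization
correct_logprobs = -np.log(probs[range(num_examples),y])
data_loss = np.sum(correct_logprobs)/num_examples
reg_loss = 0.5*reg*np.sum(W*W) + 0.5*reg*np.sum(W2*W2)
loss = data_loss + reg_loss
if i % 1000 == 0:
    print("iteration %d: loss %f" % (i, loss))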
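
The code selects the correct-class probabilities by indexing, while the formulas express the same selection with a one-hot matrix $Y$ and the summing vector. A minimal sketch of that correspondence, with small arbitrary shapes and random labels chosen for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
num_examples, K = 6, 3
scores = rng.standard_normal((num_examples, K))
y = rng.integers(0, K, size=num_examples)

exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# One-hot label matrix Y: row i has a single 1 in column y[i]
Y = np.zeros((num_examples, K))
Y[np.arange(num_examples), y] = 1.0
ones_K = np.ones(K)

probs_y_indexed = probs[np.arange(num_examples), y]   # indexing, as in the code
probs_y_matrix = (probs * Y) @ ones_K                 # (probs ⊙ Y) 1, as in the formulas

data_loss_indexed = np.sum(-np.log(probs_y_indexed)) / num_examples
data_loss_matrix = -(np.ones(num_examples) @ np.log(probs_y_matrix)) / num_examples

print(data_loss_indexed, data_loss_matrix)
```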

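Step 5 (inside the loop, backward pass): gradient with respect to the output layer's input, $\mathit{scores}$

Define the following quantities, where $\odot$ is the Hadamard product and every division is element-wise (a column vector in the denominator divides each row):

$$
\begin{aligned}
\mathit{scores}_y &= (\mathit{scores} \odot Y)\,\mathbf{1}, & \mathit{expscores}_y &= \exp\big((\mathit{scores} \odot Y)\,\mathbf{1}\big) \\
\mathit{expscores} &= \exp(\mathit{scores}), & \mathit{expscores}_{sum} &= \exp(\mathit{scores})\,\mathbf{1} \\
\mathit{probs}_y &= \frac{\mathit{expscores}_y}{\mathit{expscores}_{sum}}, & \mathit{probs} &= \frac{\mathit{expscores}}{\mathit{expscores}_{sum}}
\end{aligned}
$$

$$
\begin{aligned}
\mathit{dataloss} &= -\frac{1}{N}\,\mathbf{1}^T \ln(\mathit{probs}_y) = -\frac{1}{N}\,\mathbf{1}^T \ln\frac{\mathit{expscores}_y}{\mathit{expscores}_{sum}} \\
&= -\frac{1}{N}\,\mathbf{1}^T \big(\ln \mathit{expscores}_y - \ln \mathit{expscores}_{sum}\big) \\
&= -\frac{1}{N}\,\mathbf{1}^T \big((\mathit{scores} \odot Y)\,\mathbf{1} - \ln \mathit{expscores}_{sum}\big)
\end{aligned}
$$

$$
\begin{aligned}
d(\mathit{dataloss}) &= \mathrm{tr}\Big(d\Big(-\frac{1}{N}\big(\mathbf{1}^T (\mathit{scores} \odot Y)\,\mathbf{1} - \mathbf{1}^T \ln \mathit{expscores}_{sum}\big)\Big)\Big) \\
&= -\,\mathrm{tr}\Big(d\Big(\frac{1}{N}\,\mathbf{1}^T (\mathit{scores} \odot Y)\,\mathbf{1}\Big)\Big) + \mathrm{tr}\Big(d\Big(\frac{1}{N}\,\mathbf{1}^T \ln \mathit{expscores}_{sum}\Big)\Big)
\end{aligned}
$$

The first term:

$$
\mathrm{tr}\Big(d\Big(\frac{1}{N}\,\mathbf{1}^T (\mathit{scores} \odot Y)\,\mathbf{1}\Big)\Big) = \mathrm{tr}\Big(\frac{1}{N}\,\mathbf{1}^T (d\mathit{scores} \odot Y)\,\mathbf{1}\Big) = \mathrm{tr}\Big(\frac{1}{N}\,d\mathit{scores}^T\,Y\Big) = \mathrm{tr}\Big(\frac{1}{N}\,Y^T\,d\mathit{scores}\Big)
$$

The second term:

$$
\begin{aligned}
\mathrm{tr}\Big(d\Big(\frac{1}{N}\,\mathbf{1}^T \ln \mathit{expscores}_{sum}\Big)\Big)
&= \mathrm{tr}\Big(\frac{1}{N}\,\mathbf{1}^T \frac{d\,\mathit{expscores}_{sum}}{\mathit{expscores}_{sum}}\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}\,\mathbf{1}^T \frac{(\exp(\mathit{scores}) \odot d\mathit{scores})\,\mathbf{1}}{\mathit{expscores}_{sum}}\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}\Big(\frac{\exp(\mathit{scores})}{\mathit{expscores}_{sum}}\Big)^T d\mathit{scores}\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}\,\mathit{probs}^T\,d\mathit{scores}\Big)
\end{aligned}
$$

Combining the two:

$$
d(\mathit{dataloss}) = -\,\mathrm{tr}\Big(\frac{1}{N}\,Y^T\,d\mathit{scores}\Big) + \mathrm{tr}\Big(\frac{1}{N}\,\mathit{probs}^T\,d\mathit{scores}\Big) = \mathrm{tr}\Big(\frac{1}{N}\,(\mathit{probs}^T - Y^T)\,d\mathit{scores}\Big)
$$

So the Jacobian matrix is $D_{scores} f(\mathit{scores}) = \frac{1}{N}(\mathit{probs}^T - Y^T)$ and the gradient matrix is $\nabla_{scores} f(\mathit{scores}) = \frac{1}{N}(\mathit{probs} - Y)$.

  • $Y$ has size $N \times K$; in each row only the position of the correct class is 1, the rest are 0
  • $\mathbf{1}$ is the summing vector $[1, 1, \dots]^T$

Computing the gradient of the softmax cross-entropy loss with respect to the output layer's input took me a long time. The main difficulty was keeping matrix division and element-wise (scalar) division apart. It still feels easier to differentiate for a single sample first and then generalize.
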
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
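
The result $\frac{1}{N}(\mathit{probs} - Y)$ can be verified numerically. The sketch below uses small arbitrary shapes and a helper `data_loss` written for this example; it compares the analytic gradient with central finite differences of the averaged cross-entropy loss.

```python
import numpy as np

def data_loss(scores, y):
    """Average softmax cross-entropy loss over the rows of scores."""
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    n = scores.shape[0]
    return np.sum(-np.log(probs[np.arange(n), y])) / n

rng = np.random.default_rng(0)
n, K = 5, 3
scores = rng.standard_normal((n, K))
y = rng.integers(0, K, size=n)

# Analytic gradient (probs - Y) / N
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
Y = np.zeros((n, K))
Y[np.arange(n), y] = 1.0
dscores = (probs - Y) / n

# Central finite differences, one entry of scores at a time
eps = 1e-6
numeric = np.zeros_like(scores)
for idx in np.ndindex(*scores.shape):
    E = np.zeros_like(scores)
    E[idx] = eps
    numeric[idx] = (data_loss(scores + E, y) - data_loss(scores - E, y)) / (2 * eps)

print(np.max(np.abs(numeric - dscores)))  # should be close to zero
```

Each row of the gradient sums to zero, because the probabilities in a row sum to 1 and the one-hot row contributes exactly one 1.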

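Step 6 (inside the loop, backward pass): gradients with respect to the output-layer weight matrix, the output-layer bias vector, and the hidden-layer output

$$
\mathit{scores} = \mathit{hiddenLayer} \cdot W_2 + b_2, \qquad d\mathit{scores} = d\mathit{hiddenLayer} \cdot W_2 + \mathit{hiddenLayer} \cdot dW_2 + db_2
$$

$$
\begin{aligned}
d(\mathit{dataloss}) &= \mathrm{tr}\Big(\frac{1}{N}(\mathit{probs}^T - Y^T)\,d\mathit{scores}\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}(\mathit{probs}^T - Y^T)\,(d\mathit{hiddenLayer} \cdot W_2 + \mathit{hiddenLayer} \cdot dW_2 + db_2)\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}(\mathit{probs}^T - Y^T)\,d\mathit{hiddenLayer} \cdot W_2\Big) + \mathrm{tr}\Big(\frac{1}{N}(\mathit{probs}^T - Y^T)\,\mathit{hiddenLayer} \cdot dW_2\Big) + \mathrm{tr}\Big(\frac{1}{N}(\mathit{probs}^T - Y^T)\,db_2\Big)
\end{aligned}
$$

Gradient of the output-layer weight matrix (only the $dW_2$ term contributes):

$$
d(\mathit{dataloss}) = \mathrm{tr}\Big(\frac{1}{N}(\mathit{probs}^T - Y^T)\,\mathit{hiddenLayer} \cdot dW_2\Big)
$$

$$
D_{W_2} f(W_2) = \frac{1}{N}(\mathit{probs}^T - Y^T)\,\mathit{hiddenLayer}, \qquad \nabla_{W_2} f(W_2) = \frac{1}{N}\,\mathit{hiddenLayer}^T (\mathit{probs} - Y)
$$

Gradient of the output-layer bias vector:

$$
d(\mathit{dataloss}) = \mathrm{tr}\Big(\frac{1}{N}\sum_{i=1}^{N} (\mathit{probs}_i^T - Y_i^T)\,db_2\Big)
$$

$$
D_{b_2} f(b_2) = \frac{1}{N}\sum_{i=1}^{N} (\mathit{probs}_i^T - Y_i^T), \qquad \nabla_{b_2} f(b_2) = \frac{1}{N}\sum_{i=1}^{N} (\mathit{probs}_i - Y_i)
$$

Here $\mathit{probs}_i$ and $Y_i$ are the $i$-th rows. Mind the dimensions for the bias: $b_2$ is a single $1 \times K$ row that is added to every row of the batch, so its gradient is the sum of the per-example gradients over the batch.

Gradient of the hidden-layer output (only the $d\mathit{hiddenLayer}$ term contributes):

$$
d(\mathit{dataloss}) = \mathrm{tr}\Big(\frac{1}{N}(\mathit{probs}^T - Y^T)\,d\mathit{hiddenLayer} \cdot W_2\Big) = \mathrm{tr}\Big(\frac{1}{N}\,W_2\,(\mathit{probs}^T - Y^T)\,d\mathit{hiddenLayer}\Big)
$$

$$
D_{hiddenLayer} f(\mathit{hiddenLayer}) = \frac{1}{N}\,W_2\,(\mathit{probs}^T - Y^T), \qquad \nabla_{hiddenLayer} f(\mathit{hiddenLayer}) = \frac{1}{N}(\mathit{probs} - Y)\,W_2^T
$$

# backpropagate the gradient to the parameters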
# first backprop into parameters W2 and b2
dW2 = np.dot(hidden_layer.T, dscores)
db2 = np.sum(dscores, axis=0, keepdims=True)
# next backprop into hidden layer
dhidden = np.dot(dscores, W2.T)
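
The bias term is where shapes are easiest to get wrong. One way to see the sum over the batch (a sketch with arbitrary small shapes; `ones` and the random stand-in for `dscores` are assumptions of this example): NumPy broadcasting of `b2` over the rows is the matrix product $\mathbf{1} b_2$ with $\mathbf{1} \in \mathbb{R}^{N \times 1}$, so the differential contains $\mathbf{1}\,db_2$ and the gradient is $\mathbf{1}^T$ times the gradient on the scores, i.e. a column-wise sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, K = 5, 4, 3
hidden_layer = np.maximum(0, rng.standard_normal((n, h)))
W2 = rng.standard_normal((h, K))
b2 = rng.standard_normal((1, K))
dscores = rng.standard_normal((n, K))   # stands in for (probs - Y) / N
ones = np.ones((n, 1))

# Broadcasting b2 over the rows equals the explicit product 1 b2
scores_broadcast = hidden_layer @ W2 + b2
scores_explicit = hidden_layer @ W2 + ones @ b2

# Gradient for b2: column-wise sum, as in the code, versus 1^T dscores
db2_code = np.sum(dscores, axis=0, keepdims=True)
db2_matrix = ones.T @ dscores

print(db2_code.shape, np.allclose(db2_code, db2_matrix))
```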

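Step 7 (inside the loop, backward pass): gradient with respect to the hidden-layer input

$$
\mathit{hiddenLayer}_{in} = XW + b, \qquad \mathit{hiddenLayer} = \max(0, \mathit{hiddenLayer}_{in}), \qquad d\mathit{hiddenLayer} = 1(\mathit{hiddenLayer}_{in} \ge 0) \odot d\mathit{hiddenLayer}_{in}
$$

$$
\begin{aligned}
d(\mathit{dataloss}) &= \mathrm{tr}\Big(\frac{1}{N}\,W_2\,(\mathit{probs}^T - Y^T)\,d\mathit{hiddenLayer}\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}\,W_2\,(\mathit{probs}^T - Y^T)\,\big(1(\mathit{hiddenLayer}_{in} \ge 0) \odot d\mathit{hiddenLayer}_{in}\big)\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}\,\big(W_2\,(\mathit{probs}^T - Y^T) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)^T\big)\,d\mathit{hiddenLayer}_{in}\Big)
\end{aligned}
$$

$$
D_{hiddenLayer_{in}} f = \frac{1}{N}\,W_2\,(\mathit{probs}^T - Y^T) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)^T, \qquad
\nabla_{hiddenLayer_{in}} f = \frac{1}{N}\,\big((\mathit{probs} - Y)\,W_2^T\big) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)
$$

Both factors of the Hadamard product have size $h \times N$ in the Jacobian and $N \times h$ in the gradient. The code masks with hidden_layer <= 0, which differs from the indicator above only where $\mathit{hiddenLayer}_{in}$ is exactly 0, a point where ReLU is not differentiable and either choice is a valid subgradient.
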
# backprop the ReLU non-linearity
dhidden[hidden_layer <= 0] = 0

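Step 7, continued (inside the loop, backward pass): gradients with respect to the hidden-layer weight matrix and bias vector

$$
\mathit{hiddenLayer}_{in} = XW + b, \qquad d\mathit{hiddenLayer}_{in} = X\,dW + db
$$

$$
\begin{aligned}
d(\mathit{dataloss}) &= \mathrm{tr}\Big(\frac{1}{N}\,\big(W_2\,(\mathit{probs}^T - Y^T) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)^T\big)\,d\mathit{hiddenLayer}_{in}\Big) \\
&= \mathrm{tr}\Big(\frac{1}{N}\,\big(W_2\,(\mathit{probs}^T - Y^T) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)^T\big)\,(X\,dW + db)\Big)
\end{aligned}
$$

Gradient of the hidden-layer weight matrix:

$$
d(\mathit{dataloss}) = \mathrm{tr}\Big(\frac{1}{N}\,\big(W_2\,(\mathit{probs}^T - Y^T) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)^T\big)\,X\,dW\Big)
$$

$$
D_W f(W) = \frac{1}{N}\,\big(W_2\,(\mathit{probs}^T - Y^T) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)^T\big)\,X, \qquad
\nabla_W f(W) = \frac{1}{N}\,X^T \Big(\big((\mathit{probs} - Y)\,W_2^T\big) \odot 1(\mathit{hiddenLayer}_{in} \ge 0)\Big)
$$

Gradient of the hidden-layer bias vector, where the subscript $i$ selects the $i$-th example (row):

$$
d(\mathit{dataloss}) = \mathrm{tr}\Big(\frac{1}{N}\sum_{i=1}^{N} \big(W_2\,(\mathit{probs}_i^T - Y_i^T) \odot 1(\mathit{hiddenLayer}_{in,i} \ge 0)^T\big)\,db\Big)
$$

$$
D_b f(b) = \frac{1}{N}\sum_{i=1}^{N} \big(W_2\,(\mathit{probs}_i^T - Y_i^T) \odot 1(\mathit{hiddenLayer}_{in,i} \ge 0)^T\big), \qquad
\nabla_b f(b) = \frac{1}{N}\sum_{i=1}^{N} \big((\mathit{probs}_i - Y_i)\,W_2^T\big) \odot 1(\mathit{hiddenLayer}_{in,i} \ge 0)
$$

As with $b_2$, mind the dimensions: $b$ is a single $1 \times h$ row shared by every example, so its gradient is the sum of the per-example gradients over the batch.
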
# finally into W,b
dW = np.dot(X.T, dhidden)
db = np.sum(dhidden, axis=0, keepdims=True)
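
The full chain from the data loss down to `dW` and `db` can be checked with one directional derivative. This is a sketch under assumed small shapes and random data; `data_loss`, `dir_W`, and `dir_b` are names introduced for the example, and the regularization term is left out so that only the data-loss gradients are tested.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, h, K = 6, 2, 5, 3
X = rng.standard_normal((n, D))
y = rng.integers(0, K, size=n)
W = rng.standard_normal((D, h))
b = 0.1 * rng.standard_normal((1, h))
W2 = rng.standard_normal((h, K))
b2 = np.zeros((1, K))

def data_loss(W, b):
    hidden_layer = np.maximum(0, X @ W + b)
    scores = hidden_layer @ W2 + b2
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.sum(-np.log(probs[np.arange(n), y])) / n

# Analytic gradients, following the same steps as the listing
hidden_layer = np.maximum(0, X @ W + b)
scores = hidden_layer @ W2 + b2
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
dscores = probs.copy()
dscores[np.arange(n), y] -= 1
dscores /= n
dhidden = dscores @ W2.T
dhidden[hidden_layer <= 0] = 0
dW = X.T @ dhidden
db = np.sum(dhidden, axis=0, keepdims=True)

# Slope of the loss along a random joint direction (dir_W, dir_b)
dir_W = rng.standard_normal(W.shape)
dir_b = rng.standard_normal(b.shape)
eps = 1e-6
numeric = (data_loss(W + eps * dir_W, b + eps * dir_b)
           - data_loss(W - eps * dir_W, b - eps * dir_b)) / (2 * eps)
analytic = np.sum(dW * dir_W) + np.sum(db * dir_b)

print(numeric, analytic)
```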

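Step 8 (inside the loop, backward pass): gradient of the regularization term

$$
\mathit{regLoss} = 0.5\,\mathit{reg}\,\|W\|^2 + 0.5\,\mathit{reg}\,\|W_2\|^2, \qquad
d(\mathit{regLoss}) = \mathrm{tr}(\mathit{reg}\,W^T\,dW) + \mathrm{tr}(\mathit{reg}\,W_2^T\,dW_2)
$$

So the regularization adds $\mathit{reg} \cdot W$ to the gradient of $W$ and $\mathit{reg} \cdot W_2$ to the gradient of $W_2$.
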
# add regularization gradient contribution
dW2 += reg * W2
dW += reg * W
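
A quick numerical sketch of the regularization gradient (the shape of `W` is an arbitrary choice for this example): the gradient of $0.5\,\mathit{reg}\,\|W\|^2$ should equal $\mathit{reg} \cdot W$ entry by entry.

```python
import numpy as np

rng = np.random.default_rng(0)
reg = 1e-3
W = rng.standard_normal((2, 5))

def reg_loss(W):
    return 0.5 * reg * np.sum(W * W)

analytic = reg * W

# Central finite differences, one entry of W at a time
eps = 1e-6
numeric = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    E = np.zeros_like(W)
    E[idx] = eps
    numeric[idx] = (reg_loss(W + E) - reg_loss(W - E)) / (2 * eps)

print(np.max(np.abs(numeric - analytic)))  # should be close to zero
```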

Step 9 (inside the loop): update the weight matrices and bias vectors with a gradient-descent step

# perform a parameter update
W += -step_size * dW
b += -step_size * db
W2 += -step_size * dW2
b2 += -step_size * db2

Related material

  1. The Matrix Cookbook - Mathematics
  2. Matrix calculus