Softmax Regression
Softmax regression is commonly used for multi-class classification; its outputs can be read directly as predicted class probabilities.
Suppose we classify over $k$ labels ($[1, 2, ..., k]$). Softmax regression then outputs a $k$-dimensional vector in which each entry is the predicted probability of one class.
Below, we first compute the scoring function, the loss function, and their derivatives for a single input, and then extend the computation to a batch of inputs.
Logarithm and exponent rules
Sum of logarithms
$$
\log_{a} x+\log_{a} y = \log_{a} (xy)
$$
Difference of logarithms
$$
\log_{a} x-\log_{a} y = \log_{a} \frac{x}{y}
$$
Product of exponentials
$$
e^{x}\cdot e^{y} = e^{x+y}
$$
Quotient rule
If the functions $u(x)$ and $v(x)$ are both differentiable, then
$$
\left(\frac{u(x)}{v(x)}\right)^{\prime}=\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}
$$
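As a quick sanity check of the quotient rule (a minimal sketch; the particular $u$ and $v$ are arbitrary choices of mine), `sympy` can verify the identity symbolically:

```python
import sympy as sp

x = sp.symbols('x')
u = sp.exp(2 * x)        # an arbitrary differentiable u(x)
v = 1 + sp.exp(x)        # an arbitrary differentiable v(x)

lhs = sp.diff(u / v, x)
rhs = (sp.diff(u, x) * v - sp.diff(v, x) * u) / v**2
assert sp.simplify(lhs - rhs) == 0
```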
Softmax regression for a single input
Scoring function
Suppose we use softmax regression to classify an input $x$ among $k$ labels. First apply the linear operation
$$
z_{\theta}(x)=\theta^T\cdot x
=\begin{bmatrix}
\theta_{1}^T \\
\theta_{2}^T \\
\vdots \\
\theta_{k}^T
\end{bmatrix}\cdot x
=\begin{bmatrix}
\theta_{1}^T\cdot x \\
\theta_{2}^T\cdot x \\
\vdots \\
\theta_{k}^T\cdot x
\end{bmatrix}
$$
where the input $x$ has size $(n+1)\times 1$ and $\theta$ has size $(n+1)\times k$; $n$ is the number of weights per class, $m$ is the number of training examples, and $k$ is the number of class labels.
The output $z$ has size $k\times 1$. The result is then normalized so that the outputs can be read as class probabilities, as follows
$$
h_{\theta}\left(x\right)=\begin{bmatrix} p\left(y=1 | x ; \theta\right) \\ p\left(y=2 | x ; \theta\right) \\ \vdots \\ p\left(y=k | x ; \theta\right)\end{bmatrix}
=\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x}} \begin{bmatrix} e^{\theta_{1}^{T} x} \\ e^{\theta_{2}^{T} x} \\ \vdots \\ e^{\theta_{k}^{T} x}\end{bmatrix}
$$
where each of $\theta_{1},\theta_{2},\dots,\theta_{k}$ has size $(n+1)\times 1$; the output is a $k\times 1$ vector whose entries are the predicted probabilities of the $k$ labels.
So for an input $x$, the probability that it belongs to label $j$ is
$$
p\left(y=j | x; \theta\right)=\frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}
$$
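The scoring step is easy to mirror in code. Below is a minimal numpy sketch (the shapes follow the text: `x` is $(n+1)\times 1$ with the bias term folded in, `theta` is $(n+1)\times k$; the sizes and variable names are assumptions of mine):

```python
import numpy as np

n, k = 4, 3                           # assumed sizes for illustration
rng = np.random.default_rng(0)

theta = rng.normal(size=(n + 1, k))   # one column of weights per class
x = rng.normal(size=(n + 1, 1))       # single input with bias term appended

z = theta.T @ x                       # scores, shape (k, 1)
p = np.exp(z) / np.sum(np.exp(z))     # normalized class probabilities
assert np.isclose(p.sum(), 1.0)
```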
Loss function
We use the cross-entropy loss as the loss function for softmax regression; it measures the loss incurred on the true label of the training example
$$
J(\theta)
= (-1)\cdot \sum_{j=1}^{k} 1\{y=j\} \ln p\left(y=j | x; \theta\right)
= (-1)\cdot \sum_{j=1}^{k} 1\{y=j\} \ln \frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}
$$
where $1\{\cdot\}$ is the indicator function, defined by
$1\{\text{a true statement}\} = 1$, and $1\{\text{a false statement}\} = 0$
That is, the indicator function outputs 1 when its argument is true, and 0 otherwise.
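In code, the indicator $1\{y=j\}$ is just an equality test, and for a whole label vector the usual trick is a one-hot matrix. A minimal sketch (variable names are my own):

```python
import numpy as np

def indicator(y, j):
    # 1{y = j}
    return 1.0 if y == j else 0.0

y = np.array([0, 2, 1])   # integer labels in [0, k)
k = 3
one_hot = np.eye(k)[y]    # row i is the vector of indicators 1{y_i = j}
```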
Differentiating with respect to the weight vector $\theta_{s}$:
$$
\frac{\partial J(\theta)}{\partial \theta_{s}}
=(-1)\cdot \frac{\partial }{\partial \theta_{s}}
\left[
\sum_{j=1,j\neq s}^{k} 1\{y=j\} \ln p\left(y=j | x; \theta\right)
+1\{y=s\} \ln p\left(y=s | x; \theta\right)
\right]
$$
$$
=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\} \frac{1}{p\left(y=j | x; \theta\right)}\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}}
+(-1)\cdot 1\{y=s\} \frac{1}{p\left(y=s | x; \theta\right)}\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}}
$$
There are two cases:
- When the probability in question is the one produced by $\theta_{s}$ itself, the linear score is $z=\theta_{s}^{T} x$ and the output is $p\left(y=s | x; \theta\right)=\frac{e^{\theta_{s}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}$; differentiating gives
$$
\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}}
=\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}
$$
where
$$
u(x) = e^{\theta_{s}^{T} x}, v(x)=\sum_{l=1}^{k} e^{\theta_{l}^{T} x}
$$
so
$$
\frac{\partial u(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x=u(x)\cdot x,\quad
\frac{\partial v(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x=u(x)\cdot x \\
\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} = p\left(y=s | x; \theta\right)\cdot x-p\left(y=s | x; \theta\right)^2\cdot x
$$
- When the probability in question is not the one produced by $\theta_{s}$, the linear score is $z=\theta_{j}^{T} x$ with $j\neq s$ and the output is $p\left(y=j | x; \theta\right)=\frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}$
$$
\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}}
=\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}
$$
where
$$
u(x) = e^{\theta_{j}^{T} x}, v(x)=\sum_{l=1}^{k} e^{\theta_{l}^{T} x}
$$
so
$$
\frac{\partial u(x)}{\partial \theta_s} = 0,\quad
\frac{\partial v(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x \\
\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} = -p\left(y=j | x; \theta\right)p\left(y=s | x; \theta\right)\cdot x
$$
Combining the two cases, the derivative is
$$
\frac{\partial J(\theta)}{\partial \theta_{s}}
=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\} \frac{1}{p\left(y=j | x; \theta\right)}\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}}
+(-1)\cdot 1\{y=s\} \frac{1}{p\left(y=s | x; \theta\right)}\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} \\
=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\} \frac{1}{p\left(y=j | x; \theta\right)}\cdot (-1)\cdot p\left(y=j | x; \theta\right) p\left(y=s | x; \theta\right)\cdot x + (-1)\cdot 1\{y=s\} \frac{1}{p\left(y=s | x; \theta\right)}\left[p\left(y=s | x; \theta\right)\cdot x-p\left(y=s | x; \theta\right)^2\cdot x\right] \\
=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\}\cdot (-1)\cdot p\left(y=s | x; \theta\right)\cdot x + (-1)\cdot 1\{y=s\} \left[x-p\left(y=s | x; \theta\right)\cdot x\right] \\
=(-1)\cdot 1\{y=s\} x - (-1)\cdot \sum_{j=1}^{k} 1\{y=j\} p\left(y=s | x; \theta\right)\cdot x
$$
Since $\sum_{j=1}^{k} 1\{y=j\}=1$, the final result is
$$
\frac{\partial J(\theta)}{\partial \theta_{s}}
=(-1)\cdot \left[ 1\{y=s\} - p\left(y=s | x; \theta\right) \right]\cdot x
$$
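This closed form is easy to validate numerically. The sketch below (helper names and sizes are mine; finite differences with a small epsilon) compares the analytic gradient with a numerical one for a random single example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, s = 4, 3, 1
theta = rng.normal(size=(n + 1, k))
x = rng.normal(size=(n + 1, 1))
y = 2                                        # true label

def prob(theta):
    z = theta.T @ x                          # scores, (k, 1)
    e = np.exp(z - z.max())                  # stable softmax
    return e / e.sum()

def loss(theta):
    return -np.log(prob(theta)[y, 0])

# analytic: -(1{y=s} - p_s) * x
analytic = -((1.0 if y == s else 0.0) - prob(theta)[s, 0]) * x

# numerical gradient with respect to theta[:, s]
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(n + 1):
    t1, t2 = theta.copy(), theta.copy()
    t1[i, s] += eps
    t2[i, s] -= eps
    numeric[i, 0] = (loss(t1) - loss(t2)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```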
Softmax regression for a batch of inputs
The previous section computed class probabilities, the loss, and gradients for a single example; we now extend the derivation to a batch of examples.
Scoring function
Suppose we use softmax regression to classify inputs over $k$ labels. First apply the linear operation to each example $x_{i}$
$$
z_{\theta}(x_{i})=\theta^T\cdot x_{i}
=\begin{bmatrix}
\theta_{1}^T \\
\theta_{2}^T \\
\vdots \\
\theta_{k}^T
\end{bmatrix}\cdot x_{i}
=\begin{bmatrix}
\theta_{1}^T\cdot x_{i} \\
\theta_{2}^T\cdot x_{i} \\
\vdots \\
\theta_{k}^T\cdot x_{i}
\end{bmatrix}
$$
where the input data $x$ has size $(n+1)\times m$ and $\theta$ has size $(n+1)\times k$; $n$ is the number of weights per class, $m$ is the number of training examples, and $k$ is the number of class labels.
The output $z$ has size $k\times m$. The result is then normalized column-wise so that the outputs can be read as class probabilities, as follows
$$
h_{\theta}\left(x_{i}\right)=\begin{bmatrix} p\left(y=1 | x_{i} ; \theta\right) \\
p\left(y=2 | x_{i} ; \theta\right) \\
\vdots \\
p\left(y=k | x_{i} ; \theta\right)\end{bmatrix}
=\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x_{i}}} \begin{bmatrix} e^{\theta_{1}^{T} x_{i}} \\
e^{\theta_{2}^{T} x_{i}} \\
\vdots \\
e^{\theta_{k}^{T} x_{i}}\end{bmatrix}
$$
where each of $\theta_{1},\theta_{2},\dots,\theta_{k}$ has size $(n+1)\times 1$; the output is a $k\times m$ matrix, and each column holds the predicted probabilities of the $k$ labels for one example.
So for an input $x_{i}$, the probability that it belongs to label $j$ is
$$
p\left(y_{i}=j | x_{i}; \theta\right)=\frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}
$$
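The batched computation is one matrix product plus a column-wise normalization. A minimal numpy sketch with the shapes from the text (sizes and names are assumptions of mine):

```python
import numpy as np

n, m, k = 4, 10, 3
rng = np.random.default_rng(2)

X = rng.normal(size=(n + 1, m))        # inputs, one column per example
theta = rng.normal(size=(n + 1, k))

Z = theta.T @ X                        # scores, shape (k, m)
E = np.exp(Z - Z.max(axis=0))          # stable exponentiation
P = E / E.sum(axis=0)                  # each column sums to 1
assert np.allclose(P.sum(axis=0), 1.0)
```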
Cost function
We use the cross-entropy loss as the cost function for softmax regression; it measures the loss incurred on the true label of each training example
$$
J(\theta)
= (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln p\left(y_{i}=j | x_{i}; \theta\right)
= (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln \frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}
$$
where $1\{\cdot\}$ is the indicator function, defined by
$1\{\text{a true statement}\} = 1$, and $1\{\text{a false statement}\} = 0$
That is, the indicator function outputs 1 when its argument is true, and 0 otherwise.
Differentiating with respect to the weight vector $\theta_{s}$:
$$
\frac{\partial J(\theta)}{\partial \theta_{s}}
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \frac{\partial }{\partial \theta_{s}}
\left[ \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \ln p\left(y_{i}=j | x_{i}; \theta\right)+1\{y_{i}=s\} \ln p\left(y_{i}=s | x_{i}; \theta\right) \right]
$$
$$
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}}
+(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}}
$$
There are two cases:
- When the probability in question is the one produced by $\theta_{s}$ itself, the linear score is $z=\theta_{s}^{T} x_{i}$ and the output is $p\left(y_{i}=s | x_{i}; \theta\right)=\frac{e^{\theta_{s}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}$; differentiating gives
$$
\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}}
=\frac{u^{\prime} v-v^{\prime} u}{v^{2}}
$$
where
$$
u = e^{\theta_{s}^{T} x_{i}}, \quad v=\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}
$$
so
$$
\frac{\partial u}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i}=u\cdot x_{i},\quad
\frac{\partial v}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i}=u\cdot x_{i} \\
\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} = p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)^2\cdot x_{i}
$$
- When the probability in question is not the one produced by $\theta_{s}$, the linear score is $z=\theta_{j}^{T} x_{i}$ with $j\neq s$ and the output is $p\left(y_{i}=j | x_{i}; \theta\right)=\frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}$
$$
\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}}
=\frac{u^{\prime} v-v^{\prime} u}{v^{2}}
$$
where
$$
u = e^{\theta_{j}^{T} x_{i}}, \quad v=\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}
$$
so
$$
\frac{\partial u}{\partial \theta_s} = 0,\quad
\frac{\partial v}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i} \\
\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} = -p\left(y_{i}=j | x_{i}; \theta\right)p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}
$$
Combining the two cases, the derivative is
$$
\frac{\partial J(\theta)}{\partial \theta_{s}}
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}}
+(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} \\
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\cdot (-1)\cdot p\left(y_{i}=j | x_{i}; \theta\right) p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i} + (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\left[p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)^2\cdot x_{i}\right] \\
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\}\cdot (-1)\cdot p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i} + (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \left[x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}\right] \\
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} x_{i} - (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}
$$
Since $\sum_{j=1}^{k} 1\{y_{i}=j\}=1$, the final result is
$$
\frac{\partial J(\theta)}{\partial \theta_{s}}
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \left[ 1\{y_{i}=s\} - p\left(y_{i}=s | x_{i}; \theta\right) \right]\cdot x_{i}
$$
Gradient descent
For the matrix form, stack the data row-wise: the weight matrix $W$ (i.e. $\theta$) has size $(n+1)\times k$, the input matrix has size $m\times (n+1)$, and the output matrix has size $m\times k$.
The matrix form of the derivative is:
$$
\frac{\partial J(\theta)}{\partial \theta}
=\frac{1}{m}\cdot \sum_{i=1}^{m}
\begin{bmatrix}
(-1)\cdot\left[ 1\{y_{i}=1\} - p\left(y_{i}=1 | x_{i}; \theta\right) \right]\cdot x_{i} \\
(-1)\cdot\left[ 1\{y_{i}=2\} - p\left(y_{i}=2 | x_{i}; \theta\right) \right]\cdot x_{i} \\
\vdots \\
(-1)\cdot\left[ 1\{y_{i}=k\} - p\left(y_{i}=k | x_{i}; \theta\right) \right]\cdot x_{i}
\end{bmatrix}
=(-1)\cdot \frac{1}{m}\cdot X_{m\times (n+1)}^T \cdot (I_{m\times k} - Y_{m\times k})
$$
where $X$ holds one input per row, $I_{m\times k}$ is the one-hot indicator matrix of the true labels, and $Y_{m\times k}$ is the matrix of predicted probabilities.
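In this row-major layout the whole gradient is a single matrix product. A numpy sketch (the matrix names `X`, `I`, `Y` follow the equation; the sizes, seed, and learning rate are assumptions of mine):

```python
import numpy as np

m, n, k = 10, 4, 3
rng = np.random.default_rng(3)

X = rng.normal(size=(m, n + 1))              # one example per row
W = rng.normal(size=(n + 1, k))              # theta as a weight matrix
y = rng.integers(0, k, size=m)

Z = X @ W                                    # scores, (m, k)
E = np.exp(Z - Z.max(axis=1, keepdims=True))
Y = E / E.sum(axis=1, keepdims=True)         # predicted probabilities
I = np.eye(k)[y]                             # one-hot indicator matrix

grad = -(X.T @ (I - Y)) / m                  # (n+1, k), matches the formula
W -= 0.1 * grad                              # one gradient-descent step
```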
References:
Softmax regression for Iris classification
Derivative of Softmax loss function
When the score, loss, and gradient are computed for a single input at a time, as in the first derivation, stochastic gradient descent is used for the weight updates.
Parameter redundancy and weight decay
Softmax regression has redundant parameters: subtracting the same vector $\psi$ from every parameter vector $\theta_{j}$ leaves the predictions unchanged. Proof:
$$
\begin{aligned} p\left(y^{(i)}=j | x^{(i)} ; \theta\right) &=\frac{e^{\left(\theta_{j}-\psi\right)^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\left(\theta_{l}-\psi\right)^{T} x^{(i)}}} \\ &=\frac{e^{\theta_{j}^{T} x^{(i)}} e^{-\psi^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x^{(i)}} e^{-\psi^{T} x^{(i)}}} \\ &=\frac{e^{\theta_{j}^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x^{(i)}}} \end{aligned}
$$
If $(\theta_{1},\theta_{2},\dots,\theta_{k})$ minimizes $J(\theta)$, then $(\theta_{1}-\psi,\theta_{2}-\psi,\dots,\theta_{k}-\psi)$ attains exactly the same minimum.
Moreover, since the loss function is convex, every local minimum is a global minimum, so the minimizer is not unique; optimization can settle on solutions with unnecessarily large weights, which hurts the model's ability to generalize.
Adding weight decay to the cost function avoids this over-parameterization and yields a model that generalizes better.
Add an L2 regularization term to the cost function, as follows:
$$
J(\theta)
= (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln p\left(y_{i}=j | x_{i}; \theta\right) + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{i j}^{2}
= (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln \frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}} + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{i j}^{2}
$$
The derivative becomes:
$$
\frac{\partial J(\theta)}{\partial \theta_{s}}
=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \left[ 1\{y_{i}=s\} - p\left(y_{i}=s | x_{i}; \theta\right) \right]\cdot x_{i}+ \lambda \theta_{s}
$$
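Relative to the gradient sketch above, weight decay adds only the $\lambda \theta$ term (a one-line change, reusing the hypothetical names `X`, `I`, `Y`, `W`, `m` from that sketch):

```python
lam = 1e-3                                 # assumed regularization strength
grad = -(X.T @ (I - Y)) / m + lam * W      # regularized gradient, shape (n+1, k)
```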
The cost might be implemented as follows (a sketch under assumed conventions: `scores` holds raw class scores of shape `(m, k)`, `indicator` is the one-hot label matrix, and `W` is the weight matrix):

```python
import numpy as np

LAMBDA = 1e-3  # assumed weight-decay strength

def compute_loss(scores, indicator, W):
    # stable log-softmax: log p = z - max - log(sum(exp(z - max)))
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    # average cross-entropy over the batch, plus the L2 weight-decay term
    ce = -np.sum(indicator * log_probs) / scores.shape[0]
    return ce + 0.5 * LAMBDA * np.sum(W ** 2)
```
The iris dataset
We use the iris dataset; see Iris Species.
It contains 4 features:
- SepalLengthCm - sepal length
- SepalWidthCm - sepal width
- PetalLengthCm - petal length
- PetalWidthCm - petal width
and 3 classes:
- Iris-setosa
- Iris-versicolor
- Iris-virginica
A data loader sketch (the original reads the Kaggle Iris Species CSV; here scikit-learn's bundled copy of the dataset stands in, which is an assumption of mine):

```python
import numpy as np
from sklearn.datasets import load_iris  # stand-in for the Kaggle CSV

def load_data(shuffle=True, tsize=0.8):
    data = load_iris()
    x, y = data.data.astype(np.float64), data.target
    if shuffle:
        idx = np.random.permutation(len(x))
        x, y = x[idx], y[idx]
    split = int(len(x) * tsize)         # tsize: fraction used for training
    return x[:split], x[split:], y[:split], y[split:]
```
numpy implementation
A condensed training sketch (hypothetical hyperparameters; it reuses `load_data` and the batch gradient with weight decay derived above):

```python
# -*- coding: utf-8 -*-
import numpy as np

def train(epochs=100000, lr=0.01, lam=1e-3):
    x_train, x_test, y_train, y_test = load_data()
    m, n = x_train.shape
    k = len(np.unique(y_train))

    X = np.hstack([x_train, np.ones((m, 1))])   # append the bias column -> (m, n+1)
    I = np.eye(k)[y_train]                      # one-hot indicator matrix (m, k)
    W = 0.01 * np.random.randn(n + 1, k)

    for _ in range(epochs):
        scores = X @ W
        E = np.exp(scores - scores.max(axis=1, keepdims=True))
        Y = E / E.sum(axis=1, keepdims=True)    # predicted probabilities (m, k)
        W -= lr * (-(X.T @ (I - Y)) / m + lam * W)
    return W
```
The best result over 100,000 training iterations, together with the corresponding test-set accuracy (`# 测试集精度`), is then reported.
Exponentials and numerical stability
Softmax regression normalizes the linear scores with the exponential function $e^x$, which can overflow numerically. The usual fix is to multiply both the numerator and the denominator by a constant $C$:
$$
\frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}=\frac{C e^{f_{y_{i}}}}{C \sum_{j} e^{f_{j}}}=\frac{e^{f_{y_{i}}+\log C}}{\sum_{j} e^{f_{j}+\log C}}
$$
This leaves the result unchanged. Choosing $\log C=-\max_{j} f_{j}$ shifts the entries of $f$ so that the largest value is $0$, which avoids numerical instability.
An implementation of this trick might look like:

```python
import numpy as np

def softmax(x):
    # shift by the maximum score (log C = -max_j f_j) so the largest exponent is 0
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)
```
Softmax regression and logistic regression
Softmax regression is the extension of logistic regression to multi-class classification; for $k=2$, the softmax regression model reduces to a logistic regression model. Subtracting $\theta_{1}$ from both parameter vectors (using the parameter redundancy shown above):
$$
h_{\theta}(x)=\frac{1}{e^{\theta_{1}^{T} x}+e^{\theta_{2}^{T} x}} \begin{bmatrix} e^{\theta_{1}^{T} x} \\ e^{\theta_{2}^{T} x}\end{bmatrix}
=\frac{1}{e^{\vec{0}^{T} x}+e^{(\theta_{2}-\theta_{1})^{T} x}} \begin{bmatrix} e^{\vec{0}^{T} x} \\ e^{(\theta_{2}-\theta_{1})^{T} x}\end{bmatrix} \\
=\frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}} \begin{bmatrix} 1 \\ e^{(\theta_{2}-\theta_{1})^{T} x}\end{bmatrix}
= \begin{bmatrix} \frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}} \\ \frac{e^{(\theta_{2}-\theta_{1})^{T} x}}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}\end{bmatrix}
=\begin{bmatrix} \frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}} \\ 1- \frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}\end{bmatrix}
$$
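A quick numeric check of this reduction (a sketch; the sizes and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.normal(size=(5, 2))          # two classes, k = 2
x = rng.normal(size=(5, 1))

z = theta.T @ x
p_softmax = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))

# logistic regression on the weight difference theta_2 - theta_1
w = theta[:, [1]] - theta[:, [0]]
p1 = 1.0 / (1.0 + np.exp(w.T @ x))       # p(y = 1 | x)
assert np.isclose(p_softmax[0, 0], p1[0, 0])
```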
For a multi-class task one can either fit a single softmax regression model, or fit several binary logistic regression classifiers.
The difference comes down to whether the classes are mutually exclusive: if they are, softmax regression is the better choice; if they are not, separate logistic regression classifiers are more appropriate.