softmax回归常用于多分类问题，其输出可直接看作各类别的预测概率。

## 对数函数操作

$$\log_{a} x+\log_{a} y = \log_{a} (xy)$$

$$\log_{a} x-\log_{a} y = \log_{a} \frac{x}{y}$$

$$e^{x}\cdot e^{y} = e^{x+y}$$

## 求导公式

$$\left(\frac{u(x)}{v(x)}\right)^{\prime}=\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}$$

## 单个输入数据进行softmax回归计算

### 评分函数

$$z_{\theta}(x)=\theta^T\cdot x =\begin{bmatrix} \theta_{1}^T\\ \theta_{2}^T\\ \vdots\\ \theta_{k}^T \end{bmatrix}\cdot x =\begin{bmatrix} \theta_{1}^T\cdot x\\ \theta_{2}^T\cdot x\\ \vdots\\ \theta_{k}^T\cdot x \end{bmatrix}$$

$$h_{\theta}\left(x\right)=\left[ \begin{array}{c}{p\left(y=1 | x ; \theta\right)} \\ {p\left(y=2 | x ; \theta\right)} \\ {\vdots} \\ {p\left(y=k | x ; \theta\right)}\end{array}\right] =\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x}} \left[ \begin{array}{c}{e^{\theta_{1}^{T} x}} \\ {e^{\theta_{2}^{T} x}} \\ {\vdots} \\ {e^{\theta_{k}^{T} x}}\end{array}\right]$$

$$p\left(y=j | x; \theta\right)=\frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}$$
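上面的评分与归一化可以用numpy直接演算（以下$\theta$与$x$均为任取的示例值）：

```python
import numpy as np

def softmax(z):
    """将评分向量 z 归一化为概率分布"""
    e = np.exp(z)
    return e / e.sum()

# 示例参数: k=3 个类别, n=2 维输入
theta = np.array([[0.2, -0.1],
                  [0.5,  0.3],
                  [-0.4, 0.1]])   # 形状 (k, n), 每行是 theta_j^T
x = np.array([1.0, 2.0])

z = theta @ x          # 评分函数 z_theta(x), 形状 (k,)
p = softmax(z)         # p[j-1] = p(y=j | x; theta), 各分量和为 1
```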

### 损失函数

$$J(\theta) = (-1)\cdot \sum_{j=1}^{k} 1\{y=j\} \ln p\left(y=j | x; \theta\right) = (-1)\cdot \sum_{j=1}^{k} 1\{y=j\} \ln \frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}$$

$$\frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{\partial }{\partial \theta_{s}} \left[ \sum_{j=1,j\neq s}^{k} 1\{y=j\} \ln p\left(y=j | x; \theta\right) +1\{y=s\} \ln p\left(y=s | x; \theta\right) \right]$$

$$=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\} \frac{1}{p\left(y=j | x; \theta\right)}\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} +(-1)\cdot 1\{y=s\} \frac{1}{p\left(y=s | x; \theta\right)}\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}}$$

• 当计算结果正好由$\theta_{s}$计算得到，此时线性运算为$z=\theta_{s}^{T} x$，计算结果为$p\left(y=s | x; \theta\right)=\frac{e^{\theta_{s}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}$，求导如下

$$\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}$$

$$u(x) = e^{\theta_{s}^{T} x}, v(x)=\sum_{l=1}^{k} e^{\theta_{l}^{T} x}$$

$$\frac{\partial u(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x=u(x)\cdot x, \quad \frac{\partial v(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x=u(x)\cdot x$$

$$\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} = p\left(y=s | x; \theta\right)\cdot x-p\left(y=s | x; \theta\right)^2\cdot x$$

• 当计算结果不是由$\theta_{s}$计算得到，此时线性运算为$z=\theta_{j}^{T} x, j\neq s$，计算结果为$p\left(y=j | x; \theta\right)=\frac{e^{\theta_{j}^{T} x}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x}}$

$$\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}$$

$$u(x) = e^{\theta_{j}^{T} x}, v(x)=\sum_{l=1}^{k} e^{\theta_{l}^{T} x}$$

$$\frac{\partial u(x)}{\partial \theta_s} = 0, \quad \frac{\partial v(x)}{\partial \theta_s} = e^{\theta_{s}^{T} x}\cdot x$$

$$\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} = -p\left(y=s | x; \theta\right)p\left(y=j | x; \theta\right)\cdot x$$

$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_{s}} &=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\} \frac{1}{p\left(y=j | x; \theta\right)}\frac{\partial p\left(y=j | x; \theta\right)}{\partial \theta_{s}} +(-1)\cdot 1\{y=s\} \frac{1}{p\left(y=s | x; \theta\right)}\frac{\partial p\left(y=s | x; \theta\right)}{\partial \theta_{s}} \\ &=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\} \frac{1}{p\left(y=j | x; \theta\right)}\cdot (-1)\cdot p\left(y=s | x; \theta\right)p\left(y=j | x; \theta\right)\cdot x + (-1)\cdot 1\{y=s\} \frac{1}{p\left(y=s | x; \theta\right)}\left[p\left(y=s | x; \theta\right)\cdot x-p\left(y=s | x; \theta\right)^2\cdot x\right] \\ &=(-1)\cdot \sum_{j=1,j\neq s}^{k} 1\{y=j\}\cdot (-1)\cdot p\left(y=s | x; \theta\right)\cdot x + (-1)\cdot 1\{y=s\} \left[x-p\left(y=s | x; \theta\right)\cdot x\right] \\ &=(-1)\cdot 1\{y=s\}\, x - (-1)\cdot \sum_{j=1}^{k} 1\{y=j\}\, p\left(y=s | x; \theta\right)\cdot x \end{aligned}$$

由于指示函数满足$\sum_{j=1}^{k} 1\{y=j\}=1$，最终得到

$$\frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \left[ 1\{y=s\} - p\left(y=s | x; \theta\right) \right]\cdot x$$
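该梯度公式可以用数值微分做一个快速检验（示例实现，类别从0编号，参数均为任取示例值）：

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(theta, x, y):
    """单样本交叉熵损失 J = -ln p(y | x; theta)"""
    return -np.log(softmax(theta @ x)[y])

def grad(theta, x, y):
    """解析梯度: dJ/d theta_s = -(1{y=s} - p(y=s|x; theta)) * x"""
    p = softmax(theta @ x)
    one_hot = np.zeros(theta.shape[0])
    one_hot[y] = 1.0
    return -np.outer(one_hot - p, x)   # 形状 (k, n)

# 中心差分数值梯度, 与解析梯度逐元素对照
theta = np.array([[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]])
x, y, eps = np.array([1.0, 2.0]), 1, 1e-6
g = grad(theta, x, y)
num = np.zeros_like(theta)
for s in range(theta.shape[0]):
    for d in range(theta.shape[1]):
        tp, tm = theta.copy(), theta.copy()
        tp[s, d] += eps
        tm[s, d] -= eps
        num[s, d] = (loss(tp, x, y) - loss(tm, x, y)) / (2 * eps)
```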

## 批量数据进行softmax回归计算

### 评分函数

$$z_{\theta}(x_{i})=\theta^T\cdot x_{i} =\begin{bmatrix} \theta_{1}^T\\ \theta_{2}^T\\ \vdots\\ \theta_{k}^T \end{bmatrix}\cdot x_{i} =\begin{bmatrix} \theta_{1}^T\cdot x_{i}\\ \theta_{2}^T\cdot x_{i}\\ \vdots\\ \theta_{k}^T\cdot x_{i} \end{bmatrix}$$

$$h_{\theta}\left(x_{i}\right)=\left[ \begin{array}{c}{p\left(y_{i}=1 | x_{i} ; \theta\right)} \\ {p\left(y_{i}=2 | x_{i} ; \theta\right)} \\ {\vdots} \\ {p\left(y_{i}=k | x_{i} ; \theta\right)}\end{array}\right] =\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x_{i}}} \left[ \begin{array}{c}{e^{\theta_{1}^{T} x_{i}}} \\ {e^{\theta_{2}^{T} x_{i}}} \\ {\vdots} \\ {e^{\theta_{k}^{T} x_{i}}}\end{array}\right]$$

$$p\left(y_{i}=j | x_{i}; \theta\right)=\frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}$$

### 代价函数

$$J(\theta) = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln p\left(y_{i}=j | x_{i}; \theta\right) = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln \frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}$$

$$\frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \frac{\partial }{\partial \theta_{s}} \left[ \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \ln p\left(y_{i}=j | x_{i}; \theta\right)+1\{y_{i}=s\} \ln p\left(y_{i}=s | x_{i}; \theta\right) \right]$$

$$=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} +(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}}$$

• 当计算结果正好由$\theta_{s}$计算得到，此时线性运算为$z=\theta_{s}^{T} x_{i}$，计算结果为$p\left(y_{i}=s | x_{i}; \theta\right)=\frac{e^{\theta_{s}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}$，求导如下

$$\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}$$

$$u(x_{i}) = e^{\theta_{s}^{T} x_{i}}, \quad v(x_{i})=\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}$$

$$\frac{\partial u(x_{i})}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i}=u(x_{i})\cdot x_{i}, \quad \frac{\partial v(x_{i})}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i}=u(x_{i})\cdot x_{i}$$

$$\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} = p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)^2\cdot x_{i}$$

• 当计算结果不是由$\theta_{s}$计算得到，此时线性运算为$z=\theta_{j}^{T} x_{i}, j\neq s$，计算结果为$p\left(y_{i}=j | x_{i}; \theta\right)=\frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}$

$$\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} =\frac{u^{\prime}(x) v(x)-v^{\prime}(x) u(x)}{v^{2}(x)}$$

$$u(x_{i}) = e^{\theta_{j}^{T} x_{i}}, \quad v(x_{i})=\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}$$

$$\frac{\partial u(x_{i})}{\partial \theta_s} = 0, \quad \frac{\partial v(x_{i})}{\partial \theta_s} = e^{\theta_{s}^{T} x_{i}}\cdot x_{i}$$

$$\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} = -p\left(y_{i}=s | x_{i}; \theta\right)p\left(y_{i}=j | x_{i}; \theta\right)\cdot x_{i}$$

$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_{s}} &=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=j | x_{i}; \theta\right)}{\partial \theta_{s}} +(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\frac{\partial p\left(y_{i}=s | x_{i}; \theta\right)}{\partial \theta_{s}} \\ &=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\} \frac{1}{p\left(y_{i}=j | x_{i}; \theta\right)}\cdot (-1)\cdot p\left(y_{i}=s | x_{i}; \theta\right)p\left(y_{i}=j | x_{i}; \theta\right)\cdot x_{i} + (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \frac{1}{p\left(y_{i}=s | x_{i}; \theta\right)}\left[p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)^2\cdot x_{i}\right] \\ &=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1,j\neq s}^{k} 1\{y_{i}=j\}\cdot (-1)\cdot p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i} + (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\} \left[x_{i}-p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i}\right] \\ &=(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} 1\{y_{i}=s\}\, x_{i} - (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\}\, p\left(y_{i}=s | x_{i}; \theta\right)\cdot x_{i} \end{aligned}$$

$$\frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \left[ 1\{y_{i}=s\} - p\left(y_{i}=s | x_{i}; \theta\right) \right]\cdot x_{i}$$

## 梯度下降

$$\frac{\partial J(\theta)}{\partial \theta} =\frac{1}{m}\cdot \sum_{i=1}^{m} \begin{bmatrix} (-1)\cdot\left[ 1\{y_{i}=1\} - p\left(y_{i}=1 | x_{i}; \theta\right) \right]\cdot x_{i}\\ (-1)\cdot\left[ 1\{y_{i}=2\} - p\left(y_{i}=2 | x_{i}; \theta\right) \right]\cdot x_{i}\\ \vdots\\ (-1)\cdot\left[ 1\{y_{i}=k\} - p\left(y_{i}=k | x_{i}; \theta\right) \right]\cdot x_{i} \end{bmatrix} =(-1)\cdot \frac{1}{m}\cdot X_{m\times (n+1)}^T \cdot (I_{m\times k} - Y_{m\times k})$$
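按上面的矩阵形式，批量梯度下降可以向量化实现如下（$I$为one-hot指示矩阵，$Y$为各行的预测概率；学习率、迭代次数与数据均为示例取值）：

```python
import numpy as np

def softmax_rows(Z):
    """对矩阵的每一行做 softmax"""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def fit(X, y, k, lr=0.5, epochs=200):
    """批量梯度下降, 梯度为 -(1/m) * [X^T (I - Y)]^T, 返回形状 (k, n) 的 theta"""
    m, n = X.shape
    theta = np.zeros((k, n))
    I = np.eye(k)[y]                     # 指示矩阵 I_{m×k}
    for _ in range(epochs):
        Y = softmax_rows(X @ theta.T)    # 概率矩阵 Y_{m×k}
        grad = -(X.T @ (I - Y)).T / m    # 对应上式梯度, 形状 (k, n)
        theta -= lr * grad
    return theta

# 在一个线性可分的小数据集上检验
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
              [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]])
y = np.array([0, 0, 1, 1, 2, 2])
theta = fit(X, y, k=3)
pred = (X @ theta.T).argmax(axis=1)
```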


## 参数冗余和权重衰减

softmax回归存在参数冗余现象，即对每个参数向量$\theta_{j}$同时减去同一向量$\psi$，预测结果不变。证明如下：

$$\begin{aligned} p\left(y^{(i)}=j | x^{(i)} ; \theta\right) &=\frac{e^{\left(\theta_{j}-\psi\right)^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\left(\theta_{l}-\psi\right)^{T} x^{(i)}}} \\ &=\frac{e^{\theta_{j}^{T} x^{(i)}} e^{-\psi^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x^{(i)}} e^{-\psi^{T} x^{(i)}}} \\ &=\frac{e^{\theta_{j}^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x^{(i)}}} \end{aligned}$$
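这一冗余可以用随机参数做数值验证（维度与随机种子为任取示例）：

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))      # 任取 k=3, n=4 的参数
psi = rng.normal(size=4)             # 任取向量 psi
x = rng.normal(size=4)

def probs(T):
    """给定参数矩阵 T, 计算 softmax 概率"""
    e = np.exp(T @ x)
    return e / e.sum()

p1 = probs(theta)
p2 = probs(theta - psi)              # 每个 theta_j 同减 psi, 概率应不变
```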

为消除参数冗余、使最优解唯一，可在代价函数中加入权重衰减项：

$$J(\theta) = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln p\left(y_{i}=j | x_{i}; \theta\right) + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{i j}^{2} = (-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_{i}=j\} \ln \frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}} + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{i j}^{2}$$

$$\frac{\partial J(\theta)}{\partial \theta_{s}} =(-1)\cdot \frac{1}{m}\cdot \sum_{i=1}^{m} \left[ 1\{y_{i}=s\} - p\left(y_{i}=s | x_{i}; \theta\right) \right]\cdot x_{i}+ \lambda \theta_{s}$$
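加入权重衰减后，梯度只需在原梯度上加上$\lambda\theta$一项。下面是一个示意实现（函数名、数据均为假设的示例）：

```python
import numpy as np

def grad_weight_decay(theta, X, y, lam):
    """带权重衰减的批量梯度: -(1/m) * [X^T (I - Y)]^T + lam * theta"""
    m, k = X.shape[0], theta.shape[0]
    Z = X @ theta.T
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    Y = E / E.sum(axis=1, keepdims=True)   # 概率矩阵
    I = np.eye(k)[y]                       # 指示矩阵
    return -(X.T @ (I - Y)).T / m + lam * theta

theta = np.array([[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]])
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([0, 2])
g0 = grad_weight_decay(theta, X, y, 0.0)   # 不加衰减
g1 = grad_weight_decay(theta, X, y, 0.1)   # 衰减系数 0.1
```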

## 鸢尾数据集

4个变量：

• SepalLengthCm - 花萼长度
• SepalWidthCm - 花萼宽度
• PetalLengthCm - 花瓣长度
• PetalWidthCm - 花瓣宽度

3个类别：

• Iris-setosa
• Iris-versicolor
• Iris-virginica
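若环境中装有scikit-learn，可以如下载入该数据集（仅作示意）：

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # X: (150, 4) 的特征矩阵, y: 0/1/2 三个类别
feature_names = iris.feature_names   # 花萼/花瓣的长与宽
target_names = iris.target_names     # setosa / versicolor / virginica
```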

## numpy实现

### 指数计算 - 数值稳定性考虑

softmax回归中，需要利用指数函数$e^x$对线性评分结果进行归一化，这有可能造成数值上溢。常用的做法是给分子分母同乘一个常数$C$：

$$\frac{e^{f_{y_{i}}}}{\sum_{j} e^{f_{j}}}=\frac{C e^{f_{y_{i}}}}{C \sum_{j} e^{f_{j}}}=\frac{e^{f_{y_{i}}+\log C}}{\sum_{j} e^{f_{j}+\log C}}$$
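通常取$\log C=-\max_{j} f_{j}$，使最大指数平移为0。numpy实现如下：

```python
import numpy as np

def stable_softmax(f):
    """取 log C = -max(f): 分子分母同乘 e^{-max(f)}, 避免 exp 上溢"""
    shifted = f - f.max()   # 平移后最大指数为 0
    e = np.exp(shifted)
    return e / e.sum()

f = np.array([1000.0, 1001.0, 1002.0])
# 直接计算 np.exp(f) 会上溢为 inf; 平移后计算安全且结果不变
p = stable_softmax(f)
```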

## softmax回归和logistic回归

softmax回归是logistic回归在多分类任务上的扩展。当$k=2$时，利用上节参数冗余的性质，将每个$\theta_{j}$减去$\theta_{1}$，softmax回归模型即可化为logistic回归模型：

$$\begin{aligned} h_{\theta}(x)&=\frac{1}{e^{\theta_{1}^{T} x}+e^{\theta_{2}^{T} x}} \begin{bmatrix} e^{\theta_{1}^{T} x}\\ e^{\theta_{2}^{T} x}\end{bmatrix} =\frac{1}{e^{\vec{0}^{T} x}+e^{(\theta_{2}-\theta_{1})^{T} x}} \begin{bmatrix} e^{\vec{0}^{T} x}\\ e^{(\theta_{2}-\theta_{1})^{T} x}\end{bmatrix} \\ &=\frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}} \begin{bmatrix} 1\\ e^{(\theta_{2}-\theta_{1})^{T} x}\end{bmatrix} = \begin{bmatrix} \frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}\\ \frac{e^{(\theta_{2}-\theta_{1})^{T} x}}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}\end{bmatrix} =\begin{bmatrix} \frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}\\ 1- \frac{1}{1+e^{(\theta_{2}-\theta_{1})^{T} x}}\end{bmatrix} \end{aligned}$$
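该等价关系可以数值验证：$k=2$时softmax的第一个分量恰为sigmoid函数值（参数为任取示例）：

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta1 = np.array([0.3, -0.2])
theta2 = np.array([-0.1, 0.4])
x = np.array([1.0, 2.0])

z = np.array([theta1 @ x, theta2 @ x])
e = np.exp(z)
p_softmax = e / e.sum()                        # k=2 的 softmax 输出
p_logistic = sigmoid(-(theta2 - theta1) @ x)   # 即 1 / (1 + e^{(theta2-theta1)^T x})
```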