第7章 贝叶斯分类器
本章是从概率框架下的贝叶斯视角给出机器学习问题的建模方法,不同于前几章着重于算法具体实现,本章的理论性会更强。朴素贝叶斯算法常用于文本分类,例如用于广告邮件检测,贝叶斯网和EM算法均属于概率图模型的范畴,因此可合并至第14章一起学习。
7.1 贝叶斯决策论
7.1.1 式(7.5)的推导
由式(7.1)和式(7.4)可得
R(ci∣x)=1∗P(c1∣x)+...+1∗P(ci−1∣x)+0∗P(ci∣x)+1∗P(ci+1∣x)+...+1∗P(cN∣x)R(c_i|\boldsymbol x)=1*P(c_1|\boldsymbol x)+...+1*P(c_{i-1}|\boldsymbol x)+0*P(c_i|\boldsymbol x)+1*P(c_{i+1}|\boldsymbol x)+...+1*P(c_N|\boldsymbol x)
又∑j=1NP(cj∣x)=1\sum_{j=1}^{N}P(c_j|\boldsymbol x)=1,则
R(ci∣x)=1−P(ci∣x)R(c_i|\boldsymbol x)=1-P(c_i|\boldsymbol x)
此即式(7.5)。
7.1.2 式(7.6)的推导
将式(7.5)代入式(7.3)即可推得此式
7.1.3 判别式模型与生成式模型
对于判别式模型来说,就是在已知x\boldsymbol{x}的条件下判别其类别标记cc,即求后验概率P(c∣x)P(c|\boldsymbol{x}),前几章介绍的模型都属于判别式模型的范畴,尤其是对数几率回归最为直接明了,式(3.23)和式(3.24)直接就是后验概率的形式。
对于生成式模型来说,理解起来比较抽象,但是可通过思考以下两个问题来理解。
(1)对于数据集来说,其中的样本是如何生成的?通常假设数据集中的样本服从独立同分布,即每个样本都是按照联合概率分布P(x,c)P(\boldsymbol{x},c)采样而得,也可以描述为根据P(x,c)P(\boldsymbol{x},c)生成的。
(2)若已知样本x\boldsymbol{x}和联合概率分布P(x,c)P(\boldsymbol{x},c),如何预测类别呢?若样本x\boldsymbol{x}和联合概率分布P(x,c)P(\boldsymbol{x},c)已知,则可以分别求出x\boldsymbol{x}属于各个类别的概率,即P(x,c1),P(x,c2),...,P(x,cN)P(\boldsymbol{x},c_1),P(\boldsymbol{x},c_2),...,P(\boldsymbol{x},c_N),然后选择概率最大的类别作为样本x\boldsymbol{x}的预测结果。
因此,之所以称为"生成式"模型,是因为所求的概率P(x,c)P(\boldsymbol{x},c)是生成样本x\boldsymbol{x}的概率。
7.2 极大似然估计
7.2.1 式(7.12)和(7.13)的推导
根据式(7.11)和式(7.10)可知参数求解式为
θ^c=argmaxθcLL(θc)=argminθc−LL(θc)=argminθc−∑x∈DclogP(x∣θc)\begin{aligned} \hat{\boldsymbol{\theta}}_{c}&=\underset{\boldsymbol{\theta}_{c}}{\arg \max } LL\left(\boldsymbol{\theta}_{c}\right) \\ &=\underset{\boldsymbol{\theta}_{c}}{\arg \min } -LL\left(\boldsymbol{\theta}_{c}\right) \\ &= \underset{\boldsymbol{\theta}_{c}}{\arg \min }-\sum_{\boldsymbol{x} \in D_{c}} \log P\left(\boldsymbol{x} | \boldsymbol{\theta}_{c}\right) \end{aligned}
由"西瓜书"上下文可知,此时假设概率密度函数p(x∣c)∼N(μc,σc2)p(\boldsymbol{x} | c) \sim \mathcal{N}\left(\boldsymbol{\mu}_{c}, \boldsymbol{\sigma}_{c}^{2}\right),其等价于假设
P(x∣θc)=P(x∣μc,σc2)=1(2π)d∣Σc∣exp(−12(x−μc)TΣc−1(x−μc))P\left(\boldsymbol{x} | \boldsymbol{\theta}_{c}\right)=P\left(\boldsymbol{x} | \boldsymbol{\mu}_{c}, \boldsymbol{\sigma}_{c}^{2}\right)=\frac{1}{\sqrt{(2 \pi)^{d}|\boldsymbol{\Sigma}_c|}} \exp \left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_c)\right)
其中,dd表示x\boldsymbol{x}的维数,Σc=σc2\boldsymbol{\Sigma}_c=\boldsymbol{\sigma}_{c}^{2}为对称正定协方差矩阵,∣Σc∣|\boldsymbol{\Sigma}_c|表示Σc\boldsymbol{\Sigma}_c的行列式。将其代入参数求解式可得
(μ^c,Σ^c)=argmin(μc,Σc)−∑x∈Dclog[1(2π)d∣Σc∣exp(−12(x−μc)TΣc−1(x−μc))]=argmin(μc,Σc)−∑x∈Dc[−d2log(2π)−12log∣Σc∣−12(x−μc)TΣc−1(x−μc)]=argmin(μc,Σc)∑x∈Dc[d2log(2π)+12log∣Σc∣+12(x−μc)TΣc−1(x−μc)]=argmin(μc,Σc)∑x∈Dc[12log∣Σc∣+12(x−μc)TΣc−1(x−μc)]\begin{aligned} (\hat{\boldsymbol{\mu}}_{c}, \hat{\boldsymbol{\Sigma}}_{c})&= \underset{(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_c)}{\arg \min }-\sum_{\boldsymbol{x} \in D_{c}} \log\left[\frac{1}{\sqrt{(2 \pi)^{d}|\boldsymbol{\Sigma}_c|}} \exp \left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_c)\right)\right] \\ &= \underset{(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_c)}{\arg \min }-\sum_{\boldsymbol{x} \in D_{c}} \left[-\frac{d}{2}\log(2 \pi)-\frac{1}{2}\log|\boldsymbol{\Sigma}_c|-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_c)\right] \\ &= \underset{(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_c)}{\arg \min }\sum_{\boldsymbol{x} \in D_{c}} \left[\frac{d}{2}\log(2 \pi)+\frac{1}{2}\log|\boldsymbol{\Sigma}_c|+\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_c)\right] \\ &= \underset{(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_c)}{\arg \min }\sum_{\boldsymbol{x} \in D_{c}} \left[\frac{1}{2}\log|\boldsymbol{\Sigma}_c|+\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_c)\right] \\ \end{aligned}
假设此时数据集DcD_c中的样本个数为nn,即∣Dc∣=n|D_c|=n,则上式可以改写为
(μ^c,Σ^c)=argmin(μc,Σc)∑i=1n[12log∣Σc∣+12(xi−μc)TΣc−1(xi−μc)]=argmin(μc,Σc)n2log∣Σc∣+∑i=1n12(xi−μc)TΣc−1(xi−μc)\begin{aligned} (\hat{\boldsymbol{\mu}}_{c}, \hat{\boldsymbol{\Sigma}}_{c})&=\underset{(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_c)}{\arg \min }\sum_{i=1}^{n} \left[\frac{1}{2}\log|\boldsymbol{\Sigma}_c|+\frac{1}{2}(\boldsymbol{x}_{i}-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}_{i}-\boldsymbol{\mu}_c)\right]\\ &=\underset{(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_c)}{\arg \min }\frac{n}{2}\log|\boldsymbol{\Sigma}_c|+\sum_{i=1}^{n}\frac{1}{2}(\boldsymbol{x}_i-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}_i-\boldsymbol{\mu}_c)\\ \end{aligned}
为了便于分别求解μ^c\hat{\boldsymbol{\mu}}_{c}和Σ^c\hat{\boldsymbol{\Sigma}}_{c},在这里我们根据式xTAx=tr(AxxT),xˉ=1n∑i=1nxi\boldsymbol{x}^{\mathrm{T}}\mathbf{A}\boldsymbol{x}=\operatorname{tr}(\mathbf{A}\boldsymbol{x}\boldsymbol{x}^{\mathrm{T}}),\bar{\boldsymbol{x}}=\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{x}_i将上式中的最后一项作如下恒等变形:
∑i=1n12(xi−μc)TΣc−1(xi−μc)=12tr[Σc−1∑i=1n(xi−μc)(xi−μc)T]=12tr[Σc−1∑i=1n(xixiT−xiμcT−μcxiT+μcμcT)]=12tr[Σc−1(∑i=1nxixiT−nxˉμcT−nμcxˉT+nμcμcT)]=12tr[Σc−1(∑i=1nxixiT−2nxˉμcT+nμcμcT+2nxˉxˉT−2nxˉxˉT)]=12tr[Σc−1((∑i=1nxixiT−2nxˉxˉT+nxˉxˉT)+(nμcμcT−2nxˉμcT+nxˉxˉT))]=12tr[Σc−1(∑i=1n(xi−xˉ)(xi−xˉ)T+∑i=1n(μc−xˉ)(μc−xˉ)T)]=12tr[Σc−1∑i=1n(xi−xˉ)(xi−xˉ)T]+12tr[Σc−1∑i=1n(μc−xˉ)(μc−xˉ)T]=12tr[Σc−1∑i=1n(xi−xˉ)(xi−xˉ)T]+12tr[n⋅Σc−1(μc−xˉ)(μc−xˉ)T]=12tr[Σc−1∑i=1n(xi−xˉ)(xi−xˉ)T]+n2tr[Σc−1(μc−xˉ)(μc−xˉ)T]=12tr[Σc−1∑i=1n(xi−xˉ)(xi−xˉ)T]+n2(μc−xˉ)TΣc−1(μc−xˉ)\begin{aligned} &\sum_{i=1}^{n}\frac{1}{2}(\boldsymbol{x}_i-\boldsymbol{\mu}_c)^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{x}_i-\boldsymbol{\mu}_c)\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\boldsymbol{\mu}_c)(\boldsymbol{x}_i-\boldsymbol{\mu}_c)^{\mathrm{T}}\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\sum_{i=1}^{n}\left(\boldsymbol{x}_i\boldsymbol{x}_i^{\mathrm{T}}-\boldsymbol{x}_i\boldsymbol{\mu}_c^{\mathrm{T}}-\boldsymbol{\mu}_c\boldsymbol{x}_i^{\mathrm{T}}+\boldsymbol{\mu}_c\boldsymbol{\mu}_c^{\mathrm{T}}\right)\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\left(\sum_{i=1}^{n}\boldsymbol{x}_i\boldsymbol{x}_i^{\mathrm{T}}-n\bar{\boldsymbol{x}}\boldsymbol{\mu}_c^{\mathrm{T}}-n\boldsymbol{\mu}_c\bar{\boldsymbol{x}}^{\mathrm{T}}+n\boldsymbol{\mu}_c\boldsymbol{\mu}_c^{\mathrm{T}}\right)\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\left(\sum_{i=1}^{n}\boldsymbol{x}_i\boldsymbol{x}_i^{\mathrm{T}}-2n\bar{\boldsymbol{x}}\boldsymbol{\mu}_c^{\mathrm{T}}+n\boldsymbol{\mu}_c\boldsymbol{\mu}_c^{\mathrm{T}}+2n\bar{\boldsymbol{x}}\bar{\boldsymbol{x}}^{\mathrm{T}}-2n\bar{\boldsymbol{x}}\bar{\boldsymbol{x}}^{\mathrm{T}}\right)\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\left(\left(\sum_{i=1}^{n}\boldsymbol{x}_i\boldsymbol{x}_i^{\mathrm{T}}-2n\bar{\boldsymbol{x}}\bar{\boldsymbol{x}}^{\mathrm{T}}+n\bar{\boldsymbol{x}}\bar{\boldsymbol{x}}^{\mathrm{T}}\right)+\left(n\boldsymbol{\mu}_c\boldsymbol{\mu}_c^{\mathrm{T}}-2n\bar{\boldsymbol{x}}\boldsymbol{\mu}_c^{\mathrm{T}}+n\bar{\boldsymbol{x}}\bar{\boldsymbol{x}}^{\mathrm{T}}\right)\right)\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\left(\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}}+\sum_{i=1}^{n}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})^{\mathrm{T}}\right)\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]+\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\sum_{i=1}^{n}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]+\frac{1}{2}\operatorname{tr}\left[n\cdot\boldsymbol{\Sigma}_c^{-1}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]+\frac{n}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]\\ =&\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_c^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]+\frac{n}{2}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}}) \end{aligned}
所以
(μ^c,Σ^c)=argmin(μc,Σc)n2log∣Σc∣+12tr[Σc−1∑i=1n(xi−xˉ)(xi−xˉ)T]+n2(μc−xˉ)TΣc−1(μc−xˉ)(\hat{\boldsymbol{\mu}}_{c}, \hat{\boldsymbol{\Sigma}}_{c})=\underset{(\boldsymbol{\mu}_{c},\boldsymbol{\Sigma}_c)}{\arg \min }\frac{n}{2}\log|\boldsymbol{\Sigma}_c|+\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_{c}^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]+\frac{n}{2}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})^{\mathrm{T}} \boldsymbol{\Sigma}_c^{-1}(\boldsymbol{\mu}_c-\bar{\boldsymbol{x}})
观察上式可知,由于此时Σc−1\boldsymbol{\Sigma}_c^{-1}和Σc\boldsymbol{\Sigma}_c一样均为正定矩阵,所以当μc−xˉ≠0\boldsymbol{\mu}_c-\bar{\boldsymbol{x}}\neq\boldsymbol{0}时,上式最后一项为正定二次型。根据正定二次型的性质可知,此时上式最后一项的取值仅与μc−xˉ\boldsymbol{\mu}_c-\bar{\boldsymbol{x}}相关,并有当且仅当μc−xˉ=0\boldsymbol{\mu}_c-\bar{\boldsymbol{x}}=\boldsymbol{0}时,上式最后一项取最小值0,此时可以解得
μ^c=xˉ=1n∑i=1nxi\hat{\boldsymbol{\mu}}_{c}=\bar{\boldsymbol{x}}=\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{x}_i
将求解出来的μ^c\hat{\boldsymbol{\mu}}_{c}代回参数求解式可得新的参数求解式,有
Σ^c=argminΣcn2log∣Σc∣+12tr[Σc−1∑i=1n(xi−xˉ)(xi−xˉ)T]\hat{\boldsymbol{\Sigma}}_{c}=\underset{\boldsymbol{\Sigma}_c}{\arg \min }\frac{n}{2}\log|\boldsymbol{\Sigma}_c|+\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}_{c}^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}}\right]
此时的参数求解式是仅与Σc\boldsymbol{\Sigma}_c相关的函数。
为了求解Σ^c\hat{\boldsymbol{\Sigma}}_{c},在这里我们不加证明地给出一个引理:设B\mathbf{B}为pp阶正定矩阵,n>0n>0为实数,在对所有pp阶正定矩阵Σ\boldsymbol{\Sigma}有
n2log∣Σ∣+12tr[Σ−1B]≥n2log∣B∣+pn2(1−logn)\frac{n}{2}\log|\boldsymbol{\Sigma}|+\frac{1}{2}\operatorname{tr}\left[\boldsymbol{\Sigma}^{-1}\mathbf{B}\right]\geq\frac{n}{2}\log|\mathbf{B}|+\frac{pn}{2}(1-\log n)
当且仅当Σ=1nB\boldsymbol{\Sigma}=\frac{1}{n}\mathbf{B}时等号成立。(引理的证明可搜索张伟平老师的"多元正态分布参数的估计和数据的清洁与变换"课件)
根据此引理可知,当且仅当Σc=1n∑i=1n(xi−xˉ)(xi−xˉ)T\boldsymbol{\Sigma}_c=\frac{1}{n}\sum_{i=1}^{n}(\boldsymbol{x}_i-\bar{\boldsymbol{x}})(\boldsymbol{x}_i-\bar{\boldsymbol{x}})^{\mathrm{T}} 时,上述参数求解式中argmin\arg \min后面的式子取到最小值,那么此时的Σc\boldsymbol{\Sigma}_c即我们想要求解的Σ^c\hat{\boldsymbol{\Sigma}}_{c}。
7.3 朴素贝叶斯分类器
7.3.1 式(7.16)和式(7.17)的解释
该式是基于大数定律的频率近似概率的思路,而该思路的本质仍然是极大似然估计,下面举例说明。以掷硬币为例,假设投掷硬币5次,结果依次是正面、正面、反面、正面、反面,试基于此观察结果估计硬币正面朝上的概率。
设硬币正面朝上的概率为θ\theta,其服从伯努利分布,因此反面朝上的概率为1−θ1-\theta,同时设每次投掷结果相互独立,即独立同分布,则似然为
L(θ)=θ⋅θ⋅(1−θ)⋅θ⋅(1−θ)=θ3(1−θ)2\begin{aligned} L(\theta)&=\theta\cdot\theta\cdot(1-\theta)\cdot\theta\cdot(1-\theta)\\ &=\theta^{3}(1-\theta)^2 \end{aligned}
对数似然为
LL(θ)=lnL(θ)=3lnθ+2ln(1−θ)LL(\theta)=\ln L(\theta)=3\ln\theta+2\ln (1-\theta)
易证LL(θ)LL(\theta)是关于θ\theta的凹函数,因此对其求一阶导并令导数等于零即可求出最大值点,具体地
∂LL(θ)∂θ=∂(3lnθ+2ln(1−θ))∂θ=3θ−21−θ=3−5θθ(1−θ)\begin{aligned} \frac{\partial LL(\theta)}{\partial\theta}&=\frac{\partial\left(3\ln\theta+2\ln (1-\theta)\right)}{\partial\theta}\\ &=\frac{3}{\theta}-\frac{2}{1-\theta}\\ &=\frac{3-5\theta}{\theta(1-\theta)} \end{aligned}
令上式等于0可解得θ=35\theta=\frac{3}{5},显然35\frac{3}{5}也是正面出现的频率。
7.3.2 式(7.18)的解释
该式所表示的正态分布并不一定是标准正态分布,因此p(xi∣c)p(x_i|c)的取值并不一定在(0,1)(0,1)之间,但是仍然不妨碍其用作"概率",因为根据朴素贝叶斯的算法原理可知,只需p(xi∣c)p(x_i|c)的值仅仅是用来比大小,因此只关心相对值而不关心绝对值。
7.3.3 贝叶斯估计
贝叶斯学派视角下的一类点估计法称为贝叶斯估计[1],常用的贝叶斯估计有最大后验估计(Maximum A Posteriori Estimation,简称MAP)、后验中位数估计和后验期望值估计这3种参数估计方法,下面给出这3种方法的具体定义。
设总体的概率质量函数(若总体的分布为连续型时则改为概率密度函数,此处以离散型为例) 为P(x∣θ)P(x|\theta),从该总体中抽取出的nn个独立同分布的样本构成样本集 D={x1,x2,⋯ ,xn}D=\{x_1,x_2,\cdots,x_n\},则根据贝叶斯式可得,在给定样本集DD的条件下,θ\theta的条件概率为
P(θ∣D)=P(D∣θ)P(θ)P(D)=P(D∣θ)P(θ)∑θP(D∣θ)P(θ)P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}=\frac{P(D|\theta)P(\theta)} {\sum_{\theta}P(D|\theta)P(\theta)}
其中P(D∣θ)P(D|\theta)为似然函数,由于样本集DD中的样本是独立同分布的,所以似然函数可以进一步展开,有
P(θ∣D)=P(D∣θ)P(θ)∑θP(D∣θ)P(θ)=∏i=1nP(xi∣θ)P(θ)∑θ∏i=1nP(xi∣θ)P(θ)P(\theta|D)=\frac{P(D|\theta)P(\theta)} {\sum_{\theta}P(D|\theta)P(\theta)}=\frac{\prod_{i=1}^{n}P(x_i|\theta) P(\theta)}{\sum_{\theta}\prod_{i=1}^{n}P(x_i|\theta)P(\theta)}
根据贝叶斯学派的观点,此条件概率代表了我们在已知样本集DD后对θ\theta产生的新的认识,它综合了我们对θ\theta主观预设的先验概率P(θ)P(\theta)和样本集DD带来的信息,通常称其为θ\theta的后验概率。
贝叶斯学派认为,在得到P(θ∣D)P(\theta|D)以后,对参数θ\theta的任何统计推断,都只能基于P(θ∣D)P(\theta|D)。至于具体如何去使用它,可以结合某种准则一起去进行,统计学家也有一定的自由度。 对于点估计来说, 求使得P(θ∣D)P(\theta|D)达到最大值的θ^MAP\hat{\theta}_{\mathrm{MAP}}作为θ\theta的估计称为最大后验估计,求P(θ∣D)P(\theta|D)的中位数θ^Median\hat{\theta}_{\mathrm{Median}}作为θ\theta的估计称为后验中位数估计,求P(θ∣D)P(\theta|D)的期望值(均值)θ^Mean\hat{\theta}_{\mathrm{ Mean}}作为θ\theta的估计称为后验期望值估计。
7.3.4 Categorical分布
Categorical分布又称为广义伯努利分布,是将伯努利分布中的随机变量可取值个数由两个泛化为多个得到的分布。具体地,设离散型随机变量XX共有kk种可能的取值{x1,x2,⋯ ,xk}\{x_1,x_2,\cdots,x_k\},且XX取到每个值的概率分别为P(X=x1)=θ1,P(X=x2)=θ2,⋯ ,P(X=xk)=θkP(X=x_1)=\theta_1,P(X=x_2)=\theta_2,\cdots,P(X=x_k)=\theta_k,则称随机变量XX服从参数为θ1,θ2,⋯ ,θk\theta_1,\theta_2,\cdots,\theta_k的Categorical分布,其概率质量函数为
P(X=xi)=p(xi)=θiP(X=x_i)=p(x_i)=\theta_i
7.3.5 Dirichlet分布
类似于Categorical分布是伯努利分布的泛化形式,Dirichlet分布是Beta分布的泛化形式。对于一个kk维随机变量x=(x1,x2,⋯ ,xk)∈Rk\boldsymbol{x}=(x_1,x_2,\cdots,x_k)\in \mathbb{R}^{k},其中xi(i=1,2,⋯ ,k)x_i(i=1,2,\cdots,k)满足0⩽xi⩽1,∑i=1kxi=10\leqslant x_i \leqslant 1,\sum_{i=1}^{k}x_i=1,若x\boldsymbol{x}服从参数为α=(α1,α2,⋯ ,αk)∈Rk\boldsymbol{\alpha}=(\alpha_1,\alpha_2,\cdots,\alpha_k)\in \mathbb{R}^{k}的Dirichlet分布,则其概率密度函数为
p(x;α)=Γ(∑i=1kαi)∏i=1kΓ(αi)∏i=1kxiαi−1p(\boldsymbol{x};\boldsymbol{\alpha})=\frac{\Gamma \left(\sum _{i=1}^{k}\alpha _{i}\right)} {\prod _{i=1}^{k}\Gamma (\alpha _{i})}\prod _{i=1}^{k}x_{i}^{\alpha _{i}-1}
其中Γ(z)=∫0∞xz−1e−xdx\Gamma (z)=\int _{0}^{\infty }x^{z-1}e^{-x}{\rm d}x为Gamma函数,当α=(1,1,⋯ ,1)\boldsymbol{\alpha}=(1,1,\cdots,1)时,Dirichlet分布等价于均匀分布。
7.3.6 式(7.19)和式(7.20)的推导
从贝叶斯估计的角度来说,拉普拉斯修正就等价于先验概率为Dirichlet分布的后验期望值估计。为了接下来的叙述方便,我们重新定义一下相关数学符号。
设有包含mm个独立同分布样本的训练集DD,DD中可能的类别数为kk,其类别的具体取值范围为{c1,c2,...,ck}\{c_1,c_2,...,c_k\}。若令随机变量CC表示样本所属的类别,且CC取到每个值的概率分别为P(C=c1)=θ1,P(C=c2)=θ2,...,P(C=ck)=θkP(C=c_1)=\theta_1,P(C=c_2)=\theta_2,...,P(C=c_k)=\theta_k,那么显然CC服从参数为θ=(θ1,θ2,...,θk)∈Rk\boldsymbol{\theta}=(\theta_1,\theta_2,...,\theta_k)\in\mathbb{R}^{k}的Categorical分布,其概率质量函数为
P(C=ci)=P(ci)=θiP(C=c_i)=P(c_i)=\theta_i
其中P(ci)=θiP(c_i)=\theta_i就是式(7.9)所要求解的P^(c)\hat{P}(c),下面我们用贝叶斯估计中的后验期望值估计来估计θi\theta_i。根据贝叶斯估计的原理可知,在进行参数估计之前,需要先主观预设一个先验概率P(θ)P(\boldsymbol{\theta}),通常为了方便计算后验概率P(θ∣D)P(\boldsymbol{\theta}|D),我们会用似然函数P(D∣θ)P(D|\boldsymbol{\theta})的共轭先验作为我们的先验概率。显然,此时的似然函数P(D∣θ)P(D|\boldsymbol{\theta})是一个基于Categorical分布的似然函数,而Categorical分布的共轭先验为Dirichlet分布,所以只需要预设先验概率P(θ)P(\boldsymbol{\theta})为Dirichlet分布,然后使用后验期望值估计就能估计出θi\theta_i。
具体地,记DD中样本类别取值为cic_i的样本个数为yiy_i,则似然函数P(D∣θ)P(D|\boldsymbol{\theta})可展开为
P(D∣θ)=θ1y1...θkyk=∏i=1kθiyiP(D|\boldsymbol{\theta})=\theta_1^{y_1}...\theta_k^{y_k}=\prod_{i=1}^{k}\theta_i^{y_i}
则有后验概率
P(θ∣D)=P(D∣θ)P(θ)P(D)=P(D∣θ)P(θ)∑θP(D∣θ)P(θ)=∏i=1kθiyi⋅P(θ)∑θ[∏i=1kθiyi⋅P(θ)]\begin{aligned} P(\boldsymbol{\theta}|D)&=\frac{P(D|\boldsymbol{\theta})P(\boldsymbol{\theta})}{P(D)}\\ &=\frac{P(D|\boldsymbol{\theta})P(\boldsymbol{\theta})}{\sum_{\boldsymbol{\theta}} P(D|\boldsymbol{\theta})P(\boldsymbol{\theta})}\\ &=\frac{\prod_{i=1}^{k}\theta_i^{y_i}\cdot P(\boldsymbol{\theta})}{\sum_{\boldsymbol{\theta}}\left[\prod_{i=1}^{k}\theta_i^{y_i}\cdot P(\boldsymbol{\theta})\right]} \end{aligned}
假设此时先验概率P(θ)P(\boldsymbol{\theta})是参数为α=(α1,α2,...,αk)∈Rk\boldsymbol{\alpha}=(\alpha_1,\alpha_2,...,\alpha_k)\in\mathbb{R}^{k}的Dirichlet分布,则P(θ)P(\boldsymbol{\theta})可写为
P(θ;α)=Γ(∑i=1kαi)∏i=1kΓ(αi)∏i=1kθiαi−1P(\boldsymbol{\boldsymbol{\theta}};\boldsymbol{\alpha})=\frac{\Gamma \left(\sum_{i=1}^{k}\alpha_{i}\right)}{\prod_{i=1}^{k}\Gamma (\alpha_{i})}\prod_{i=1}^{k}\theta_{i}^{\alpha_{i}-1}
将其代入P(D∣θ)P(D|\boldsymbol{\theta})可得
P(θ∣D)=∏i=1kθiyi⋅P(θ)∑θ[∏i=1kθiyi⋅P(θ)]=∏i=1kθiyi⋅Γ(∑i=1kαi)∏i=1kΓ(αi)∏i=1kθiαi−1∑θ[∏i=1kθiyi⋅Γ(∑i=1kαi)∏i=1kΓ(αi)∏i=1kθiαi−1]=∏i=1kθiyi⋅Γ(∑i=1kαi)∏i=1kΓ(αi)∏i=1kθiαi−1∑θ[∏i=1kθiyi⋅∏i=1kθiαi−1]⋅Γ(∑i=1kαi)∏i=1kΓ(αi)=∏i=1kθiyi⋅∏i=1kθiαi−1∑θ[∏i=1kθiyi⋅∏i=1kθiαi−1]=∏i=1kθiαi+yi−1∑θ[∏i=1kθiαi+yi−1]\begin{aligned} P(\boldsymbol{\theta}|D)&=\dfrac{\prod_{i=1}^{k}\theta_i^{y_i} \cdot P(\boldsymbol{\theta})}{\sum_{\boldsymbol{\theta}} \left[\prod_{i=1}^{k}\theta_i^{y_i}\cdot P(\boldsymbol{\theta})\right]} \\ &=\dfrac{\prod_{i=1}^{k}\theta_i^{y_i}\cdot \dfrac{\Gamma \left(\sum _{i=1}^{k}\alpha _{i}\right)} {\prod _{i=1}^{k}\Gamma (\alpha _{i})}\prod _{i=1}^{k} \theta_{i}^{\alpha _{i}-1}}{\sum_{\boldsymbol{\theta}}\left[\prod_{i=1}^{k}\theta_i^{y_i} \cdot \dfrac{\Gamma \left(\sum _{i=1}^{k}\alpha _{i}\right)}{\prod _{i=1}^{k} \Gamma (\alpha _{i})}\prod _{i=1}^{k}\theta_{i}^{\alpha _{i}-1}\right]} \\ &=\dfrac{\prod_{i=1}^{k}\theta_i^{y_i}\cdot \dfrac{\Gamma \left(\sum _{i=1}^{k}\alpha _{i}\right)} {\prod _{i=1}^{k}\Gamma (\alpha _{i})}\prod _{i=1}^{k}\theta_{i}^{\alpha _{i}-1}} {\sum_{\boldsymbol{\theta}}\left[\prod_{i=1}^{k}\theta_i^{y_i}\cdot \prod _{i=1}^{k}\theta_{i}^{\alpha _{i}-1}\right]\cdot \dfrac{\Gamma \left(\sum _{i=1}^{k}\alpha _{i}\right)}{\prod _{i=1}^{k}\Gamma (\alpha _{i})}} \\ &=\dfrac{\prod_{i=1}^{k}\theta_i^{y_i}\cdot \prod _{i=1}^{k}\theta_{i}^{\alpha _{i}-1}} {\sum_{\boldsymbol{\theta}}\left[\prod_{i=1}^{k}\theta_i^{y_i}\cdot \prod _{i=1}^{k}\theta_{i}^{\alpha _{i}-1}\right]} \\ &=\dfrac{\prod_{i=1}^{k}\theta_i^{\alpha_{i}+y_i-1}}{\sum_{\boldsymbol{\theta}} \left[\prod_{i=1}^{k}\theta_i^{\alpha_{i}+y_i-1}\right]} \end{aligned}
此时若设α+y=(α1+y1,α2+y2,...,αk+yk)∈Rk\boldsymbol{\alpha}+\boldsymbol{y}=(\alpha_1+y_1,\alpha_2+y_2,...,\alpha_k+y_k)\in \mathbb{R}^{k},则根据Dirichlet分布的定义可知
P(θ;α+y)=Γ(∑i=1k(αi+yi))∏i=1kΓ(αi+yi)∏i=1kθiαi+yi−1∑θP(θ;α+y)=∑θΓ(∑i=1k(αi+yi))∏i=1kΓ(αi+yi)∏i=1kθiαi+yi−11=∑θΓ(∑i=1k(αi+yi))∏i=1kΓ(αi+yi)∏i=1kθiαi+yi−11=Γ(∑i=1k(αi+yi))∏i=1kΓ(αi+yi)∑θ[∏i=1kθiαi+yi−1]1∑θ[∏i=1kθiαi+yi−1]=Γ(∑i=1k(αi+yi))∏i=1kΓ(αi+yi)\begin{aligned} P(\boldsymbol{\theta};\boldsymbol{\alpha}+\boldsymbol{y})&= \dfrac{\Gamma \left(\sum _{i=1}^{k}(\alpha_{i}+y_i)\right)}{\prod _{i=1}^{k}\Gamma (\alpha_{i}+y_i)}\prod _{i=1}^{k}\theta_{i}^{\alpha_{i}+y_i-1} \\ \sum_{\boldsymbol{\theta}}P(\boldsymbol{\theta};\boldsymbol{\alpha}+\boldsymbol{y})&=\sum_{\boldsymbol{\theta}}\frac{\Gamma \left(\sum _{i=1}^{k}(\alpha_{i}+y_i)\right)}{\prod _{i=1}^{k}\Gamma (\alpha_{i}+y_i)}\prod _{i=1}^{k}\theta_{i}^{\alpha_{i}+y_i-1}\\ 1&=\sum_{\boldsymbol{\theta}}\frac{\Gamma \left(\sum _{i=1}^{k}(\alpha_{i}+y_i)\right)}{\prod _{i=1}^{k}\Gamma (\alpha_{i}+y_i)}\prod _{i=1}^{k}\theta_{i}^{\alpha_{i}+y_i-1} \\ 1&=\frac{\Gamma \left(\sum _{i=1}^{k}(\alpha_{i}+y_i)\right)}{\prod _{i=1}^{k}\Gamma (\alpha_{i}+y_i)}\sum_{\boldsymbol{\theta}}\left[\prod _{i=1}^{k}\theta_{i}^{\alpha_{i}+y_i-1}\right] \\ \frac{1}{\sum_{\boldsymbol{\theta}}\left[\prod _{i=1}^{k}\theta_{i}^{\alpha_{i}+y_i-1}\right]}&=\frac{\Gamma \left(\sum _{i=1}^{k}(\alpha_{i}+y_i)\right)}{\prod _{i=1}^{k}\Gamma (\alpha_{i}+y_i)} \\ \end{aligned}
将此结论代入P(D∣θ)P(D|\boldsymbol{\theta})可得
P(θ∣D)=∏i=1kθiαi+yi−1∑θ[∏i=1kθiαi+yi−1]=Γ(∑i=1k(αi+yi))∏i=1kΓ(αi+yi)∏i=1kθiαi+yi−1=P(θ;α+y)\begin{aligned} P(\boldsymbol{\theta}|D)&=\frac{\prod_{i=1}^{k}\theta_i^{\alpha_{i}+y_i-1}}{\sum_{\boldsymbol{\theta}}\left[\prod_{i=1}^{k}\theta_i^{\alpha_{i}+y_i-1}\right]}\\ &=\frac{\Gamma \left(\sum _{i=1}^{k}(\alpha_{i}+y_i)\right)}{\prod _{i=1}^{k}\Gamma (\alpha_{i}+y_i)}\prod _{i=1}^{k}\theta_{i}^{\alpha _{i}+y_i-1} \\ &=P(\boldsymbol{\theta};\boldsymbol{\alpha}+\boldsymbol{y}) \end{aligned}
综上可知,对于服从Categorical分布的θ\boldsymbol{\theta}来说,假设其先验概率P(θ)P(\boldsymbol{\theta})是参数为α\boldsymbol{\alpha}的Dirichlet分布时,得到的后验概率P(θ∣D)P(\boldsymbol{\theta}|D)是参数为α+y\boldsymbol{\alpha}+\boldsymbol{y}的Dirichlet分布,通常我们称这种先验概率分布和后验概率分布形式相同的这对分布为共轭分布。在推得后验概率P(θ∣D)P(\boldsymbol{\theta}|D)的具体形式以后,根据后验期望值估计可得θi\theta_i的估计值为
θi=EP(θ∣D)[θi]=EP(θ;α+y)[θi]=αi+yi∑j=1k(αj+yj)=αi+yi∑j=1kαj+∑j=1kyj=αi+yi∑j=1kαj+m\begin{aligned} \theta_i&=\mathbb E_{P(\boldsymbol{\theta}|D)}[\theta_i]\\ &=\mathbb E_{P(\boldsymbol{\theta};\boldsymbol{\alpha}+\boldsymbol{y})}[\theta_i]\\ &=\frac{\alpha_i+y_i}{\sum_{j=1}^k(\alpha_j+y_j)}\\ &=\frac{\alpha_i+y_i}{\sum_{j=1}^k\alpha_j+\sum_{j=1}^ky_j}\\ &=\frac{\alpha_i+y_i}{\sum_{j=1}^k\alpha_j+m}\\ \end{aligned}
显然,式(7.9)是当α=(1,1,...,1)\boldsymbol{\alpha}=(1,1,...,1)时推得的具体结果,此时等价于我们主观预设的先验概率P(θ)P(\boldsymbol{\theta})服从均匀分布,此即拉普拉斯修正。同理,当我们调整α\boldsymbol{\alpha}的取值后,即可推得其他数据平滑的公式。
7.4 半朴素贝叶斯分类器
7.4.1 式(7.21)的解释
在朴素贝叶斯中求解P(xi∣c)P(x_i|c)时,先挑出类别为cc的样本,若是离散属性则按大数定律估计P(xi∣c)P(x_i|c),若是连续属性则求这些样本的均值和方差,接着按正态分布估计P(xi∣c)P(x_i|c)。现在估计P(xi∣c,pai)P(x_i|c,pa_i),则是先挑出类别为cc且属性xix_i所依赖的属性为paipa_i的样本,剩下步骤与估计P(xi∣c)P(x_i|c)时相同。
7.4.2 式(7.22)的解释
该式写为如下形式可能更容易理解:
I(xi,xj∣y)=∑n=1NP(xi,xj∣cn)logP(xi,xj∣cn)P(xi∣cn)P(xj∣cn)I(x_i,x_j|y)=\sum_{n=1}^{N}P(x_i,x_j|c_n)\log\frac{P(x_i,x_j|c_n)}{P(x_i|c_n)P(x_j|c_n)}
其中i,j=1,2,...,di,j=1,2,...,d且i≠ji\neq j,NN为类别个数。该式共可得到d(d−1)2\frac{d(d-1)}{2}个I(xi,xj∣y)I(x_i,x_j|y),即每对(xi,xj)(x_i,x_j)均有一个条件互信息I(xi,xj∣y)I(x_i,x_j|y)。
7.4.3 式(7.23)的推导
基于贝叶斯定理,式(7.8)将联合概率P(x,c)P(\boldsymbol{x},c)写为等价形式P(x∣c)P(c)P(\boldsymbol{x}|c)P(c),实际上,也可将向量x\boldsymbol{x}拆开,把P(x,c)P(\boldsymbol{x},c)写为P(x1,x2,...,xd,c)P(x_1,x_2,...,x_d,c)形式,然后利用概率公式P(A,B)=P(A∣B)P(B)P(A,B)=P(A|B)P(B)对其恒等变形
P(x,c)=P(x1,x2,…,xd,c)=P(x1,x2,…,xd∣c)P(c)=P(x1,…,xi−1,xi+1,…,xd∣c,xi)P(c,xi)\begin{aligned} P(\boldsymbol{x}, c) & =P\left(x_1, x_2, \ldots, x_d, c\right) \\ & =P\left(x_1, x_2, \ldots, x_d \mid c\right) P(c) \\ & =P\left(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d \mid c, x_i\right) P\left(c, x_i\right) \end{aligned}
类似式(7.14)采用属性条件独立性假设,则
P(x1,...,xi−1,xi+1,...,xd∣c,xi)=∏j=1j≠idP(xj∣c,xi)P(x_1,...,x_{i-1},x_{i+1},...,x_d|c,x_i)=\prod_{j=1\\j\neq i}^{d}P(x_j|c,x_i)
根据式(7.25)可知,当j=ij=i时,∣Dc,xi∣=∣Dc,xi,xj∣|D_{c,x_i}|=|D_{c,x_i,x_j}|,若不考虑平滑项,则此时P(xj∣c,xi)=1P(x_j|c,x_i)=1,因此在上式的连乘项中可放开j≠ij\neq i的约束,即
P(x1,...,xi−1,xi+1,...,xd∣c,xi)=∏j=1dP(xj∣c,xi)P(x_1,...,x_{i-1},x_{i+1},...,x_d|c,x_i)=\prod_{j=1}^{d}P(x_j|c,x_i)
综上可得:
P(c∣x)=P(x,c)P(x)=P(c,xi)P(x1,…,xi−1,xi+1,…,xd∣c,xi)P(x)∝P(c,xi)P(x1,…,xi−1,xi+1,…,xd∣c,xi)=P(c,xi)∏j=1dP(xj∣c,xi)\begin{aligned} P(c|\boldsymbol{x})&=\frac{P(\boldsymbol{x},c)}{P(\boldsymbol{x})}\\ &=\frac{P\left(c, x_i\right)P\left(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d \mid c, x_i\right)}{P(\boldsymbol{x})}\\ &\propto P\left(c, x_i\right)P\left(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d \mid c, x_i\right) \\ &=P\left(c, x_i\right)\prod_{j=1}^{d}P(x_j|c,x_i) \end{aligned}
上式是将属性xix_i作为超父属性的,AODE尝试将每个属性作为超父来构建SPODE,然后将那些具有足够训练数据支撑的SPODE集成起来作为最终结果。具体来说,对于总共dd个属性来说,共有dd个不同的上式,集成直接求和即可,因为对于不同的类别标记cc均有dd个不同的上式,至于如何满足"足够训练数据支撑的SPODE"这个条件,注意式(7.24)和式(7.25)均使用到了∣Dc,xi∣|D_{c,x_i}|和∣Dc,xi,xj∣|D_{c,x_i,x_j}|,若集合DxiD_{x_i}中样本数量过少,则∣Dc,xi∣|D_{c,x_i}|和∣Dc,xi,xj∣|D_{c,x_i,x_j}|将会更小,因此在式(7.23)中要求集合DxiD_{x_i}中样本数量不少于m′m^{\prime}。
7.4.4 式(7.24)和式(7.25)的推导
类比式(7.19)和式(7.20)的推导。
7.5 贝叶斯网
7.5.1 式(7.27)的解释
在这里补充一下同父结构和顺序结构的推导。同父结构:在给定父节点x1x_1的条件下x3,x4x_3,x_4独立
P(x3,x4∣x1)=P(x1,x3,x4)P(x1)=P(x1)P(x3∣x1)P(x4∣x1)P(x1)=P(x3∣x1)P(x4∣x1)\begin{aligned} P(x_3,x_4|x_1)&=\frac{P(x_1,x_3,x_4)}{P(x_1)} \\ &=\frac{P(x_1)P(x_3|x_1)P(x_4|x_1)}{P(x_1)} \\ &=P(x_3|x_1)P(x_4|x_1) \\ \end{aligned}
顺序结构:在给定节点xx的条件下y,zy,z独立
P(y,z∣x)=P(x,y,z)P(x)=P(z)P(x∣z)P(y∣x)P(x)=P(z,x)P(y∣x)P(x)=P(z∣x)P(y∣x)\begin{aligned} P(y,z|x)&=\frac{P(x,y,z)}{P(x)} \\ &=\frac{P(z)P(x|z)P(y|x)}{P(x)} \\ &=\frac{P(z,x)P(y|x)}{P(x)} \\ &=P(z|x)P(y|x) \\ \end{aligned}
7.6 EM算法
"西瓜书"中仅给出了EM算法的运算步骤,其原理并未展开讲解,下面补充EM算法的推导原理,以及所用到的相关数学知识。
7.6.1 Jensen不等式
若ff是凸函数,则下式恒成立
f(tx1+(1−t)x2)⩽tf(x1)+(1−t)f(x2)f\left(t x_1 + (1-t)x_2\right)\leqslant tf(x_1)+(1-t)f(x_2)
其中t∈[0,1]t\in [0,1],若将xx推广到nn个时同样成立,即
f(t1x1+t2x2+...+tnxn)⩽t1f(x1)+t2f(x2)+...+tnf(tn)f(t_1 x_1 + t_2x_2+...+t_nx_n)\leqslant t_1f(x_1)+t_2f(x_2)+...+t_nf(t_n)
其中t1,t2,...,tn∈[0,1],∑i=1nti=1t_1,t_2,...,t_n\in[0,1],\sum_{i=1}^{n}t_i=1。此不等式在概率论中通常以如下形式出现
φ(E[X])⩽E[φ(X)]\varphi(\mathbb{E}[X])\leqslant \mathbb{E}[\varphi(X)]
其中XX是随机变量,φ\varphi为凸函数,E[X]\mathbb{E}[X]为随机变量XX的期望。显然,若ff和φ\varphi是凹函数,则上述不等式中的⩽\leqslant换成⩾\geqslant也恒成立。
7.6.2 EM算法的推导
假设现有一批独立同分布的样本{x1,x2,...,xm}\{x_1,x_2,...,x_m\},它们是由某个含有隐变量的概率分布p(x,z;θ)p(x,z;\theta)生成,现尝试用极大似然估计法估计此概率分布的参数。为了便于讨论,此处假设zz为离散型随机变量,则对数似然函数为
LL(θ)=∑i=1mlnp(xi;θ)=∑i=1mln∑zip(xi,zi;θ)\begin{aligned} LL(\theta) &=\sum_{i=1}^{m} \ln p(x_i; \theta) \\ &=\sum_{i=1}^{m} \ln \sum_{z_i} p(x_i, z_i; \theta) \end{aligned}
显然,此时LL(θ)LL(\theta)里含有未知的隐变量zz以及求和项的对数,相比于不含隐变量的对数似然函数,显然该似然函数的极大值点较难求解,而EM算法则给出了一种迭代的方法来完成对LL(θ)LL(\theta)的极大化。
下面给出两种推导方法,一个是出自李航老师的《统计学习方法》[2],一个是出自吴恩达老师的CS229,两种推导方式虽然形式上有差异,但最终的QQ函数相等,接下来先讲述两种推导方法,最后会给出QQ函数是相等的证明。
首先给出《统计学习方法》中的推导方法,设X={x1,x2,...,xm},Z={z1,z2,...,zm}X=\{x_1,x_2,...,x_m\},Z=\{z_1,z_2,...,z_m\},则对数似然函数可以改写为
LL(θ)=lnP(X∣θ)=ln∑ZP(X,Z∣θ)=ln(∑ZP(X∣Z,θ)P(Z∣θ))\begin{aligned} LL(\theta)&=\ln P(X\vert \theta)\\ &=\ln \sum_Z P(X,Z\vert\theta)\\ &=\ln \left(\sum_Z P(X\vert Z,\theta)P(Z\vert \theta)\right) \end{aligned}
EM算法采用的是通过迭代逐步近似极大化L(θ)L(\theta):假设第tt次迭代时θ\theta的估计值是θ(t)\theta^{(t)},我们希望第t+1t+1次迭代时的θ\theta能使LL(θ)LL(\theta)增大,即LL(θ)>LL(θ(t))LL(\theta)>LL(\theta^{(t)})。为此,考虑两者的差
LL(θ)−LL(θ(t))=ln(∑ZP(X∣Z,θ)P(Z∣θ))−lnP(X∣θ(t))=ln(∑ZP(Z∣X,θ(t))P(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t)))−lnP(X∣θ(t))\begin{aligned} LL(\theta)-LL(\theta^{(t)})&=\ln \left(\sum_Z P(X\vert Z,\theta)P(Z\vert \theta)\right)-\ln P(X\vert\theta^{(t)}) \\ &=\ln \left(\sum_Z P(Z\vert X,\theta^{(t)}) \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})}\right)-\ln P(X\vert\theta^{(t)}) \end{aligned}
由上述Jensen不等式可得
LL(θ)−LL(θ(t))⩾∑ZP(Z∣X,θ(t))lnP(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t))−lnP(X∣θ(t))=∑ZP(Z∣X,θ(t))lnP(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t))−1⋅lnP(X∣θ(t))=∑ZP(Z∣X,θ(t))lnP(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t))−∑ZP(Z∣X,θ(t))⋅lnP(X∣θ(t))=∑ZP(Z∣X,θ(t))(lnP(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t))−lnP(X∣θ(t)))=∑ZP(Z∣X,θ(t))lnP(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t))P(X∣θ(t))\begin{aligned} LL(\theta)-LL(\theta^{(t)}) &\geqslant \sum_Z P(Z\vert X,\theta^{(t)})\ln \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})}-\ln P(X\vert\theta^{(t)}) \\ &= \sum_Z P(Z\vert X,\theta^{(t)})\ln \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})}-1\cdot \ln P(X\vert\theta^{(t)}) \\ &= \sum_Z P(Z\vert X,\theta^{(t)})\ln \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})}-\sum_Z P(Z\vert X,\theta^{(t)})\cdot \ln P(X\vert\theta^{(t)}) \\ &=\sum_Z P(Z\vert X,\theta^{(t)}) \left( \ln \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})} - \ln P(X\vert\theta^{(t)}) \right)\\ &= \sum_Z P(Z\vert X,\theta^{(t)})\ln \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})P(X\vert\theta^{(t)})} \end{aligned}
令
B(θ,θ(t))=LL(θ(t))+∑ZP(Z∣X,θ(t))lnP(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t))P(X∣θ(t))B(\theta,\theta^{(t)})=LL(\theta^{(t)})+\sum_Z P(Z\vert X,\theta^{(t)})\ln \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})P(X\vert\theta^{(t)})}
则
LL(θ)⩾B(θ,θ(t))LL(\theta)\geqslant B(\theta,\theta^{(t)})
即B(θ,θ(t))B(\theta,\theta^{(t)})是LL(θ)LL(\theta)的下界,此时若设θ(t+1)\theta^{(t+1)}能使得B(θ,θ(t))B(\theta,\theta^{(t)})达到极大,即
B(θ(t+1),θ(t))⩾B(θ,θ(t))B(\theta^{(t+1)},\theta^{(t)}) \geqslant B(\theta,\theta^{(t)})
由于LL(θ(t))=B(θ(t),θ(t))LL(\theta^{(t)})=B(\theta^{(t)},\theta^{(t)}),那么可以进一步推得
LL(θ(t+1))⩾B(θ(t+1),θ(t))⩾B(θ(t),θ(t))=LL(θ(t))LL(\theta^{(t+1)})\geqslant B(\theta^{(t+1)},\theta^{(t)})\geqslant B(\theta^{(t)},\theta^{(t)})=LL(\theta^{(t)})
LL(θ(t+1))⩾LL(θ(t))LL(\theta^{(t+1)})\geqslant LL(\theta^{(t)})
因此,任何能使得B(θ,θ(t))B(\theta,\theta^{(t)})增大的θ\theta,也可以使得LL(θ)LL(\theta)增大,于是问题就转化为了求解能使得B(θ,θ(t))B(\theta,\theta^{(t)})达到极大的θ(t+1)\theta^{(t+1)},即
θ(t+1)=argmaxθB(θ,θ(t))=argmaxθ(LL(θ(t))+∑ZP(Z∣X,θ(t))lnP(X∣Z,θ)P(Z∣θ)P(Z∣X,θ(t))P(X∣θ(t)))\begin{aligned} \theta^{(t+1)}&=\mathop{\arg\max}_{\theta}B(\theta,\theta^{(t)}) \\ &=\mathop{\arg\max}_{\theta}\left( LL(\theta^{(t)})+\sum_Z P(Z\vert X,\theta^{(t)})\ln \cfrac{P(X\vert Z,\theta)P(Z\vert \theta)}{P(Z\vert X,\theta^{(t)})P(X\vert\theta^{(t)})}\right) \end{aligned}
略去对θ\theta极大化而言是常数的项
θ(t+1)=argmaxθ(∑ZP(Z∣X,θ(t))ln(P(X∣Z,θ)P(Z∣θ)))=argmaxθ(∑ZP(Z∣X,θ(t))lnP(X,Z∣θ))=argmaxθQ(θ,θ(t))\begin{aligned} \theta^{(t+1)}&=\mathop{\arg\max}_{\theta}\left(\sum_Z P(Z\vert X,\theta^{(t)})\ln\left( P(X\vert Z,\theta)P(Z\vert \theta)\right)\right) \\ &=\mathop{\arg\max}_{\theta}\left(\sum_Z P(Z\vert X,\theta^{(t)})\ln P(X,Z\vert \theta)\right) \\ &=\mathop{\arg\max}_{\theta}Q(\theta,\theta^{(t)}) \end{aligned}
到此即完成了EM算法的一次迭代,求出的θ(t+1)\theta^{(t+1)}作为下一次迭代的初始θ(t)\theta^{(t)}。综上,EM算法的"E步"和"M步"可总结为以下两步。
E步:计算完全数据的对数似然函数lnP(X,Z∣θ)\ln P(X,Z\vert \theta)关于在给定观测数据XX和当前参数θ(t)\theta^{(t)}下对未观测数据ZZ的条件概率分布P(Z∣X,θ(t))P(Z\vert X,\theta^{(t)})的期望Q(θ,θ(t))Q(\theta,\theta^{(t)}):
Q(θ,θ(t))=EZ[lnP(X,Z∣θ)∣X,θ(t)]=∑ZP(Z∣X,θ(t))lnP(X,Z∣θ)Q(\theta,\theta^{(t)})=\mathbb{E}_Z[\ln P(X,Z\vert \theta)\vert X,\theta^{(t)}]=\sum_Z P(Z\vert X,\theta^{(t)})\ln P(X,Z\vert \theta)
M步:求使得Q(θ,θ(t))Q(\theta,\theta^{(t)})达到极大的θ(t+1)\theta^{(t+1)}。
接下来给出CS229中的推导方法,设ziz_i的概率质量函数为Qi(zi)Q_i(z_i),则LL(θ)LL(\theta)可以作如下恒等变形
LL(θ)=∑i=1mlnp(xi;θ)=∑i=1mln∑zip(xi,zi;θ)=∑i=1mln∑ziQi(zi)p(xi,zi;θ)Qi(zi)\begin{aligned} LL(\theta) &=\sum_{i=1}^{m} \ln p(x_i; \theta) \\ &=\sum_{i=1}^{m} \ln \sum_{z_i} p(x_i, z_i; \theta) \\ &=\sum_{i=1}^{m} \ln \sum_{z_i} Q_i(z_i)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)} \\ \end{aligned}
其中∑ziQi(zi)p(xi,zi;θ)Qi(zi)\sum_{z_i} Q_i(z_i)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}可以看做是对p(xi,zi;θ)Qi(zi)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}关于ziz_i求期望,即
∑ziQi(zi)p(xi,zi;θ)Qi(zi)=Ezi[p(xi,zi;θ)Qi(zi)]\sum_{z_i} Q_i(z_i)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}=\mathbb{E}_{z_i}\left[\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}\right]
由Jensen不等式可得
ln(Ezi[p(xi,zi;θ)Qi(zi)])⩾Ezi[ln(p(xi,zi;θ)Qi(zi))]\ln\left(\mathbb{E}_{z_i}\left[\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}\right]\right)\geqslant \mathbb{E}_{z_i}\left[\ln\left(\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}\right)\right]
ln∑ziQi(zi)p(xi,zi;θ)Qi(zi)⩾∑ziQi(zi)lnp(xi,zi;θ)Qi(zi)\ln\sum_{z_i} Q_i(z_i)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}\geqslant \sum_{z_i} Q_i(z_i)\ln\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}
将此式代入LL(θ)LL(\theta)可得
\begin{aligned} LL(\theta) &=\sum_{i=1}^{m} \ln \sum_{z_i} Q_i(z_i)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}\geqslant \sum_{i=1}^{m}\sum_{z_i} Q_i(z_i)\ln\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)} \quad \textcircled{1} \end{aligned}
若令B(θ)=∑i=1m∑ziQi(zi)lnp(xi,zi;θ)Qi(zi)B(\theta)=\sum\limits_{i=1}^{m}\sum\limits_{z_i} Q_i(z_i)\ln\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)},则此时B(θ)B(\theta)为LL(θ)LL(\theta)的下界函数,那么这个下界函数所能构成的最优下界是多少?即B(θ)B(\theta)的最大值是多少?显然,B(θ)B(\theta)是LL(θ)LL(\theta)的下界函数,反过来LL(θ)LL(\theta)是其上界函数,所以如果能使得B(θ)=LL(θ)B(\theta)=LL(\theta),则此时的B(θ)B(\theta)就取到了最大值。根据Jensen不等式的性质可知,如果能使得p(xi,zi;θ)Qi(zi)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}恒等于某个常量cc,大于等于号便可以取到等号。因此,只需任意选取满足p(xi,zi;θ)Qi(zi)=c\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}=c的Qi(zi)Q_i(z_i)就能使得B(θ)B(\theta)达到最大值。由于Qi(zi)Q_i(z_i)是ziz_i的概率质量函数,所以Qi(zi)Q_i(z_i)同时也满足约束0⩽Qi(zi)⩽1,∑ziQi(zi)=10\leqslant Q_i(z_i)\leqslant 1,\sum_{z_i} Q_i(z_i)=1,结合Qi(zi)Q_i(z_i)的所有约束可以推得
p(xi,zi;θ)Qi(zi)=c\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}=c
p(xi,zi;θ)=c⋅Qi(zi)p(x_i, z_i; \theta)=c\cdot Q_i(z_i)
∑zip(xi,zi;θ)=c⋅∑ziQi(zi)\sum_{z_i}p(x_i, z_i; \theta)=c\cdot \sum_{z_i}Q_i(z_i)
∑zip(xi,zi;θ)=c\sum_{z_i}p(x_i, z_i; \theta)=c
p(xi,zi;θ)Qi(zi)=∑zip(xi,zi;θ)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)}=\sum_{z_i}p(x_i, z_i; \theta)
Qi(zi)=p(xi,zi;θ)∑zip(xi,zi;θ)=p(xi,zi;θ)p(xi;θ)=p(zi∣xi;θ)Q_i(z_i)=\cfrac{p(x_i, z_i; \theta)}{\sum\limits_{z_i}p(x_i, z_i; \theta)}=\cfrac{p(x_i, z_i; \theta)}{p(x_i; \theta)}=p(z_i|x_i; \theta)
所以,当且仅当Qi(zi)=p(zi∣xi;θ)Q_i(z_i)=p(z_i|x_i; \theta)时B(θ)B(\theta)取到最大值,将Qi(zi)=p(zi∣xi;θ)Q_i(z_i)=p(z_i|x_i; \theta)代回LL(θ)LL(\theta)和B(θ)B(\theta)可以推得
\begin{aligned} LL(\theta) &=\sum_{i=1}^{m} \ln \sum_{z_i} Q_i(z_i)\cfrac{p(x_i, z_i; \theta)}{Q_i(z_i)} & \quad \textcircled{2}\\ &=\sum_{i=1}^{m} \ln \sum_{z_i}p(z_i|x_i; \theta)\cfrac{p(x_i, z_i; \theta)}{p(z_i|x_i; \theta)} & \quad \textcircled{3}\\ &=\sum_{i=1}^{m}\sum_{z_i} p(z_i|x_i; \theta)\ln\cfrac{p(x_i, z_i; \theta)}{p(z_i|x_i; \theta)} & \quad \textcircled{4}\\ &=\max\{B(\theta)\} & \quad \textcircled{5} \\ \end{aligned}
其中式\textcircled{4}是式\textcircled{1}中不等式取等号时的情形。由以上推导可知,此时对数似然函数LL(θ)LL(\theta)等价于其下界函数的最大值max{B(θ)}\max\{B(\theta)\},所以要想极大化LL(θ)LL(\theta)可以通过极大化max{B(θ)}\max\{B(\theta)\}来间接极大化LL(θ)LL(\theta),因此,下面考虑如何极大化max{B(θ)}\max\{B(\theta)\}。假设已知第tt次迭代的参数为θ(t)\theta^{(t)},而第t+1t+1次迭代的参数θ(t+1)\theta^{(t+1)}可通过如下方式求得
\begin{aligned} \theta^{(t+1)}&=\arg\max_{\theta}\max\{B(\theta)\} & \quad \textcircled{6}\\ &=\arg\max_{\theta}\sum_{i=1}^{m}\sum_{z_i} p(z_i|x_i;\theta^{(t)})\ln\cfrac{p(x_i, z_i; \theta)}{p(z_i|x_i; \theta^{(t)})} & \quad \textcircled{7}\\ &=\arg\max_{\theta}\sum_{i=1}^{m}\sum_{z_i} p(z_i|x_i;\theta^{(t)})\ln p(x_i, z_i; \theta) & \quad \textcircled{8} \end{aligned}
此时将θ(t+1)\theta^{(t+1)}代入LL(θ)LL(\theta)可推得
\begin{aligned} LL(\theta^{(t+1)}) &=\max\{B(\theta^{(t+1)})\} &\quad\textcircled{9} \\ &=\sum_{i=1}^{m}\sum_{z_i} p(z_i|x_i; \theta^{(t+1)})\ln\cfrac{p(x_i, z_i; \theta^{(t+1)})}{p(z_i|x_i; \theta^{(t+1)})} &\quad\textcircled{10}\\ &\geqslant \sum_{i=1}^{m}\sum_{z_i} p(z_i|x_i; \theta^{(t)})\ln\cfrac{p(x_i, z_i; \theta^{(t+1)})}{p(z_i|x_i; \theta^{(t)})} &\quad\textcircled{11}\\ &\geqslant \sum_{i=1}^{m}\sum_{z_i} p(z_i|x_i; \theta^{(t)})\ln\cfrac{p(x_i, z_i; \theta^{(t)})}{p(z_i|x_i; \theta^{(t)})} &\quad\textcircled{12}\\ &=\max\{B(\theta^{(t)})\} &\quad\textcircled{13} \\ &=LL(\theta^{(t)})&\quad\textcircled{14} \end{aligned}
其中,式\textcircled{9}和式\textcircled{10}分别由式\textcircled{5}和式\textcircled{4}推得,式\textcircled{11}由式\textcircled{1}推得,式\textcircled{12}由式\textcircled{7}推得,式\textcircled{13}和式\textcircled{14}由式\textcircled{2}至式\textcircled{5}推得。此时若令
Q(θ,θ(t))=∑i=1m∑zip(zi∣xi;θ(t))lnp(xi,zi;θ)Q(\theta,\theta^{(t)})=\sum_{i=1}^{m}\sum_{z_i} p(z_i|x_i; \theta^{(t)})\ln p(x_i, z_i; \theta)
由式\textcircled{9}至式\textcircled{14}可知,凡是能使得Q(θ,θ(t))Q(\theta,\theta^{(t)})达到极大的θ(t+1)\theta^{(t+1)}一定能使得LL(θ(t+1))⩾LL(θ(t))LL(\theta^{(t+1)})\geqslant LL(\theta^{(t)})。综上,EM算法的"E步"和"M步"可总结为以下两步。
E步:令Qi(zi)=p(zi∣xi;θ)Q_i(z_i)=p(z_i|x_i; \theta)并写出Q(θ,θ(t))Q(\theta,\theta^{(t)});
M步:求使得Q(θ,θ(t))Q(\theta,\theta^{(t)})到达极大的θ(t+1)\theta^{(t+1)}。
以上便是EM算法的两种推导方法,下面证明两种推导方法中的QQ函数相等。
Q(θ∣θ(t))=∑ZP(Z∣X,θ(t))lnP(X,Z∣θ)=∑z1,z2,...,zm{∏i=1mP(zi∣xi,θ(t))ln[∏i=1mP(xi,zi∣θ)]}=∑z1,z2,...,zm{∏i=1mP(zi∣xi,θ(t))[∑i=1mlnP(xi,zi∣θ)]}=∑z1,z2,...,zm{∏i=1mP(zi∣xi,θ(t))[lnP(x1,z1∣θ)+lnP(x2,z2∣θ)+...+lnP(xm,zm∣θ)]}=∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(x1,z1∣θ)]+...+∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(xm,zm∣θ)]\begin{aligned} Q(\theta|\theta^{(t)})&=\sum_Z P(Z|X,\theta^{(t)})\ln P(X,Z|\theta) \\ &=\sum_{z_1,z_2,...,z_m}\left\{\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\ln\left[ \prod_{i=1}^m P(x_i,z_i|\theta) \right] \right\} \\ &=\sum_{z_1,z_2,...,z_m}\left\{\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\left[ \sum_{i=1}^m\ln P(x_i,z_i|\theta) \right] \right\} \\ &=\sum_{z_1,z_2,...,z_m}\left\{\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\left[\ln P(x_1,z_1|\theta) + \ln P(x_2,z_2|\theta) +...+ \ln P(x_m,z_m|\theta)\right] \right\} \\ &=\sum_{z_1,z_2,...,z_m}\left[\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_1,z_1|\theta) \right]+...+\sum_{z_1,z_2,...,z_m}\left[\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_m,z_m|\theta) \right] \\ \end{aligned}
其中∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(x1,z1∣θ)]\sum\limits_{z_1,z_2,...,z_m}\left[\prod\limits_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_1,z_1|\theta) \right]可作如下恒等变形:
∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(x1,z1∣θ)]=∑z1,z2,...,zm[∏i=2mP(zi∣xi,θ(t))⋅P(z1∣x1,θ(t))⋅lnP(x1,z1∣θ)]=∑z1∑z2,...,zm[∏i=2mP(zi∣xi,θ(t))⋅P(z1∣x1,θ(t))⋅lnP(x1,z1∣θ)]=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ)∑z2,...,zm[∏i=2mP(zi∣xi,θ(t))]=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ)∑z2,...,zm[∏i=3mP(zi∣xi,θ(t))⋅P(z2∣x2,θ(t))]=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ){∑z2∑z3,...,zm[∏i=3mP(zi∣xi,θ(t))⋅P(z2∣x2,θ(t))]}=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ){∑z2P(z2∣x2,θ(t))∑z3,...,zm[∏i=3mP(zi∣xi,θ(t))]}=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ){∑z2P(z2∣x2,θ(t))×∑z3P(z3∣x3,θ(t))×...×∑zmP(zm∣xm,θ(t))}=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ)×{1×1×...×1}=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ)\begin{aligned} &\sum\limits_{z_1,z_2,...,z_m}\left[\prod\limits_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_1,z_1|\theta) \right] \\ =&\sum\limits_{z_1,z_2,...,z_m}\left[\prod_{i=2}^mP(z_i|x_i,\theta^{(t)})\cdot P(z_1|x_1,\theta^{(t)})\cdot\ln P(x_1,z_1|\theta) \right] \\ =&\sum_{z_1}\sum\limits_{z_2,...,z_m}\left[\prod_{i=2}^mP(z_i|x_i,\theta^{(t)})\cdot P(z_1|x_1,\theta^{(t)})\cdot\ln P(x_1,z_1|\theta) \right] \\ =&\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta) \sum\limits_{z_2,...,z_m}\left[\prod_{i=2}^mP(z_i|x_i,\theta^{(t)}) \right] \\ =&\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta)\sum\limits_{z_2,...,z_m}\left[\prod_{i=3}^mP(z_i|x_i,\theta^{(t)})\cdot P(z_2|x_2,\theta^{(t)}) \right] \\ =&\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta) \left\{\sum\limits_{z_2}\sum\limits_{z_3,...,z_m}\left[\prod_{i=3}^mP(z_i|x_i,\theta^{(t)})\cdot P(z_2|x_2,\theta^{(t)}) \right]\right\} \\ =&\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta) \left\{\sum_{z_2}P(z_2|x_2,\theta^{(t)}) \sum\limits_{z_3,...,z_m}\left[\prod_{i=3}^mP(z_i|x_i,\theta^{(t)})\right]\right\} \\ =&\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta) \left\{\sum_{z_2}P(z_2|x_2,\theta^{(t)})\times\sum_{z_3}P(z_3|x_3,\theta^{(t)})\times...\times\sum_{z_m}P(z_m|x_m,\theta^{(t)})\right\} \\ =&\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta)\times \left\{1\times1\times...\times1\right\} \\ =&\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta) \\ \end{aligned}
所以
∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(x1,z1∣θ)]=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ)\sum\limits_{z_1,z_2,...,z_m}\left[\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_1,z_1|\theta) \right]=\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta)
同理可得
∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(x2,z2∣θ)]=∑z2P(z2∣x2,θ(t))lnP(x2,z2∣θ)⋮∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(xm,zm∣θ)]=∑zmP(zm∣xm,θ(t))lnP(xm,zm∣θ)\begin{aligned} \sum\limits_{z_1,z_2,...,z_m}\left[\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_2,z_2|\theta) \right] &=\sum_{z_2}P(z_2|x_2,\theta^{(t)})\ln P(x_2,z_2|\theta) \\ &\vdots\\ \sum\limits_{z_1,z_2,...,z_m}\left[\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_m,z_m|\theta) \right] &=\sum_{z_m}P(z_m|x_m,\theta^{(t)})\ln P(x_m,z_m|\theta) \end{aligned}
将上式代入Q(θ∣θ(t))Q(\theta|\theta^{(t)})可得
Q(θ∣θ(t))=∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(x1,z1∣θ)]+...+∑z1,z2,...,zm[∏i=1mP(zi∣xi,θ(t))⋅lnP(xm,zm∣θ)]=∑z1P(z1∣x1,θ(t))lnP(x1,z1∣θ)+...+∑zmP(zm∣xm,θ(t))lnP(xm,zm∣θ)=∑i=1m∑ziP(zi∣xi,θ(t))lnP(xi,zi∣θ)\begin{aligned} Q(\theta|\theta^{(t)})&=\sum_{z_1,z_2,...,z_m}\left[\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_1,z_1|\theta) \right]+...+\sum_{z_1,z_2,...,z_m}\left[\prod_{i=1}^mP(z_i|x_i,\theta^{(t)})\cdot\ln P(x_m,z_m|\theta) \right] \\ &=\sum_{z_1}P(z_1|x_1,\theta^{(t)})\ln P(x_1,z_1|\theta) +...+\sum_{z_m}P(z_m|x_m,\theta^{(t)})\ln P(x_m,z_m|\theta) \\ &=\sum_{i=1}^m\sum_{z_i}P(z_i|x_i,\theta^{(t)})\ln P(x_i,z_i|\theta)\\ \end{aligned}
参考文献
[1] 陈希孺. 概率论与数理统计. 中国科学技术大学出版社, 2009.
[2] 李航. 统计学习方法. 清华大学出版社, 2012.