
Naive Bayes

A generative model.

Given a training dataset, Naive Bayes first learns the joint probability distribution of input and output under the assumption that the features are conditionally independent given the class; then, for a given input $x$, it uses Bayes' theorem to find the output $y$ with the maximum posterior probability. Naive Bayes is simple to implement and highly efficient in both learning and prediction, which makes it a widely used method.

1. Learning and Classification

  • First, learn the prior probability distribution and the conditional probability distribution:

$$
P(Y=c_k),\quad k=1,2,\cdots,K
$$

$$
P(X=x \mid Y=c_{k})=P(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}), \quad k=1,2, \cdots, K
$$

Thus the joint probability distribution $P(X,Y)$ is learned.

Since Naive Bayes assumes the features are conditionally independent given the class, the conditional probability factorizes (reducing the number of parameters from order $K \prod_j S_j$ to $K \sum_j S_j$, where $S_j$ is the number of values feature $j$ can take):
$$
\begin{aligned}
P(X=x \mid Y=c_{k}) &=P(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}) \\
&=\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_{k})
\end{aligned}
$$

  • Parameter learning (maximum likelihood; a counting sketch of these estimates, together with the classifier, appears after this list)
    $$
    P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K
    $$

    $$
    \begin{array}{l}
    P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)} \\
    j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K
    \end{array}
    $$

  • Classifier

$$
y=f(x)=\arg \max_{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}
$$

Since the denominator is the same for every class $c_k$, this is equivalent to $y=\arg \max_{c_{k}} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$.
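
A minimal sketch of the two bullets above, on a tiny hypothetical dataset (all data and names here are made up for illustration): the prior and the conditionals are estimated by counting, and prediction maximizes the numerator directly.

```python
from collections import Counter, defaultdict

# hypothetical toy data: two discrete features, labels in {-1, +1}
X = [(1, 'S'), (1, 'M'), (1, 'M'), (2, 'S'), (2, 'M'),
     (2, 'L'), (3, 'L'), (3, 'M'), (3, 'L'), (1, 'S')]
y = [-1, -1, 1, -1, 1, 1, 1, 1, 1, -1]

N = len(y)
prior = {c: n / N for c, n in Counter(y).items()}  # P(Y=c_k) by counting

# cond[c][j][v] = #{i : x_i^(j) = v and y_i = c}
cond = defaultdict(lambda: defaultdict(Counter))
for x, c in zip(X, y):
    for j, v in enumerate(x):
        cond[c][j][v] += 1

def predict(x):
    # arg max over c of P(Y=c) * prod_j P(X^(j)=x^(j) | Y=c)
    def score(c):
        n_c = y.count(c)
        p = prior[c]
        for j, v in enumerate(x):
            p *= cond[c][j][v] / n_c
        return p
    return max(prior, key=score)

print(predict((2, 'S')))  # -1 on this toy data
```

Note that an unseen (value, class) pair gets count 0 and zeroes out the whole product; Section 3 below addresses exactly this.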

2. Principle (maximizing the posterior probability == minimizing the expected risk)

Suppose we choose the 0-1 loss function:
$$
L(Y,f(X)) = \begin{cases}
1, & Y \neq f(X) \\
0, & Y = f(X)
\end{cases}
$$
The expected risk is
$$
R_{\exp }(f)=E[L(Y, f(X))]
$$
The expectation is taken with respect to the joint distribution $P(X,Y)$; conditioning on $X$ gives
$$
R_{\exp}(f)=E_{X} \sum_{k=1}^{K} L\left(c_{k}, f(X)\right) P\left(c_{k} \mid X\right)
$$
To minimize the expected risk, it suffices to minimize pointwise for each $X=x$:
$$
\begin{aligned}
f(x) &=\arg \min_{y \in \mathcal{Y}} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(c_{k} \mid X=x\right) \\
&=\arg \min_{y \in \mathcal{Y}} \sum_{k=1}^{K} P\left(y \neq c_{k} \mid X=x\right) \\
&=\arg \min_{y \in \mathcal{Y}}\left(1-P\left(y=c_{k} \mid X=x\right)\right) \\
&=\arg \max_{y \in \mathcal{Y}} P\left(y=c_{k} \mid X=x\right)
\end{aligned}
$$
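
A quick numerical check of the last two steps (the posterior values below are made up): under 0-1 loss, the expected loss of predicting class $y$ is $1-P(y \mid x)$, so the risk minimizer and the posterior maximizer coincide.

```python
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])  # hypothetical P(c_k | X=x) over 3 classes
expected_loss = 1.0 - posterior         # sum_k 1[y != c_k] P(c_k|x) = 1 - P(y|x)
print(expected_loss.argmin() == posterior.argmax())  # True
```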

3. Bayesian Estimation (Laplace Smoothing)

Maximum likelihood estimation can yield probability estimates that are exactly 0 (as in the toy sketch above, where an unseen feature value zeroes out the whole product). This distorts the computation of the posterior probability and biases the classification. The remedy is Bayesian estimation. Concretely, the Bayesian estimate of the conditional probability is
$$
P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+S_{j} \lambda}
$$
where $l=1,2,\cdots,S_j$, $k=1,2,\cdots,K$, and $\lambda \geq 0$. This is equivalent to adding a positive count to each possible value of the random variable. $\lambda = 0$ recovers maximum likelihood estimation; the common choice $\lambda = 1$ is called Laplace smoothing. It is easy to verify that the resulting probabilities are all positive and sum to 1.

The Bayesian estimate of the prior probability is
$$
P_{\lambda}\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+\lambda}{N+K \lambda}
$$
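
Continuing the counting sketch from Section 1 (this reuses the hypothetical `y` and `cond` defined there): smoothing just adds $\lambda$ to each count, and with $\lambda=1$ the zero-probability case disappears.

```python
def smoothed_conditional(c, j, v, values_j, lam=1.0):
    # P_lambda(X^(j)=v | Y=c), with S_j = len(values_j) possible values
    n_c = sum(1 for yi in y if yi == c)
    return (cond[c][j][v] + lam) / (n_c + len(values_j) * lam)

def smoothed_prior(c, K, lam=1.0):
    # P_lambda(Y=c)
    return (sum(1 for yi in y if yi == c) + lam) / (len(y) + K * lam)

# value 'S' of feature 1 never co-occurred with class +1;
# it now gets a small positive mass instead of 0
print(smoothed_conditional(1, 1, 'S', values_j=['S', 'M', 'L']))  # 1/9
print(smoothed_prior(-1, K=2))                                    # 5/12
```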

4. Code Implementation

The math above is stated for discrete features, while the iris features are continuous, so the implementation below is Gaussian Naive Bayes: each $P(X^{(j)}=x^{(j)} \mid Y=c_k)$ is modeled by a class-conditional Gaussian density whose mean and standard deviation are estimated per class and per feature.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from collections import Counter
import math
```
```python
# import the iris data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = [
    'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
]
data = np.array(df)
X, Y = data[:, :-1], data[:, -1]

def _shuffle(X, Y):
    randomize = np.arange(len(X))
    np.random.shuffle(randomize)
    return X[randomize], Y[randomize]

X, Y = _shuffle(X, Y)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
```
```python
class NaiveBayes:
    def __init__(self):
        self.model = None
        self.Y_mean = None

    # mean of a sequence of values
    def mean(self, X):
        return sum(X) / float(len(X))

    # standard deviation of a sequence of values
    def std(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Gaussian probability density function
    def gaussian_probability(self, x, mean, std):
        exp = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(std, 2))))
        return (1 / (math.sqrt(2 * math.pi) * std)) * exp

    # per-feature (mean, std) over X_train; the * unpacks X_train so that
    # zip(*X_train) == zip([1,2,3,...], [2,3,4,...], ...), i.e. each
    # iteration yields one feature column
    def summarize(self, X_train):
        summaries = [(self.mean(i), self.std(i)) for i in zip(*X_train)]
        return summaries

    def fit(self, X, Y):
        labels = list(set(Y))
        data = {label: [] for label in labels}
        self.Y_mean = np.zeros(len(labels))  # class priors P(Y=c_k)
        for x, label in zip(X, Y):
            data[label].append(x)
            self.Y_mean[int(label)] += 1.
        self.Y_mean /= len(Y)
        # per-class Gaussian parameters for P(x^(j) | y=c_k)
        self.model = {
            label: self.summarize(value)
            for label, value in data.items()
        }
        return 'gaussianNB train done'

    def calculate_probabilities(self, input_data):
        # self.model: {0: [(mean1,std1),(mean2,std2),(mean3,std3),(mean4,std4)], 1: ...}
        probabilities = {}
        for label, summaries in self.model.items():
            probabilities[label] = self.Y_mean[int(label)]  # P(C_i)
            for i in range(len(summaries)):
                mean, std = summaries[i]
                # accumulate P(x|C_i) * P(C_i)
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, std)
        return probabilities

    def predict(self, X_test):
        # X_test holds multiple instances; classify each one
        result = []
        for i in range(len(X_test)):
            # sorted() is ascending and returns a list, e.g. [(1, 75), (0, 85), (2, 95)],
            # so the last item carries the most probable label
            label = sorted(self.calculate_probabilities(X_test[i]).items(),
                           key=lambda x: x[-1])[-1][0]
            result.append(label)
        return result

    def score(self, X_test, Y_test):
        right = 0
        predictions = self.predict(X_test)
        for i in range(len(Y_test)):
            if predictions[i] == Y_test[i]:
                right += 1
        return right / float(len(Y_test))
```
```python
model = NaiveBayes()
model.fit(X_train, Y_train)
# 'gaussianNB train done'
result1 = model.predict(X_test)
model.score(X_test, Y_test)
```
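
A caveat about `calculate_probabilities`: it multiplies many small densities, which can underflow as the number of features grows. A common alternative (a sketch, not part of the original class) accumulates log-densities instead and could be added to `NaiveBayes` as an extra method:

```python
import math

def calculate_log_probabilities(self, input_data):
    # log-space analogue of calculate_probabilities:
    # log P(c_k) + sum_j log N(x^(j); mean_jk, std_jk)
    log_probs = {}
    for label, summaries in self.model.items():
        lp = math.log(self.Y_mean[int(label)])
        for x, (mean, std) in zip(input_data, summaries):
            lp += (-0.5 * math.log(2 * math.pi) - math.log(std)
                   - (x - mean) ** 2 / (2 * std ** 2))
        log_probs[label] = lp
    return log_probs
```

Predicting with `max(log_probs, key=log_probs.get)` then gives the same label as before, since the logarithm is monotonic.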

scikit-learn example

```python
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
#X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
#Y = np.array([1, 1, 1, 2, 2, 2])
clf.fit(X_train, Y_train)
result2 = clf.predict(X_test)
clf.score(X_test, Y_test)
```
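
As a quick sanity check (assuming the variables from the blocks above are still in scope), the hand-rolled model and scikit-learn's GaussianNB can be compared on the same test split; they should agree on most predictions:

```python
import numpy as np

agreement = np.mean(np.array(result1) == result2)
print(f'prediction agreement: {agreement:.2%}')
```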

Reference:

[1]. https://github.com/fengdu78/lihang-code

[2]. 李航, 《统计学习方法》