K-Nearest Neighbors (KNN)

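KNN is one of the simplest classification algorithms. It is a lazy learning method: there is no explicit training phase. The labeled dataset (feature values plus class labels) is simply stored, and all computation is deferred until a new sample arrives and has to be classified.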

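Put simply, KNN finds the $K$ training samples closest to the new sample and assigns the new sample to the class that the majority of those neighbors belong to. $K$ is usually chosen to be odd, which avoids tied votes in two-class problems, and as a rule of thumb no larger than $\sqrt{n}$, where $n$ is the number of training samples.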

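The distance between two points is usually measured with the Euclidean distance:

$\text{dist}_i = \Vert x^{(i)} - x \Vert = \sqrt{\sum_{j=1}^{d} (x^{(i)}_j - x_j)^2}$

Here $x^{(i)}$ is the $i$-th training sample, $x$ is the new sample, and $d$ is the number of features.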
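As a minimal sketch of the rule described so far, the snippet below computes the Euclidean distance from a query point to every training sample and takes a majority vote among the $K$ nearest. The function name `knn_predict` and the toy data are illustrative assumptions and do not come from the original code.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    # Euclidean distance from x to every training sample
    dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dist)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small clusters in 2-D
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X, y, [0.15, 0.15], k=3))
print(knn_predict(X, y, [0.95, 1.0], k=3))
```

With these two well-separated clusters, the first query lands in class `a` and the second in class `b`.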

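Normalization

When the features are measured on different scales, remember to normalize the data before computing distances; otherwise the feature with the largest numeric range dominates the result. Three common methods are described below.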

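Min-max (0-1) normalization

$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

where $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the feature.

Z-score standardization

$x_{\text{norm}} = \frac{x - \mu}{\sigma}$

where $\mu$ and $\sigma$ are the mean and standard deviation of the feature.

Sigmoid normalization

$x_{\text{norm}} = \frac{1}{1 + e^{-x}}$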
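The three formulas can be sketched in NumPy as follows. The function names and the sample values are assumptions made for illustration, and the z-score version uses the population standard deviation (NumPy's default).

```python
import numpy as np


def min_max_scale(x):
    """Map values linearly onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def z_score_scale(x):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def sigmoid_scale(x):
    """Squash values into (0, 1) with the logistic function."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))


values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical feature column
print(min_max_scale(values))
print(z_score_scale(values))
print(sigmoid_scale(values))
```

Note that min-max and z-score scaling divide by zero when a column is constant, so such columns need separate handling.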

Implementation

The following is a Python implementation:

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
# Import the libraries

dataset = pd.read_table('datingTestSet.txt', header = None)
# Load the data; header = None means the file has no header row

color_map = {'didntLike': 'black', 'smallDoses': 'orange', 'largeDoses': 'red'}
Colors = dataset.iloc[:, -1].map(color_map).tolist()
# Plot first, using a different color for each class (the label is the last column)

pl = plt.figure(figsize = (12, 20))

fig1 = pl.add_subplot(311)
plt.scatter(dataset.iloc[:, 0], dataset.iloc[:, 1], marker = '.', c = Colors)
plt.xlabel('Flight miles')
plt.ylabel('Share of time playing games')
fig2 = pl.add_subplot(312)
plt.scatter(dataset.iloc[:, 0], dataset.iloc[:, 2], marker = '.', c = Colors)
plt.xlabel('Flight miles')
plt.ylabel('Ice cream eaten')
fig3 = pl.add_subplot(313)
plt.scatter(dataset.iloc[:, 1], dataset.iloc[:, 2], marker = '.', c = Colors)
plt.xlabel('Share of time playing games')
plt.ylabel('Ice cream eaten')
# Visualize the data: one scatter plot per pair of features

def normal(dataset):
    maxdf = dataset.iloc[:, :-1].max()
    mindf = dataset.iloc[:, :-1].min()
    normaldata = (dataset.iloc[:, :-1] - mindf) / (maxdf - mindf)
    return normaldata
# Define the normalization function; min-max (0-1) normalization is used here

nor_data = pd.concat([normal(dataset), dataset.iloc[:, -1]], axis = 1)
# Normalize the features and re-attach the label column

def randSplit(dataset, rate = 0.9):
    dataset = dataset.sample(frac = 1).reset_index(drop = True)
    m = dataset.shape[0]
    n = int(m * rate)
    train = dataset.iloc[:n, :]
    test = dataset.iloc[n:, :].reset_index(drop = True)
    return train, test
# Define the function that shuffles the data and splits it into training and test sets

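def my_knn(train, test, k):
    m = train.shape[1] - 1
    n = test.shape[0]
    result = []
    for i in range(n):
        dist = ((train.iloc[:, :m] - test.iloc[i, :m].astype(float)) ** 2).sum(axis = 1) ** 0.5
        # Compute the Euclidean distance to every training sample
        dist_with_label = pd.DataFrame({'distance': dist, 'class': train.iloc[:, -1]})
        # Attach the class label to each distance
        re = dist_with_label.sort_values(by = 'distance')[:k].iloc[:, -1].value_counts().index[0]
        # Take the most common class among the k nearest neighbors
        result.append(re)
    test['predict'] = result
    # Store the predictions as a new last column, next to the true labels
    acc = (test.iloc[:, -1] == test.iloc[:, -2]).mean()
    # Accuracy: fraction of predictions that match the true label
    return acc
# Define the KNN function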
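To see how the choice of $K$ plays out, here is a minimal, self-contained sketch that measures hold-out accuracy for several odd values of $K$. It does not read `datingTestSet.txt`: the two-cluster synthetic data, the function name `knn_accuracy`, and the fixed random seed are all assumptions made for illustration, and the code restates the idea in NumPy rather than reusing the pandas implementation above.

```python
import numpy as np


def knn_accuracy(X_train, y_train, X_test, y_test, k):
    """Fraction of test samples whose k-nearest-neighbor vote matches the true label."""
    correct = 0
    for x, true_label in zip(X_test, y_test):
        # Euclidean distance from this test sample to every training sample
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # Majority vote among the k nearest neighbors
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        correct += int(labels[np.argmax(counts)] == true_label)
    return correct / len(y_test)


# Synthetic stand-in for real data (hypothetical): two well-separated classes, 3 features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 3)),
               rng.normal(1.0, 0.1, size=(50, 3))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then hold out the last 10% as the test set
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
split = int(len(y) * 0.9)
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

for k in (1, 3, 5, 7, 9):  # odd values up to sqrt(90), which is about 9.5
    print(k, knn_accuracy(X_train, y_train, X_test, y_test, k))
```

On real data the accuracies usually differ between values of $K$, and the same loop can be used to pick the best one on a held-out set.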