I. Definition and Principles

The KNN algorithm classifies a sample by computing the distance or similarity between individuals. It works with almost any dataset, but its computational cost can be high.

To classify a new sample, find the k records in the training set closest to it, then decide the new sample's class from the classes of those k records. The keys to using KNN are therefore the choice of training and test sets, the distance or similarity metric, the value of k, and the classification decision rule.

Advantages: there are no parameters to estimate in the statistical or machine-learning sense, and no training step is required (though the accuracy of the dataset itself matters). It is well suited to classifying rare events and to multi-class problems, where it can outperform SVM.

Disadvantages: classifying test samples is computationally expensive and memory-intensive.
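The idea above can be sketched in a few lines of plain Python — a toy illustration of the principle, not the sklearn implementation used in the examples later: find the k nearest training points by Euclidean distance and take a majority vote.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    dists = [(math.dist(query, x), label) for x, label in zip(train_X, train_y)]
    dists.sort(key=lambda pair: pair[0])        # nearest first
    votes = [label for _, label in dists[:k]]   # labels of the k nearest
    return Counter(votes).most_common(1)[0][0]  # majority class

# Toy 2-D dataset: two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(train_X, train_y, (1, 1)))  # -> a
print(knn_predict(train_X, train_y, (5, 4)))  # -> b
```

This also makes the cost concrete: every prediction scans the whole training set, which is exactly the computational drawback noted above.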


II. Example Applications

1. Predicting car destinations with KNN


Competition background

A research team at Northeastern University in the US found that 93% of human behavior is predictable; this is precisely what makes rational traffic planning and orderly urban development possible. In practice, though, prediction is not so easy, and a lack of data is a major reason. So with limited data, to what extent can we predict people's behavior?

Prediction approach:

1. Split the start time (e.g. 20:20:34) into hourly buckets 0-23; map the date (e.g. 2018-09-09) to 0 for workdays, 1 for weekends, and 2 for public holidays;

2. Convert the end-position latitude/longitude in the training set to a string with geohash (a Python package) and use it as the class label;

3. Feed the data processed as above into KNN.
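Step 2's geohash encoding maps a (latitude, longitude) pair to a short base-32 string in which nearby points share a common prefix; this is what turns continuous destination coordinates into discrete class labels. Below is a minimal from-scratch sketch of the standard encoding algorithm (the competition code itself uses the ready-made geohash package instead):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, ch, bit_count = [], 0, 0
    even = True  # even-numbered bits refine longitude, odd ones latitude
    while len(chars) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val > mid:               # point is in the upper half: emit a 1 bit
            ch = (ch << 1) | 1
            rng[0] = mid
        else:                       # lower half: emit a 0 bit
            ch = ch << 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:          # every 5 bits become one base-32 character
            chars.append(BASE32[ch])
            ch, bit_count = 0, 0
    return "".join(chars)

# Classic reference point (Jutland, Denmark).
print(geohash_encode(57.64911, 10.40744, 11))  # -> u4pruydqqvj
```

Because precision-8 strings cover cells of roughly 38 m × 19 m, two trips ending at the same parking lot usually get the same label, which is what makes the classification framing workable.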

Data format:

Training data:
r_key,out_id,start_time,end_time,start_lat,start_lon,end_lat,end_lon
SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,2016061820000b,2018-01-20 10:13:43,2018-01-20 10:19:04,33.783415000000005,111.60366,33.779810999999995,111.60588500000001
SDK-XJ_4c2f29d94c9478623711756e4ae34cc5,2016061820000b,2018-02-12 17:40:51,2018-02-12 17:58:13,34.810763,115.549264,34.814875,115.549374
SDK-XJ_3570183177536a575b9da67a86efcd62,2016061820000b,2018-02-13 14:52:24,2018-02-13 15:24:33,34.640284,115.539024,34.813136,115.559243

Test data:
r_key,out_id,start_time,start_lat,start_lon
f6fa6b2a1fa250b3_SDK-XJ_eed80f24f496fc9a59f49e031edfe9b8,358962079107966,2018-09-01 15:54:12,43.943356,125.37771799999999
a584728d1eb0fb5b_SDK-XJ_d60de6f0b8121b07383e80c0b176d0fa,358962079111695,2018-09-01 13:16:11,43.886501,125.272971
7308d46abc5ec4d0_SDK-XJ_6dd3f0f118e9813c51ed224ed09444c2,358962079120563,2018-09-01 18:08:36,43.867917,125.30785300000001

Code:

# -*- coding: utf-8 -*-
import datetime
import os
import time
from collections import Counter

import geohash
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle

"""
汽车目的地智能预测大赛_knn
"""


def datetime_to_period(date_str):
    """
    Split the time of day into 24 buckets.
    Returns: an integer from 0 to 23.
    """
    time_part = date_str.split(" ")[1]  # time-of-day part
    hour_part = int(time_part.split(":")[0])  # hour
    return hour_part


def date_to_period(date_str):
    """
    Map a date to its day type.
    Returns: 0 for workdays, 1 for weekends, 2 for public holidays.
    """
    holiday_list = ['2018-01-01', '2018-02-15', '2018-02-16', '2018-02-17', '2018-02-18', '2018-02-19',
                    '2018-02-20', '2018-02-21', '2018-04-05', '2018-04-06', '2018-04-07', '2018-04-29',
                    '2018-04-30', '2018-05-01', '2018-06-16', '2018-06-17', '2018-06-18']  # public holidays
    switch_workday_list = ['2018-02-11', '2018-02-24', '2018-04-08', '2018-04-28']  # make-up workdays around holidays
    workday_list = ['1', '2', '3', '4', '5']  # Monday to Friday
    weekday_list = ['0', '6']  # Saturday and Sunday, where 0 is Sunday
    date = date_str.split(" ")[0]  # date part
    whatday = datetime.datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S').strftime("%w")  # day of the week
    if date in holiday_list:
        return 2
    elif date in switch_workday_list:
        return 0
    elif whatday in workday_list:
        return 0
    elif whatday in weekday_list:
        return 1


time_start = time.asctime(time.localtime(time.time()))  # program start time

# --- Remove stale output files ---
path = "C:\\Users\\yang\\Desktop\\"  # output directory
lst = ['predict', 'predict_result', 'view', 'score', 'train']  # file names
for file_name in lst:
    file_path = path + file_name + '.csv'
    if os.path.exists(file_path):
        os.remove(file_path)

train_data_path = "C:\\Users\\yang\\Desktop\\train_new.csv"
train_data = pd.read_csv(train_data_path, low_memory=False)
test_data_path = "C:\\Users\\yang\\Desktop\\test_new.csv"
test_data = pd.read_csv(test_data_path, low_memory=False)

n = 0
test_out_id = Counter(test_data['out_id'])
for out_id in test_out_id.keys():
    # ---- Add derived columns to the training data (begin) ----
    train = train_data[train_data['out_id'] == out_id]  # rows for this out_id
    train = shuffle(train)  # shuffle the rows
    train['start_code'] = None  # geohash of the start position
    train['end_code'] = None  # geohash of the end position
    train['period'] = None  # hour-of-day bucket (0-23)
    train['week_code'] = None  # workday/weekend/holiday code
    for i in range(len(train)):
        train.iloc[i, 8] = geohash.encode(train.iloc[i, 4], train.iloc[i, 5], 8)  # start geohash
        train.iloc[i, 9] = geohash.encode(train.iloc[i, 6], train.iloc[i, 7], 8)  # end geohash (the label)
        train.iloc[i, 10] = datetime_to_period(train.iloc[i, 2])  # hour bucket
        train.iloc[i, 11] = date_to_period(train.iloc[i, 2])  # day-type code
    # ---- Add derived columns to the training data (end) ----

    # ---- Add derived columns to the test data (begin) ----
    test = test_data[test_data['out_id'] == out_id]
    test = shuffle(test)  # shuffle the rows
    test['period'] = None
    test['week_code'] = None
    test['start_code'] = None
    test['predict'] = None
    for i in range(len(test)):
        test.iloc[i, 5] = datetime_to_period(test.iloc[i, 2])  # hour bucket
        test.iloc[i, 6] = date_to_period(test.iloc[i, 2])  # day-type code
        test.iloc[i, 7] = geohash.encode(test.iloc[i, 3], test.iloc[i, 4], 8)  # start geohash
    # ---- Add derived columns to the test data (end) ----

    # --- KNN (begin) ---
    knn = KNeighborsClassifier(n_neighbors=10, weights='distance', algorithm='auto', p=2)
    knn.fit(train[['start_lat', 'start_lon', 'period', 'week_code']], train['end_code'])
    predict = knn.predict(test[['start_lat', 'start_lon', 'period', 'week_code']])
    test['predict'] = predict
    # --- KNN (end) ---
    if n == 0:
        test.to_csv("C:\\Users\\yang\\Desktop\\predict.csv", mode='a', encoding='utf-8', index=False,
                    header=True)  # write the header only once
    else:
        test.to_csv("C:\\Users\\yang\\Desktop\\predict.csv", mode='a', encoding='utf-8', index=False,
                    header=False)
    if n % 500 == 0:
        print("Processed: " + str(n) + " " + time.asctime(time.localtime(time.time())))
    n = n + 1
print("输出结果:\n")
df = pd.read_csv("C:\\Users\\yang\\Desktop\\predict.csv")  # 预测结果文件
df['end_lat'] = None
df['end_lon'] = None
for i in range(len(df)):
    site = geohash.decode(df.iloc[i, 8])
    df.iloc[i, 9] = site[0]  # 预测横坐标
    df.iloc[i, 10] = site[1]  # 预测纵坐标
    if i % 5000 == 0:
        print("已运行" + str(i))
df = df[['r_key', 'end_lat', 'end_lon']]
df.to_csv("C:\\Users\\yang\\Desktop\\predict_result.csv", encoding='utf-8', index=False)

print('Program start time:', time_start)
print('Program end time:', time.asctime(time.localtime(time.time())))
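One pitfall worth noting: `KNeighborsClassifier(n_neighbors=10)` raises an error whenever a given out_id has fewer than 10 training rows, so in practice the per-group k needs a guard. A minimal sketch of such a cap (not part of the original competition code):

```python
def capped_k(n_train, k=10):
    """Return a usable number of neighbors for a group with n_train samples."""
    # Never ask for more neighbors than there are samples, and always at least 1.
    return max(1, min(k, n_train))

print(capped_k(3))   # -> 3
print(capped_k(50))  # -> 10
```

The per-group loop could then build each classifier with `n_neighbors=capped_k(len(train))` instead of a fixed 10.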

2. Classifying ionosphere data with KNN

The goal is to classify the data with KNN. Each row of the dataset has 35 values: the first 34 are antenna readings, and the last is the target, 'g' or 'b' (good or bad).


Sample of the data:

1,0,0.47337,0.19527,0.06213,-0.18343,0.62316,0.01006,0.45562,-0.04438,0.56509,0.01775,0.44675,0.27515,0.71598,-0.03846,0.55621,0.12426,0.41420,0.11538,0.52767,0.02842,0.51183,-0.10651,0.47929,-0.02367,0.46514,0.03259,0.53550,0.25148,0.31953,-0.14497,0.34615,-0.00296,g
1,0,0.59887,0.14689,0.69868,-0.13936,0.85122,-0.13936,0.80979,0.02448,0.50471,0.02825,0.67420,-0.04520,0.80791,-0.13748,0.51412,-0.24482,0.81544,-0.14313,0.70245,-0.00377,0.33333,0.06215,0.56121,-0.33145,0.61444,-0.16837,0.52731,-0.02072,0.53861,-0.31262,0.67420,-0.22034,g
1,0,0.84713,-0.03397,0.86412,-0.08493,0.81953,0,0.73673,-0.07643,0.71975,-0.13588,0.74947,-0.11677,0.77495,-0.18684,0.78132,-0.21231,0.61996,-0.10191,0.79193,-0.15711,0.89384,-0.03397,0.84926,-0.26115,0.74115,-0.23312,0.66242,-0.22293,0.72611,-0.37792,0.65817,-0.24841,g
1,0,0.87772,-0.08152,0.83424,0.07337,0.84783,0.04076,0.77174,-0.02174,0.77174,-0.05707,0.82337,-0.10598,0.67935,-0.00543,0.88043,-0.20924,0.83424,0.03261,0.86413,-0.05978,0.97283,-0.27989,0.85054,-0.18750,0.83705,-0.10211,0.85870,-0.03261,0.78533,-0.10870,0.79076,-0.00543,g
1,0,0.74704,-0.13241,0.53755,0.16996,0.72727,0.09486,0.69565,-0.11067,0.66798,-0.23518,0.87945,-0.19170,0.73715,0.04150,0.63043,-0.00395,0.63636,-0.11858,0.79249,-0.25296,0.66403,-0.28656,0.67194,-0.10474,0.61847,-0.12041,0.60079,-0.20949,0.37549,0.06917,0.61067,-0.01383,g
1,0,0.46785,0.11308,0.58980,0.00665,0.55432,0.06874,0.47894,-0.13969,0.52993,0.01330,0.63858,-0.16186,0.67849,-0.03326,0.54545,-0.13525,0.52993,-0.04656,0.47894,-0.19512,0.50776,-0.13525,0.41463,-0.20177,0.53930,-0.11455,0.59867,-0.02882,0.53659,-0.11752,0.56319,-0.04435,g
1,0,0.88116,0.27475,0.72125,0.42881,0.61559,0.63662,0.38825,0.90502,0.09831,0.96128,-0.20097,0.89200,-0.35737,0.77500,-0.65114,0.62210,-0.78768,0.45535,-0.81856,0.19095,-0.83943,-0.08079,-0.78334,-0.26356,-0.67557,-0.45511,-0.54732,-0.60858,-0.30512,-0.66700,-0.19312,-0.75597,g
1,0,0.93147,0.29282,0.79917,0.55756,0.59952,0.71596,0.26203,0.92651,0.04636,0.96748,-0.23237,0.95130,-0.55926,0.81018,-0.73329,0.62385,-0.90995,0.36200,-0.92254,0.06040,-0.93618,-0.19838,-0.83192,-0.46906,-0.65165,-0.69556,-0.41223,-0.85725,-0.13590,-0.93953,0.10007,-0.94823,g
1,0,0.88241,0.30634,0.73232,0.57816,0.34109,0.58527,0.05717,1,-0.09238,0.92118,-0.62403,0.71996,-0.69767,0.32558,-0.81422,0.41195,-1,-0.00775,-0.78973,-0.41085,-0.76901,-0.45478,-0.57242,-0.67605,-0.31610,-0.81876,-0.02979,-0.86841,0.25392,-0.82127,0.00194,-0.81686,g
1,0,0.83479,0.28993,0.69256,0.47702,0.49234,0.68381,0.21991,0.86761,-0.08096,0.85011,-0.35558,0.77681,-0.52735,0.58425,-0.70350,0.31291,-0.75821,0.03939,-0.71225,-0.15317,-0.58315,-0.39168,-0.37199,-0.52954,-0.16950,-0.60863,0.08425,-0.61488,0.25164,-0.48468,0.40591,-0.35339,g
1,0,0.92870,0.33164,0.76168,0.62349,0.49305,0.84266,0.21592,0.95193,-0.13956,0.96167,-0.47202,0.83590,-0.70747,0.65490,-0.87474,0.36750,-0.91814,0.05595,-0.89824,-0.26173,-0.73969,-0.54069,-0.50757,-0.74735,-0.22323,-0.86122,0.07810,-0.87159,0.36021,-0.78057,0.59407,-0.60270,g


Approach: first, split the data into features and labels and make predictions with the model directly; then, add cross-validation to the KNN model for comparison (taking the mean of the fold scores as the final output, though the best score could be chosen instead). The cross-validation step uses cross_val_score() from sklearn.

Parameter settings: machine learning and deep learning both involve choosing parameters or hyperparameters to improve a model's generalization. In most cases parameters are chosen based on training results, and there is likewise no fixed rule for choosing k: one can try a range of values, say 1 to 20, validate each, and pick the best-performing one.

Code:

Model 1:

import numpy as np
import csv
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Feature matrix: 351 samples with 34 float features each
X = np.zeros((351, 34), dtype="float")
# Class labels
y = np.zeros((351,), dtype="bool")

# Read the dataset and fill the arrays
with open("ionosphere.data", "r") as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]
        X[i] = data
        y[i] = row[-1] == 'g'  # True for 'g' (good), False for 'b' (bad)


# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# Build the KNN model
estimator = KNeighborsClassifier()
# Train on the training data
estimator.fit(X_train, y_train)
# Evaluate on the test set
y_predicted = estimator.predict(X_test)
# Compute the accuracy
accuracy = np.mean(y_test == y_predicted) * 100
print("{:.1f}%".format(accuracy))

Output:

86.4%

Model 2: with cross-validation

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt
import numpy as np

# Mean cross-validation score for each candidate k
avg_score = []
# Candidate k values from 1 to 20
para = list(range(1, 21))
# Cross-validate each k and record the mean score
for k in para:
    estimator = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(estimator, X, y, scoring="accuracy")
    avg_score.append(np.mean(score))

# Plot mean accuracy against k
plt.plot(para, avg_score, "-o")
plt.show()

Note: this block continues from the previous one (it reuses X and y). Result: 82.3%.

After adding cross-validation (taking the mean of the fold scores as the output), the result actually dropped; alternatively, the best cross-validated score could be reported instead. Machine learning results depend heavily on the data.
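Picking the best k instead of averaging, as suggested above, is a one-liner once the per-k means are in hand. In this sketch the score values are made up purely for illustration:

```python
para = list(range(1, 21))  # candidate k values, as in Model 2
# Hypothetical mean cross-validation accuracies, one per k (illustrative only).
avg_score = [0.78, 0.80, 0.83, 0.85, 0.84, 0.86, 0.85, 0.84, 0.83, 0.83,
             0.82, 0.82, 0.81, 0.81, 0.80, 0.80, 0.79, 0.79, 0.78, 0.78]
best_k = para[avg_score.index(max(avg_score))]  # k with the highest mean score
print(best_k, max(avg_score))  # -> 6 0.86
```

A model refit with best_k on the full training set would then give the final predictions.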

Finally:

KNN is a solid algorithm for both classification and regression and shows up in many different applications. This article selected and drew on just two examples whose backgrounds differ greatly, yet the algorithm is applied in much the same way in both. Success hinges on analyzing the scenario, on how the data is processed and formatted, and on the overall approach. When facing problems from different domains and backgrounds, there is considerable flexibility in choosing a data-processing method and a model, and these choices need careful thought to obtain good results.