上了斯坦福Andrew NG 课,把所有的练习用matlab 做完一遍之后感觉意犹未尽,因此决定用pyton 将课内算法逐一实现一遍,以加深理解,同时也避免自己成为调包侠,哈哈,话不多说,进入正题。
Kmeans 是一个经典的无监督聚类算法,算法内容比较容易理解。有兴趣的同学可以百度相关论文研读其内容,这里不再赘述。
Kmeans 算法流程如下:
Input:
-K (聚类数目,即所需分类的数目)
-Training set {X1,X2,X2,X3,...Xm} 总共 M 个训练样本,期中 Xi∈Rn(即 Xi是n维向量空间)
训练样本可以用一个 m*n 矩阵标示。行数目m代表着有m个训练样本,列数目n代表这每个训练样本有n个特征。
Algorithm
1. 初始化K 个聚类中心,采用随机初始化方法,即从训练样本m中随机选取K个样本作为初始化聚类中心。
Randomly initialize K cluster centroids, u1,u2,....uk.
2. Repeat
{
遍历所有样本点,根据其距聚类中心的距离(本算法采用欧式距离)将其归类到对应的分类项中,可以简洁的表述为两个循环
step1: Cluster assignment step (分配聚类中心)

For i = 1 to m: 
 { 
Ci=j that minimizes ||Xi−uj||2
Ci:index from 1 to k of cluster centroid cloest to Xi
Ci 可以理解为是训练样本Xi所属的聚类中心的index, 
 例如,X2 距离u3 最近,则C2 =3 
 } 
step2: Move centroid (移动聚类中心) 
 For k =1 to k: 
 { 
uK=1|Ck|∗∑i∈CkXi
 其中Ck是某一聚类中心k的训练样本个数。例如, 某两个样本X3,X5 所属的聚类中心均为2,即C3=2,C5=2, 那么u2=12∗(x3+x5)
 } 
 }


重复迭代以上step1 和step2数次后,即可得到结果。

多说一点:
1.由于采用了随机初始化聚类中心的方法,因为每次运行的结果不尽相同。为了优化结果,在初始化聚类中心时,我们可以通过多次选取聚类中心,以损失函数值最小的那一组作为初始化聚类中心。可以简洁表述

为: 
 For i = 1 to 100: 
 { 
 Randomly initialize K means. 
 Run K means once, get C1...Cm,u1...uk
 Compute cost function : 
J(C1...Cm,u1...uk), 选择J 做小的那一组值作为初始化聚类中心。 
 其中J=1m∗∑i=1m||Xi−uic||2
 }


2. 有时我们并不知道我们的Kmeans算法K值应该选取多少,最优的K又是多少呢? 这里我们可以采用elbow method. 绘出cost function J value随K变化的曲线,该曲线的拐点所对应的K值通常就是最优的分类书目。
源代码如下:这个版本的源代码并没有讲算法完全封装起来,便于调试。初始化构造函数部分根据不同测试用例需要调整。本代码测试用例是31省人均消费情况,数据下载:聚类文件夹下[(https://pan.baidu.com/s/1eR7doh8)]聚类文件夹下。

import numpy as np
import numpy.random as random
from numpy import linalg as LA
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import spline

class KMeans:
    def __init__(self,filePath,data=None,xRow = 0,xCol = 0):
        self.filePath = filePath
        fr = open(filePath,'r+')
        lines = fr.readlines()
        retData = []
        retCityName = []
        for line in lines:
            items = line.strip().split(",")
            retCityName.append(items[0])
            retData.append([float(items[i])  for i in range(1,len(items))])
        self.dataName = retCityName
        self.data = np.array(retData)
        self.xRow, self.xCol = [i for i in self.data.shape]

    def IntialCentroids(self,K):
        randomX = random.permutation(self.data)
        centroids = randomX[0:K, :]
        idx = self.findClosetCentroids(centroids)
        jSumMin = self.cosFunc(idx,centroids)
        #Random intial and choose the best begin Centroids with lowest cost function value J
        for i in range(100):
            randomX = random.permutation(self.data)
            tempCentroids = randomX[0:K, :]
            idx = self.findClosetCentroids(tempCentroids)
            newCentroids = self.computeCentroids(idx,K)
            jSum = self.cosFunc(idx,newCentroids)
            if jSum < jSumMin:
                centroids = newCentroids
                jSumMin = jSum
            else:
                continue
        return centroids

    def cosFunc(self,idx,centroids):
        jSum = 0
        for i in range(self.xRow):
            jSum = LA.norm(self.data[i,:] - centroids[int(idx[i]-1), :]) + jSum
        jSum = jSum/(self.xRow)
        return jSum

    def findClosetCentroids(self, centroids):
        kSize, kFeatures = [i for i in centroids.shape]
        idx = np.zeros((self.xRow, 1))
        for i in range(self.xRow):
            idx[i] = 1
            norm_min = LA.norm(self.data[i, :] - centroids[1, :])
            for j in range(kSize):
                if LA.norm(self.data[i, :] - centroids[j, :]) < norm_min:
                    norm_min = LA.norm(self.data[i, :] - centroids[j, :])
                    idx[i] = j
        return idx

    def computeCentroids(self,idx,K):
        centroids_matrix = np.zeros((K,self.xRow))
        centroids = np.zeros((K,self.xCol))
        CK = np.zeros((K,1))
        for i in range (self.xRow):
            centroids_matrix[int(idx[i]),i] = 1
            CK[int(idx[i])] = CK[int(idx[i])] + 1
        for i in range(K):
            centroids_Temp = np.dot(centroids_matrix,self.data)
            centroids[i,:] = centroids_Temp[i,:]/CK[i,:]
        return centroids

    def jCostPlot(self,kNum):
        x_data_plot = []
        y_data_plot = []
        for i in range(kNum):
            x_data_plot.append(i + 2)
            centroids = U.IntialCentroids(i + 2)
            for j in range(300):
                idx = U.findClosetCentroids(centroids)
                centroids = U.computeCentroids(idx, i + 2)
                jCos = U.cosFunc(idx, centroids)
            y_data_plot.append(jCos)
        x_data = np.array(x_data_plot)
        y_data = np.array(y_data_plot)
        xNew = np.linspace(x_data.min(), x_data.max(), 300)
        ySmooth = spline(x_data, y_data, xNew)
        plt.plot(xNew, ySmooth)
        plt.show()
        return None

if __name__=='__main__':
    U = KMeans('city.txt')
    centroids = U.IntialCentroids(4)
    for i in range(300):
        idx = U.findClosetCentroids(centroids)
        centroids = U.computeCentroids(idx, 4)
    labeled_City = [[], [], [], []]
    for i in range(len(U.dataName)):
        labeled_City[int(idx[i])].append([U.dataName[i]])
    for i in range(4):
        print(labeled_City[i])
    '''If want to see the cost function value change accompanied with K value, use the
        U.jCostPlot(K_Num), K_Num is the classification numbers , u can choose 10 to see what
        will plot out. According to eblow methond, u can find the best classification numbers.