上了斯坦福Andrew NG 课,把所有的练习用matlab 做完一遍之后感觉意犹未尽,因此决定用pyton 将课内算法逐一实现一遍,以加深理解,同时也避免自己成为调包侠,哈哈,话不多说,进入正题。
Kmeans 是一个经典的无监督聚类算法,算法内容比较容易理解。有兴趣的同学可以百度相关论文研读其内容,这里不再赘述。
Kmeans 算法流程如下:
Input:
-K (聚类数目,即所需分类的数目)
-Training set {X1,X2,X2,X3,...Xm} 总共 M 个训练样本,期中 Xi∈Rn(即 Xi是n维向量空间)
训练样本可以用一个 m*n 矩阵标示。行数目m代表着有m个训练样本,列数目n代表这每个训练样本有n个特征。
Algorithm
1. 初始化K 个聚类中心,采用随机初始化方法,即从训练样本m中随机选取K个样本作为初始化聚类中心。
Randomly initialize K cluster centroids, u1,u2,....uk.
2. Repeat
{
遍历所有样本点,根据其距聚类中心的距离(本算法采用欧式距离)将其归类到对应的分类项中,可以简洁的表述为两个循环
step1: Cluster assignment step (分配聚类中心)
For i = 1 to m:
{
Ci=j that minimizes ||Xi−uj||2
Ci:index from 1 to k of cluster centroid cloest to Xi
Ci 可以理解为是训练样本Xi所属的聚类中心的index,
例如,X2 距离u3 最近,则C2 =3
}
step2: Move centroid (移动聚类中心)
For k =1 to k:
{
uK=1|Ck|∗∑i∈CkXi
其中Ck是某一聚类中心k的训练样本个数。例如, 某两个样本X3,X5 所属的聚类中心均为2,即C3=2,C5=2, 那么u2=12∗(x3+x5)
}
}
重复迭代以上step1 和step2数次后,即可得到结果。
多说一点:
1.由于采用了随机初始化聚类中心的方法,因为每次运行的结果不尽相同。为了优化结果,在初始化聚类中心时,我们可以通过多次选取聚类中心,以损失函数值最小的那一组作为初始化聚类中心。可以简洁表述
为:
For i = 1 to 100:
{
Randomly initialize K means.
Run K means once, get C1...Cm,u1...uk
Compute cost function :
J(C1...Cm,u1...uk), 选择J 做小的那一组值作为初始化聚类中心。
其中J=1m∗∑i=1m||Xi−uic||2
}
2. 有时我们并不知道我们的Kmeans算法K值应该选取多少,最优的K又是多少呢? 这里我们可以采用elbow method. 绘出cost function J value随K变化的曲线,该曲线的拐点所对应的K值通常就是最优的分类书目。
源代码如下:这个版本的源代码并没有讲算法完全封装起来,便于调试。初始化构造函数部分根据不同测试用例需要调整。本代码测试用例是31省人均消费情况,数据下载:聚类文件夹下[(https://pan.baidu.com/s/1eR7doh8)]聚类文件夹下。
import numpy as np
import numpy.random as random
from numpy import linalg as LA
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import spline
class KMeans:
def __init__(self,filePath,data=None,xRow = 0,xCol = 0):
self.filePath = filePath
fr = open(filePath,'r+')
lines = fr.readlines()
retData = []
retCityName = []
for line in lines:
items = line.strip().split(",")
retCityName.append(items[0])
retData.append([float(items[i]) for i in range(1,len(items))])
self.dataName = retCityName
self.data = np.array(retData)
self.xRow, self.xCol = [i for i in self.data.shape]
def IntialCentroids(self,K):
randomX = random.permutation(self.data)
centroids = randomX[0:K, :]
idx = self.findClosetCentroids(centroids)
jSumMin = self.cosFunc(idx,centroids)
#Random intial and choose the best begin Centroids with lowest cost function value J
for i in range(100):
randomX = random.permutation(self.data)
tempCentroids = randomX[0:K, :]
idx = self.findClosetCentroids(tempCentroids)
newCentroids = self.computeCentroids(idx,K)
jSum = self.cosFunc(idx,newCentroids)
if jSum < jSumMin:
centroids = newCentroids
jSumMin = jSum
else:
continue
return centroids
def cosFunc(self,idx,centroids):
jSum = 0
for i in range(self.xRow):
jSum = LA.norm(self.data[i,:] - centroids[int(idx[i]-1), :]) + jSum
jSum = jSum/(self.xRow)
return jSum
def findClosetCentroids(self, centroids):
kSize, kFeatures = [i for i in centroids.shape]
idx = np.zeros((self.xRow, 1))
for i in range(self.xRow):
idx[i] = 1
norm_min = LA.norm(self.data[i, :] - centroids[1, :])
for j in range(kSize):
if LA.norm(self.data[i, :] - centroids[j, :]) < norm_min:
norm_min = LA.norm(self.data[i, :] - centroids[j, :])
idx[i] = j
return idx
def computeCentroids(self,idx,K):
centroids_matrix = np.zeros((K,self.xRow))
centroids = np.zeros((K,self.xCol))
CK = np.zeros((K,1))
for i in range (self.xRow):
centroids_matrix[int(idx[i]),i] = 1
CK[int(idx[i])] = CK[int(idx[i])] + 1
for i in range(K):
centroids_Temp = np.dot(centroids_matrix,self.data)
centroids[i,:] = centroids_Temp[i,:]/CK[i,:]
return centroids
def jCostPlot(self,kNum):
x_data_plot = []
y_data_plot = []
for i in range(kNum):
x_data_plot.append(i + 2)
centroids = U.IntialCentroids(i + 2)
for j in range(300):
idx = U.findClosetCentroids(centroids)
centroids = U.computeCentroids(idx, i + 2)
jCos = U.cosFunc(idx, centroids)
y_data_plot.append(jCos)
x_data = np.array(x_data_plot)
y_data = np.array(y_data_plot)
xNew = np.linspace(x_data.min(), x_data.max(), 300)
ySmooth = spline(x_data, y_data, xNew)
plt.plot(xNew, ySmooth)
plt.show()
return None
if __name__=='__main__':
U = KMeans('city.txt')
centroids = U.IntialCentroids(4)
for i in range(300):
idx = U.findClosetCentroids(centroids)
centroids = U.computeCentroids(idx, 4)
labeled_City = [[], [], [], []]
for i in range(len(U.dataName)):
labeled_City[int(idx[i])].append([U.dataName[i]])
for i in range(4):
print(labeled_City[i])
'''If want to see the cost function value change accompanied with K value, use the
U.jCostPlot(K_Num), K_Num is the classification numbers , u can choose 10 to see what
will plot out. According to eblow methond, u can find the best classification numbers.