1. Types of dimensionality reduction
MLlib provides two closely related dimensionality reduction models: principal component analysis (PCA) and singular value decomposition (SVD).
(1) Principal component analysis
Principal component analysis (PCA) is a statistical method that rotates the data: in essence it performs a change of basis in a linear space such that the variance of the data projected onto the new "axes" is maximized. The axes along which the projected variance is small are then discarded, and the remaining axes, called principal components, represent the properties of the original data as faithfully as possible in a lower-dimensional subspace. PCA is widely used across statistics and machine learning and is one of the most common dimensionality reduction methods. It can be implemented in several concrete ways, for example via the eigendecomposition of the covariance matrix, or via the SVD described below.
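To make the covariance-based route concrete, here is a minimal plain-Scala sketch (the tiny 2-D dataset and the closed-form 2 × 2 eigenvalue formula are illustrative only, not MLlib's implementation): the top eigenvalue of the sample covariance matrix is the variance captured by the first principal component.

```scala
object PcaCovSketch {
  // Tiny illustrative 2-D dataset (made up for this sketch).
  val data: Array[Array[Double]] = Array(
    Array(2.5, 2.4), Array(0.5, 0.7), Array(2.2, 2.9),
    Array(1.9, 2.2), Array(3.1, 3.0), Array(2.3, 2.7))

  // Sample covariance matrix of the mean-centered data.
  def covariance(x: Array[Array[Double]]): Array[Array[Double]] = {
    val n = x.length
    val d = x(0).length
    val mean = Array.tabulate(d)(j => x.map(_(j)).sum / n)
    Array.tabulate(d, d) { (i, j) =>
      x.map(r => (r(i) - mean(i)) * (r(j) - mean(j))).sum / (n - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val c = covariance(data)
    // Closed-form top eigenvalue of a symmetric 2 x 2 matrix:
    // this is the variance along the first principal component.
    val tr  = c(0)(0) + c(1)(1)
    val det = c(0)(0) * c(1)(1) - c(0)(1) * c(1)(0)
    val lambda1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2
    println(s"variance along PC1: $lambda1")
  }
}
```

For higher-dimensional data the eigendecomposition is done numerically, which is what MLlib handles internally.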
(2) Singular value decomposition
SVD factorizes an m * n matrix X into three matrices:
U: an m * m matrix, called the left singular matrix
S: an m * n diagonal matrix whose diagonal entries are the singular values
V^T: an n * n matrix, called the right singular matrix
so that
X = U * S * V^T
Multiplying U, S, and V^T therefore recovers the original matrix exactly, which by itself performs no dimensionality reduction. In practice we keep only the top k singular values, which already capture most of the singular "energy" (i.e., most of the structure of the data), and discard the rest. Keeping only the top k singular values, X is approximated by
X ≈ U_k * S_k * V_k^T
where U_k is m * k, S_k is a k * k diagonal matrix, and V_k^T is k * n.
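This reconstruction can be sketched in a few lines of plain Scala (toy matrices only, no Spark): scale the columns of U_k by the singular values, then multiply by the transpose of V_k.

```scala
object SvdReconstruct {
  // Multiply two matrices stored as arrays of rows.
  def matmul(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] =
    Array.tabulate(a.length, b(0).length) { (i, j) =>
      a(i).indices.map(l => a(i)(l) * b(l)(j)).sum
    }

  // X_k = U_k * S_k * V_k^T: scale each column of U_k by its
  // singular value, then multiply by the transpose of V_k.
  def reconstruct(u: Array[Array[Double]], s: Array[Double],
                  v: Array[Array[Double]]): Array[Array[Double]] = {
    val us = u.map(row => row.zip(s).map { case (x, sv) => x * sv })
    val vt = Array.tabulate(s.length, v.length)((i, j) => v(j)(i))
    matmul(us, vt)
  }

  def main(args: Array[String]): Unit = {
    // Toy orthonormal factors: U_k is 3 x 2, s holds 2 singular
    // values, V_k is 2 x 2, so X_k comes out 3 x 2.
    val u = Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(0.0, 0.0))
    val s = Array(3.0, 2.0)
    val v = Array(Array(1.0, 0.0), Array(0.0, 1.0))
    reconstruct(u, s, v).foreach(r => println(r.mkString(" ")))
  }
}
```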
2. Code examples
In Spark, both SVD and PCA are implemented on RowMatrix and IndexedRowMatrix.
(1) SVD example
The dataset used here is an array of vectors I prepared myself:
[1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0]
[0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Convert the data to an RDD and wrap it in a RowMatrix
val matrix = new RowMatrix(spark.sparkContext.parallelize(source, 2))
// computeSVD keeps the top 3 singular values; the second parameter
// controls whether U is computed as well (it defaults to false)
val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(3, computeU = true)
// Print V, s and U for inspection
println(svd.V)
println("--------------")
println(svd.s)
println("--------------")
svd.U.rows.foreach(println)
V:
-0.7886751345948209 2.8310687127941492E-14 7.019940184704865E-13
8.955633029205415E-17 -2.3635625206605712E-16 -0.3584926384897496
8.295165706028774E-17 -5.659843664451338E-16 -0.733586869304055
-8.958728223252302E-16 0.8660254037844384 -2.220446049250313E-16
3.0593819819143217E-16 -1.5298387709094965E-15 -5.112897073717542E-14
-0.21132486540521578 1.0902390101819037E-13 2.6206259384764508E-12
-7.967072787865741E-19 -5.732309946204197E-18 -7.238036108091668E-17
-0.5773502691896046 -8.060219158778636E-14 -1.918576408854733E-12
-1.3693123044784444E-18 -4.084351409232032E-18 -2.4597196556933114E-17
6.4372932348162975E-19 -5.033520172428588E-19 2.435490953940265E-17
-6.968703838822414E-18 -1.4791851542802145E-17 -4.8191124489711415E-17
-3.1295054986963717E-18 -3.982858105830577E-18 -7.383311186926517E-17
6.123915859750583E-18 1.2823386884847707E-17 4.927314329312069E-17
1.0818270557054564E-16 -3.646183221899531E-16 -0.17924631924487966
4.420104950445495E-18 1.1108127532201064E-16 -0.17924631924486986
-1.385491130079369E-18 9.635074221272796E-19 -4.443635227418211E-17
1.5547794223628764E-16 -4.556011523317249E-16 -0.3667934346520335
-7.785796156624745E-20 1.1114980987239887E-21 7.155390047092574E-18
6.149407332841813E-17 6.045652652062488E-17 -0.36679343465202136
1.1193143351271554E-18 1.5024165217837663E-18 -4.4423544623167647E-17
3.624344793460294E-18 3.2015649241686225E-18 7.246491382877062E-17
-4.188792726433983E-16 0.2886751345948131 3.497202527569243E-15
-2.03485454614922E-16 0.2886751345948123 -1.1435297153639112E-14
-3.537527903377281E-16 0.28867513459481325 7.382983113757291E-15
5.334603575640233E-18 1.819920907720661E-17 6.53765963388012E-17
-6.949557916364262E-19 -5.167529221926466E-18 -2.440236209941025E-17
-4.2655967110073276E-19 1.3448860745800597E-19 -1.0753058201174548E-17
1.8447384974463598E-18 -1.069172507550569E-18 6.114611272101371E-17
3.3676844633013193E-16 -1.5511317263641213E-15 -5.108368030740601E-14
-6.214568237441494E-18 -1.4700498749927554E-17 -6.747828598243039E-17
S:
[2.175327747161075,2.0,1.7320508075688772]
U:
[-0.6279630301995482,-2.6145752229922437E-14,-7.023941705797031E-13]
[9.090080155537957E-17,-3.004872871280051E-16,-0.310463732001837]
[-0.6279630301995482,-2.6145752229922437E-14,-7.023941705797031E-13]
[-0.45970084338099987,6.866729407306593E-14,1.918315526558137E-12]
[-5.744539480516396E-16,0.5773502691896258,4.134369775724663E-15]
[-6.043926468939928E-16,0.5773502691896257,1.890913308277171E-15]
[-5.053759270872523E-16,0.5773502691896253,-6.730369402342474E-15]
[6.640182408247018E-17,-2.5276391996225446E-16,-0.635304864700003]
[2.9545278653312463E-16,-1.5404852486368089E-15,-5.901250159517438E-14]
[1.0960628788362553E-16,-5.107927593884293E-16,-0.63530486470001]
[4.320104653891542E-17,-6.263748837202324E-17,-0.31046373200183136]
Conclusion:
The output confirms the shapes: V (the right singular matrix) has shape (number of columns of the original matrix) * k = 30 * 3 and is the 30-row grid printed first; s contains the 3 singular values we requested; and U (the left singular matrix) has shape (number of rows of the original matrix) * k = 11 * 3 and is the 11 vectors printed last.
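As a quick sanity check in plain Scala: the columns of the singular-vector matrices are unit-length. The three values below are the only nonzero entries in the first column of the 30 * 3 grid printed above (the remaining entries are on the order of 1e-13, i.e. numerical noise):

```scala
object SingularVectorCheck {
  // The only nonzero entries in the first column of the 30 x 3
  // singular-vector grid printed above; every other entry in that
  // column is ~1e-13 numerical noise.
  val col1: Array[Double] =
    Array(-0.7886751345948209, -0.21132486540521578, -0.5773502691896046)

  // Euclidean length of a vector.
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  def main(args: Array[String]): Unit =
    println(norm(col1)) // ~1.0: singular-vector columns are unit-length
}
```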
(2) PCA example
// Reuse the dataset and RowMatrix from above, keeping 3 principal components
val pca: Matrix = matrix.computePrincipalComponents(3)
// Print the principal components for inspection
println(pca)
println("-----------")
// The PCA projection of the original matrix is a matrix multiplication
matrix.multiply(pca).rows.foreach(println)
Principal components:
-0.6601981677355941 0.23112169419651413 -1.4813704636313606E-16
0.0915385246398685 -0.38030884315531277 -0.5773502691896251
0.09153852463986839 -0.380308843155313 0.5773502691896263
0.45011783051752163 0.5937193456192285 5.872703754440493E-16
0.027003287938335827 -0.06422335350511683 -1.1827701304282196E-15
-0.17282560510216705 0.053444551301783064 -1.360853285667051E-15
0.0 0.0 0.0
-0.4873725626334274 0.17767714289473116 1.3861186197789771E-15
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
0.04576926231993421 -0.19015442157765644 -0.2886751345948127
0.04576926231993421 -0.19015442157765644 -0.2886751345948127
0.0 0.0 0.0
0.0457692623199342 -0.1901544215776565 0.2886751345948133
0.0 0.0 0.0
0.045769262319934195 -0.19015442157765644 0.2886751345948132
0.0 0.0 0.0
0.0 0.0 0.0
0.1500392768391739 0.19790644853974287 9.398634789071044E-17
0.1500392768391739 0.19790644853974287 9.398634789071044E-17
0.1500392768391739 0.19790644853974287 1.0786413569852489E-16
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
0.027003287938335827 -0.06422335350511683 -1.2937924328907353E-15
0.0 0.0 0.0
Projected rows:
[0.13730778695980272,-0.5704632647329693,-0.8660254037844377]
[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15]
[-0.8330237728377612,0.2845662454982972,-1.508990332030187E-15]
[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15]
[0.6001571073566956,0.7916257941589713,6.951345111425742E-16]
[0.6001571073566956,0.7916257941589713,6.812567233347597E-16]
[0.6001571073566956,0.7916257941589713,6.812567233347597E-16]
[0.13730778695980259,-0.5704632647329695,0.8660254037844395]
[0.13730778695980259,-0.5704632647329695,0.8660254037844396]
[0.05400657587667165,-0.12844670701023367,-2.476562563318955E-15]
[0.13730778695980272,-0.5704632647329693,-0.8660254037844377]
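The multiply call above is nothing more than a row-by-row matrix product; here is a plain-Scala sketch of the same projection (a hypothetical helper, not Spark's actual implementation):

```scala
object PcaProject {
  // Project each n-dimensional row onto the n x k principal-components
  // matrix, producing one k-dimensional row per input row (this is
  // what RowMatrix.multiply computes, distributed over the rows).
  def project(rows: Array[Array[Double]],
              components: Array[Array[Double]]): Array[Array[Double]] =
    rows.map { row =>
      Array.tabulate(components(0).length) { j =>
        row.indices.map(i => row(i) * components(i)(j)).sum
      }
    }
}
```

Each 30-dimensional row times the 30 * 3 components matrix yields a 3-dimensional projected row, which is exactly the shape of the output printed above.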
// There is another way to perform the PCA projection, via mllib.feature.PCA
import org.apache.spark.mllib.feature.{PCA, PCAModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

val source: Array[LabeledPoint] = data.collect().map { (row: Row) =>
  val doubles: Array[Double] =
    row.toString.replace("[", "").replaceAll("\\]", "").split(",").map(_.toDouble)
  LabeledPoint(if (row(0).toString.toDouble > 1.0) 0.0 else 1.0,
    Vectors.dense(doubles.slice(5, doubles.length)))
}.take(11)
val model: PCAModel = new PCA(3)
  .fit(spark.sparkContext.parallelize(source.map(_.features)))
source
  .map(p => p.copy(features = model.transform(p.features)))
  .foreach(println)
// The results essentially match the projection obtained above
(1.0,[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15])
(0.0,[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15])
(0.0,[0.6001571073566956,0.7916257941589713,6.812567233347597E-16])
(0.0,[0.13730778695980259,-0.5704632647329695,0.8660254037844395])
(0.0,[0.13730778695980259,-0.5704632647329695,0.8660254037844396])
(0.0,[0.13730778695980272,-0.5704632647329693,-0.8660254037844377])
(0.0,[-0.8330237728377612,0.2845662454982972,-1.508990332030187E-15])
(0.0,[0.6001571073566956,0.7916257941589713,6.951345111425742E-16])
(0.0,[0.6001571073566956,0.7916257941589713,6.812567233347597E-16])
(0.0,[0.05400657587667165,-0.12844670701023367,-2.476562563318955E-15])
(0.0,[0.13730778695980272,-0.5704632647329693,-0.8660254037844377])
Conclusion:
In the principal components matrix, each column is a principal component and each row corresponds to one of the original features; the projection maps the original 30-dimensional data into a 3-dimensional space. Note that MLlib's PCA can handle input of at most 65,535 dimensions.