1. Types of dimensionality-reduction methods

MLlib provides two closely related dimensionality-reduction models: principal component analysis (PCA) and singular value decomposition (SVD).

(1) Principal component analysis

Principal component analysis (PCA) is a statistical method that applies a rotation to the data. In essence it
	performs a change of basis in a linear space so that the variance of the data projected onto a new set of
	"axes" is maximized; the axes with very small variance after the transform are then discarded, and the
	remaining new axes are called principal components. They represent the properties of the original data as
	faithfully as possible in a lower-dimensional subspace. PCA is widely used in statistics and machine learning
	and is one of the most common dimensionality-reduction methods. It has several concrete implementations: the
	transform can be computed from the covariance matrix, or even via the SVD described below.
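As a concrete illustration of the covariance-matrix route just mentioned, here is a minimal PCA sketch in plain Scala (no Spark), using a made-up 2-feature dataset; the closed-form 2x2 eigendecomposition is only practical at this toy size, but it shows the "rotate to the direction of maximum variance" idea directly.

```scala
// Toy 2-feature dataset (made-up values, positively correlated).
val data = Array(
  Array(2.5, 2.4), Array(0.5, 0.7), Array(2.2, 2.9),
  Array(1.9, 2.2), Array(3.1, 3.0), Array(2.3, 2.7)
)
val n = data.length

// Center each feature: PCA works on the covariance of centered data.
val means = Array(0, 1).map(j => data.map(_(j)).sum / n)
val centered = data.map(r => Array(r(0) - means(0), r(1) - means(1)))

// Entries of the 2x2 covariance matrix.
def cov(a: Int, b: Int): Double =
  centered.map(r => r(a) * r(b)).sum / (n - 1)
val (cxx, cxy, cyy) = (cov(0, 0), cov(0, 1), cov(1, 1))

// Largest eigenvalue of [[cxx, cxy], [cxy, cyy]] via the quadratic formula;
// its eigenvector is the first principal component.
val tr = cxx + cyy
val det = cxx * cyy - cxy * cxy
val lambda1 = (tr + math.sqrt(tr * tr - 4 * det)) / 2
// (cxx - lambda1) * x + cxy * y = 0 is solved by (cxy, lambda1 - cxx).
val raw = Array(cxy, lambda1 - cxx)
val norm = math.sqrt(raw.map(v => v * v).sum)
val pc1 = raw.map(_ / norm)

// Project every centered sample onto the first component: 2-D -> 1-D.
val projected = centered.map(r => r(0) * pc1(0) + r(1) * pc1(1))
println(pc1.mkString(", "))
```

The discarded second component carries the residual (small) variance; keeping only `pc1` is exactly the "crop the low-variance axes" step described above.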

(2) Singular value decomposition

SVD factorizes an m * n matrix X into three matrices:
		U: an m * m matrix, called the left singular matrix
		S: an m * n diagonal matrix whose diagonal entries are the singular values
		V^T: an n * n matrix, the transpose of the right singular matrix V
	
		Written as a formula:
				X = U * S * V^T
		so multiplying U, S, and V^T back together recovers the original matrix
	
		At this point the data has not actually been reduced. In practice we keep only the top K singular values,
	which already carry most of the "singular energy" (i.e. most of the information in the data); the remaining
	singular values are discarded. Keeping only the top K singular values, X is rebuilt as the approximation:

				X ≈ U_k * S_k * V_k^T

		where U_k is m * k, S_k is k * k, and V_k is n * k.
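The truncated reconstruction can be checked numerically. The sketch below (plain Scala, no Spark) builds a 2x2 matrix from known orthonormal factors, an assumption made purely for illustration, and shows that keeping only the top singular value already reconstructs X up to a Frobenius error equal to the discarded singular value.

```scala
// U and V are 2x2 rotation matrices, so their columns are orthonormal.
def rotation(t: Double): Array[Array[Double]] = Array(
  Array(math.cos(t), -math.sin(t)),
  Array(math.sin(t),  math.cos(t))
)
val u = rotation(0.3)
val v = rotation(0.7)
val s = Array(5.0, 0.1) // singular values, in decreasing order

// Rebuild U * S * V^T keeping only the first k singular values.
def rebuild(k: Int): Array[Array[Double]] =
  Array.tabulate(2, 2) { (i, j) =>
    (0 until k).map(r => u(i)(r) * s(r) * v(j)(r)).sum
  }

val x     = rebuild(2) // the full matrix X
val rank1 = rebuild(1) // keep only the dominant singular value

// Frobenius error of the rank-1 approximation.
val err = math.sqrt(
  (for (i <- 0 until 2; j <- 0 until 2)
    yield math.pow(x(i)(j) - rank1(i)(j), 2)).sum
)
println(err) // close to the discarded singular value 0.1
```

Because the dropped term is u_2 * s_2 * v_2^T with unit-norm u_2 and v_2, the error is exactly s_2 = 0.1: the small singular values really do carry little of the matrix's content.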

2. Code examples

In Spark, the SVD and PCA implementations are provided on the row matrix (RowMatrix) and indexed row matrix (IndexedRowMatrix) classes.

(1) SVD example

The data set used here is a hand-prepared array of vectors:
[1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
[0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0]
[0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
// Parallelize the data into an RDD and wrap it in a row matrix
val matrix = new RowMatrix(spark.sparkContext.parallelize(source, 2))
// computeSVD performs the decomposition, keeping the top 3 singular values;
// the second argument controls whether U is computed (default: false)
val SVD: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(3, true)
// Print V, s, and U for inspection
println(SVD.V)
println("--------------")
println(SVD.s)
println("--------------")
SVD.U.rows.foreach(println)
V:
-0.7886751345948209      2.8310687127941492E-14   7.019940184704865E-13    
8.955633029205415E-17    -2.3635625206605712E-16  -0.3584926384897496      
8.295165706028774E-17    -5.659843664451338E-16   -0.733586869304055       
-8.958728223252302E-16   0.8660254037844384       -2.220446049250313E-16   
3.0593819819143217E-16   -1.5298387709094965E-15  -5.112897073717542E-14   
-0.21132486540521578     1.0902390101819037E-13   2.6206259384764508E-12   
-7.967072787865741E-19   -5.732309946204197E-18   -7.238036108091668E-17   
-0.5773502691896046      -8.060219158778636E-14   -1.918576408854733E-12   
-1.3693123044784444E-18  -4.084351409232032E-18   -2.4597196556933114E-17  
6.4372932348162975E-19   -5.033520172428588E-19   2.435490953940265E-17    
-6.968703838822414E-18   -1.4791851542802145E-17  -4.8191124489711415E-17  
-3.1295054986963717E-18  -3.982858105830577E-18   -7.383311186926517E-17   
6.123915859750583E-18    1.2823386884847707E-17   4.927314329312069E-17    
1.0818270557054564E-16   -3.646183221899531E-16   -0.17924631924487966     
4.420104950445495E-18    1.1108127532201064E-16   -0.17924631924486986     
-1.385491130079369E-18   9.635074221272796E-19    -4.443635227418211E-17   
1.5547794223628764E-16   -4.556011523317249E-16   -0.3667934346520335      
-7.785796156624745E-20   1.1114980987239887E-21   7.155390047092574E-18    
6.149407332841813E-17    6.045652652062488E-17    -0.36679343465202136     
1.1193143351271554E-18   1.5024165217837663E-18   -4.4423544623167647E-17  
3.624344793460294E-18    3.2015649241686225E-18   7.246491382877062E-17    
-4.188792726433983E-16   0.2886751345948131       3.497202527569243E-15    
-2.03485454614922E-16    0.2886751345948123       -1.1435297153639112E-14  
-3.537527903377281E-16   0.28867513459481325      7.382983113757291E-15    
5.334603575640233E-18    1.819920907720661E-17    6.53765963388012E-17     
-6.949557916364262E-19   -5.167529221926466E-18   -2.440236209941025E-17   
-4.2655967110073276E-19  1.3448860745800597E-19   -1.0753058201174548E-17  
1.8447384974463598E-18   -1.069172507550569E-18   6.114611272101371E-17    
3.3676844633013193E-16   -1.5511317263641213E-15  -5.108368030740601E-14   
-6.214568237441494E-18   -1.4700498749927554E-17  -6.747828598243039E-17
S:
[2.175327747161075,2.0,1.7320508075688772]
U:
[-0.6279630301995482,-2.6145752229922437E-14,-7.023941705797031E-13]
[9.090080155537957E-17,-3.004872871280051E-16,-0.310463732001837]
[-0.6279630301995482,-2.6145752229922437E-14,-7.023941705797031E-13]
[-0.45970084338099987,6.866729407306593E-14,1.918315526558137E-12]
[-5.744539480516396E-16,0.5773502691896258,4.134369775724663E-15]
[-6.043926468939928E-16,0.5773502691896257,1.890913308277171E-15]
[-5.053759270872523E-16,0.5773502691896253,-6.730369402342474E-15]
[6.640182408247018E-17,-2.5276391996225446E-16,-0.635304864700003]
[2.9545278653312463E-16,-1.5404852486368089E-15,-5.901250159517438E-14]
[1.0960628788362553E-16,-5.107927593884293E-16,-0.63530486470001]
[4.320104653891542E-17,-6.263748837202324E-17,-0.31046373200183136]

Conclusion:

The output confirms what was said above: U (the left singular matrix) is (number of rows of the original
matrix) * K, s holds the three singular values we requested, and V (the right singular matrix) is
(number of columns of the original matrix) * K.

(2) PCA example

// Reuse the data set and row matrix from above, keeping three principal components
val PCA: Matrix = matrix.computePrincipalComponents(3)
// Print the principal-components matrix for inspection
println(PCA)
println("-----------")
// The PCA transform of the original matrix is a simple matrix product
matrix.multiply(PCA).rows.foreach(println)
PCA:
-0.6601981677355941   0.23112169419651413   -1.4813704636313606E-16  
0.0915385246398685    -0.38030884315531277  -0.5773502691896251      
0.09153852463986839   -0.380308843155313    0.5773502691896263       
0.45011783051752163   0.5937193456192285    5.872703754440493E-16    
0.027003287938335827  -0.06422335350511683  -1.1827701304282196E-15  
-0.17282560510216705  0.053444551301783064  -1.360853285667051E-15   
0.0                   0.0                   0.0                      
-0.4873725626334274   0.17767714289473116   1.3861186197789771E-15   
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.04576926231993421   -0.19015442157765644  -0.2886751345948127      
0.04576926231993421   -0.19015442157765644  -0.2886751345948127      
0.0                   0.0                   0.0                      
0.0457692623199342    -0.1901544215776565   0.2886751345948133       
0.0                   0.0                   0.0                      
0.045769262319934195  -0.19015442157765644  0.2886751345948132       
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.1500392768391739    0.19790644853974287   9.398634789071044E-17    
0.1500392768391739    0.19790644853974287   9.398634789071044E-17    
0.1500392768391739    0.19790644853974287   1.0786413569852489E-16   
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.0                   0.0                   0.0                      
0.027003287938335827  -0.06422335350511683  -1.2937924328907353E-15  
0.0                   0.0                   0.0
[0.13730778695980272,-0.5704632647329693,-0.8660254037844377]
[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15]
[-0.8330237728377612,0.2845662454982972,-1.508990332030187E-15]
[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15]
[0.6001571073566956,0.7916257941589713,6.951345111425742E-16]
[0.6001571073566956,0.7916257941589713,6.812567233347597E-16]
[0.6001571073566956,0.7916257941589713,6.812567233347597E-16]
[0.13730778695980259,-0.5704632647329695,0.8660254037844395]
[0.13730778695980259,-0.5704632647329695,0.8660254037844396]
[0.05400657587667165,-0.12844670701023367,-2.476562563318955E-15]
[0.13730778695980272,-0.5704632647329693,-0.8660254037844377]
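The projection performed by matrix.multiply above is an ordinary matrix product: each data row is multiplied by an (n_features x k) principal-components matrix. A minimal sketch in plain Scala, with made-up data and made-up (but orthonormal) components:

```scala
// 4 samples with 3 features (made-up values).
val rows = Array(
  Array(1.0, 0.0, 2.0),
  Array(0.0, 1.0, 1.0),
  Array(2.0, 1.0, 0.0),
  Array(1.0, 1.0, 1.0)
)
// 3 x 2 components matrix: each column is one (unit-norm) principal component.
val components = Array(
  Array(0.8, 0.0),
  Array(0.6, 0.0),
  Array(0.0, 1.0)
)
// projected(i)(c) = dot(rows(i), column c of components): 3-D -> 2-D.
val projected = rows.map { r =>
  Array.tabulate(2)(c => r.indices.map(j => r(j) * components(j)(c)).sum)
}
projected.foreach(p => println(p.mkString(", ")))
```

This is exactly why the output above has 3 numbers per row: 30 original features times a 30 x 3 components matrix yields 3 coordinates per sample.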
// There is another way to perform the PCA projection:
// rebuild the data set as LabeledPoints — parse each Row back into doubles,
// derive a label from the first field, and keep the trailing features
val source: Array[LabeledPoint] = data.collect().map(
  (row: Row) => {
    val doubles: Array[Double] = row.toString().replace("[", "").replaceAll("\\]", "").split(",").map((_: String).toDouble)
    LabeledPoint(if (row(0).toString.toDouble > 1.0) 0.toDouble else 1.toDouble, Vectors.dense(doubles.slice(5, doubles.length)))
  }
).take(11)

// Fit a PCAModel with 3 components on the feature vectors
val model: PCAModel = new PCA(3)
  .fit(spark.sparkContext.parallelize(source.map((_: LabeledPoint).features)))

// Map each point's features into the 3-dimensional principal-component space
source
  .map((p: LabeledPoint) => p.copy(features = model.transform(p.features)))
  .foreach(println)
// The results are essentially the same as above
(1.0,[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15])
(0.0,[-1.1475707303690215,0.4087988370912453,1.237981573415841E-15])
(0.0,[0.6001571073566956,0.7916257941589713,6.812567233347597E-16])
(0.0,[0.13730778695980259,-0.5704632647329695,0.8660254037844395])
(0.0,[0.13730778695980259,-0.5704632647329695,0.8660254037844396])
(0.0,[0.13730778695980272,-0.5704632647329693,-0.8660254037844377])
(0.0,[-0.8330237728377612,0.2845662454982972,-1.508990332030187E-15])
(0.0,[0.6001571073566956,0.7916257941589713,6.951345111425742E-16])
(0.0,[0.6001571073566956,0.7916257941589713,6.812567233347597E-16])
(0.0,[0.05400657587667165,-0.12844670701023367,-2.476562563318955E-15])
(0.0,[0.13730778695980272,-0.5704632647329693,-0.8660254037844377])

Conclusion:

In the principal-components matrix, each column is one principal component and each row corresponds to one of
	the original features; in the end the high-dimensional original matrix is projected into a 3-dimensional space.
	Note that MLlib's PCA implementation can handle at most 65,535-dimensional input.