Installing IDEA and Packaging - Common Issues
https://yq.aliyun.com/articles/60346?spm=5176.8251999.569296.68
Version compatibility matters: after changing versions, make sure new projects are created with a matching version.
See https://www.zhihu.com/question/34099679
1. Install the Scala plugin
2. Create a new project: select Scala, then the JDK and the Scala SDK
3. Project Structure (shortcut F4) - design the package structure
4. Libraries - add the Spark jars
Running locally in IDEA
- Compile the code: Build -> Make Project
- Set run parameters: Run -> Edit Configurations (Application)
- Run -> Run, or Alt+Shift+F10
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}
/**
* Created by yuyin on 17-1-3.
* Runs locally
*/
object SparkWord2 {
def main(args: Array[String]) {
//The input file can be a local Linux file or come from another source, e.g. HDFS
if (args.length == 0) {
System.err.println("Usage: SparkWordCount <inputfile>")
System.exit(1)
}
//Runs with local threads; the thread count can be specified,
//e.g. .setMaster("local[2]") runs with two threads.
//Below runs single-threaded.
val conf = new SparkConf().setAppName("SparkWord2").setMaster("local")
val sc = new SparkContext(conf)
//Word-count-style operation: count the lines containing "ex"
val count=sc.textFile(args(0)).filter(line => line.contains("ex")).count()
//Print the result
println("count="+count)
sc.stop()
}
}
Packaging and Submitting to a Cluster
- Click the project, press F4 to open Project Structure, and select Artifacts
- Choose JAR -> From modules with dependencies
- For the main class, select SparkWord3
- After clicking OK, delete spark-assembly-1.5.0-hadoop2.4.0.jar and the other dependency jars to shrink the jar
- Confirm, then click Build -> Build Artifacts
- The generated jar is saved under ~/IdeaProjects/Spark02/out/artifacts/Spark02_jar
- Submit to the cluster by running:
./spark-submit --master spark://sparkmaster:7077 --class SparkWord3 --executor-memory 1g /home/yuyin/IdeaProjects/Spark02/out/artifacts/Spark02_jar/Spark02.jar hdfs://ns1/README.md hdfs://ns1/SparkWordCountResult
Code
import org.apache.spark.{SparkConf, SparkContext}
/**
* Created by yuyin on 17-1-3.
* Cluster submission
*/
object SparkWord3 {
def main(args: Array[String]) {
//The input file can be a local Linux file or come from another source, e.g. HDFS
if (args.length == 0) {
System.err.println("Usage: SparkWordCount <inputfile> <outputfile>")
System.exit(1)
}
val conf = new SparkConf().setAppName("SparkWordCount")
val sc = new SparkContext(conf)
//rdd2 holds all lines containing "Spark"
val rdd2=sc.textFile(args(0)).filter(line => line.contains("Spark"))
//Save the result; in this example to HDFS
rdd2.saveAsTextFile(args(1))
sc.stop()
}
}
Common RDD Transformations
union
union merges the elements of two RDDs, like a set union (but duplicates are kept)
val rdd1=sc.parallelize(1 to 5)
val rdd2=sc.parallelize(4 to 8)
rdd1.union(rdd2).collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
intersection
rdd1.intersection(rdd2).collect
res1: Array[Int] = Array(4, 5)
distinct removes duplicate elements
rdd1.union(rdd2).distinct.collect
res2: Array[Int] = Array(8, 1, 2, 3, 4, 5, 6, 7)
groupByKey([numTasks]) groups values by key
Takes (K, V) pairs and returns (K, Iterable[V]); the optional numTasks sets the number of tasks
rdd1.union(rdd2).map((_,1)).groupByKey.collect
res3: Array[(Int, Iterable[Int])] = Array((8,CompactBuffer(1)), (1,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)), (4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)), (6,CompactBuffer(1)), (7,CompactBuffer(1)))
reduceByKey(func, [numTasks]) aggregates by key
reduceByKey takes (K, V) pairs and returns (K, V) pairs, where each V is the aggregate of all values for that key
rdd1.union(rdd2).map((_,1)).reduceByKey(_+_).collect
res4: Array[(Int, Int)] = Array((8,1), (1,1), (2,1), (3,1), (4,2), (5,2), (6,1), (7,1))
sortByKey([ascending], [numTasks]) sorts by key
Sorts the dataset by key: true for ascending, false for descending
var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(7,9),(2,4)))
data.sortByKey(true).collect
res5: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (2,4), (7,9))
data.sortByKey(false).collect
res7: Array[(Int, Int)] = Array((7,9), (2,3), (2,4), (1,3), (1,2), (1,4))
join(otherDataset, [numTasks])
For RDDs of types (K, V) and (K, W), join returns (K, (V, W)). There are three join variants:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
val rdd1=sc.parallelize(Array((1,2),(1,3)))
val rdd2=sc.parallelize(Array((1,3)))
rdd1.join(rdd2).collect
res10: Array[(Int, (Int, Int))] = Array((1,(2,3)), (1,(3,3)))
def leftOuterJoin[W](
other: RDD[(K, W)],
partitioner: Partitioner): RDD[(K, (V, Option[W]))]
rdd1.leftOuterJoin(rdd2).collect
res12: Array[(Int, (Int, Option[Int]))] = Array((1,(2,Some(3))), (1,(3,Some(3))))
def rightOuterJoin[W](
other: RDD[(K, W)],
partitioner: Partitioner): RDD[(K, (Option[V], W))]
rdd1.rightOuterJoin(rdd2).collect
res13: Array[(Int, (Option[Int], Int))] = Array((1,(Some(2),3)), (1,(Some(3),3)))
cogroup(otherDataset, [numTasks])
For input RDDs of types (K, V) and (K, W), returns (K, (Iterable[V], Iterable[W])). Equivalent to groupWith.
rdd1.cogroup(rdd2).collect
res14: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
rdd1.groupWith(rdd2).collect
res15: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
cartesian(otherDataset) Cartesian product
Computes the Cartesian product of two RDDs
val rdd1=sc.parallelize(Array(1,2,3,4))
val rdd2=sc.parallelize(Array(5,6))
rdd1.cartesian(rdd2).collect
res16: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (3,6), (4,5), (4,6))
coalesce(numPartitions) reduces partitions
Reduces the RDD to the given number of partitions, numPartitions
val rdd1=sc.parallelize(1 to 100,3)
val rdd2=rdd1.coalesce(2)
repartition(numPartitions) does the same as coalesce; in fact it just calls coalesce with shuffle = true, which can incur heavy network traffic. A small sketch follows.
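A quick sketch, reusing the rdd1 and rdd2 defined above:
//repartition is coalesce with shuffle = true, so unlike plain coalesce
//it can also increase the number of partitions
val rdd3 = rdd1.repartition(6)
println(rdd3.partitions.size) // 6
println(rdd2.partitions.size) // 2, from the coalesce above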
repartitionAndSortWithinPartitions
repartitionAndSortWithinPartitions is a variant of repartition: it repartitions by the given partitioner and sorts records within each partition, which is more efficient than repartitioning and then sorting.
val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
import org.apache.spark.HashPartitioner
data.repartitionAndSortWithinPartitions(new HashPartitioner(3)).collect
res3: Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (2,3), (2,4), (5,4))
RDD actions
reduce() aggregation
reduce aggregates the elements of the RDD with the given binary (associative) operator
val data=sc.parallelize(1 to 3)
data.reduce((x,y)=>x+y)
res20: Int = 6
data.reduce(_+_)
res21: Int = 6
count()
data.count
res23: Long = 3
first() returns the first element
data.first
res24: Int = 1
take(n)
data.take(2)
res25: Array[Int] = Array(1, 2)
takeSample(withReplacement, num, [seed]) sampling
Samples the RDD, with or without replacement
val data=sc.parallelize(1 to 9)
data.takeSample(false,5)
res28: Array[Int] = Array(2, 6, 1, 5, 7)
data.takeSample(true,5,2)
res32: Array[Int] = Array(8, 8, 6, 8, 9)
takeOrdered(n, [ordering]) returns the n smallest elements
Uses an implicit ordering by default; a custom ordering can be supplied, as sketched after the example below
sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(2)
res34: Array[Int] = Array(2, 3)
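An explicit Ordering can also be passed; for example, reversing it returns the largest elements instead (a quick sketch):
sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(2)(Ordering[Int].reverse)
// res: Array[Int] = Array(12, 10)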
saveAsTextFile(path)
Saves the RDD as text files: to the local filesystem in local mode, or to HDFS when running on a Hadoop cluster. A minimal sketch follows.
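A minimal sketch (the output path is illustrative):
val nums = sc.parallelize(1 to 9)
//Writes one part-NNNNN file per partition under the given directory
nums.saveAsTextFile("/tmp/saveAsTextFileDemo")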
countByKey()
Counts the RDD's elements per key
val data = sc.parallelize(List((1,3),(1,2),(5,4),(1, 4),(2,3),(2,4)),3)
data.countByKey()
res37: scala.collection.Map[Int,Long] = Map(1 -> 3, 5 -> 1, 2 -> 2)
foreach(func)
foreach applies the function to every element of the RDD
val data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(2,4)))
data.foreach(x=>println("key="+x._1+",value="+x._2))
key=1,value=2
key=1,value=4
key=2,value=3
key=2,value=4
key=1,value=3
See the API docs: http://spark.apache.org/docs/latest/api/scala/index.html
spark-submit parameters
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Local mode
./spark-submit --master local \
--class SparkWordCount \
--executor-memory 1g \
/root/IdeaProjects/SparkWordCount/out/artifacts/SparkWordCount_jar/SparkWordCount.jar \
file:/hadoopLearning/spark-1.5.0-bin-hadoop2.4/README.md \
file:/SparkWordCountResult
Standalone cluster mode
./spark-submit --master spark://sparkmaster:7077 \
--class SparkWordCount --executor-memory 1g \
/root/IdeaProjects/SparkWordCount/out/artifacts/SparkWordCount_jar/SparkWordCount.jar \
file:/hadoopLearning/spark-1.5.0-bin-hadoop2.4/README.md \
file:/SparkWordCountResult2
YARN mode
./spark-submit --master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1g \
/root/IdeaProjects/SparkWordCount/out/artifacts/SparkWordCount_jar/SparkWordCount.jar
Spark execution process
Reference:
https://yq.aliyun.com/articles/60342?spm=5176.100239.blogcont60343.9.hHw25F
Spark SQL and DataFrames
For the internals of Spark SQL, see:
DataFrame methods vs. temp-table SQL
Reading JSON data
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
//Create the RDD data by hand
case class Person(name:String,age:Int)
val data = sc.parallelize(List(('a',18),('b',21)))
//data.map(x=>Person(x._1.toString,x._2.toInt)).collect
val df=data.map(x=>Person(x._1.toString,x._2.toInt)).toDF()
//Load data from JSON
val df = sqlContext.read.json("/data/people.json")
//Inspect the DataFrame schema
df.printSchema()
//Return all values of a column
df.select("name").show()
//Filter rows
df.filter(df("age") > 19).show()
//Group by age
df.groupBy("age").count().show()
//Register as a temp table
df.registerTempTable("people")
//Run Spark SQL
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 29")
//Format and print the results
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
//Show the table
df.show()
//Show the first 2 rows
df.show(2)
//Print the schema tree
df.printSchema()
//Select a column
df.select("name").show()
// Select everything, but with age incremented by 1
df.select(df("name"), df("age") + 1).show()
//Filter by conditions (could also register a table and query with SQL)
df.where('age >= 10).where('age <= 39).select('name).show()
df.where("age >= 10").where("age <= 39").select("name").show()
//Filter rows by a condition
df.filter(df("age") > 21).show()
// Count rows per group (could also register a table and query with SQL)
df.groupBy("age").count().show()
// Left join (note the triple equals!); df2 is assumed to be another DataFrame with a name column
df.join(df2, df("name") === df2("name"), "left").show()
Example: DataFrame methods vs. a temp table
case class Person(author:String,commit:Int)
val data = sc.parallelize(List(('a',18),('b',4),('b',1),('b',2),('a',20),('b',10)))
val df=data.map(x=>Person(x._1.toString,x._2.toInt)).toDF()
//Show the first two rows
df.show(2)
//Count total rows
df.count
//Sort by commit count, descending
df.groupBy("author").count.sort($"count".desc).show
Using a DataFrame registered as a temp table
//Register the DataFrame as table commitlog (registerTempTable returns Unit)
df.registerTempTable("commitlog")
//Show the first 2 rows
sqlContext.sql("SELECT * FROM commitlog").show(2)
//Count total rows
sqlContext.sql("SELECT count(*) as TotalCommitNumber FROM commitlog").show
//Sort by commit count, descending
sqlContext.sql("SELECT author,count(*) as CountNumber FROM commitlog GROUP BY author ORDER BY CountNumber DESC").show
Spark SQL case study
Date.txt format:
//Date.txt assigns each date its month, week, quarter, etc.
//date, yearmonth, year, month, day, weekday, week number, quarter, ten-day period, half-month
2014-12-24,201412,2014,12,24,3,52,4,36,24
Stock.txt format:
//Stock.txt defines the order header
//order number, location, transaction date
ZYSL00014630,ZY,2009-5-7
StockDetail.txt format:
//order number, row number, item, quantity, price, amount
HMJSL00006421,9,QY524266010101,1,80,80
Case study: the number of orders and total sales per year, across all orders
//Define case classes used later for the DataFrame schemas
//For Date.txt
case class DateInfo(dateID:String,theyearmonth :String,theyear:String,themonth:String,thedate :String,theweek:String,theweeks:String,thequot :String,thetenday:String,thehalfmonth:String)
//For Stock.txt
case class StockInfo(ordernumber:String,locationid :String,dateID:String)
//For StockDetail.txt
case class StockDetailInfo(ordernumber:String,rownum :Int,itemid:String,qty:Int,price:Double,amount:Double)
//Load the data and convert it to a DataFrame
val DateInfoDF = sc.textFile("/data/Date.txt").map(_.split(",")).map(d => DateInfo(d(0), d(1),d(2),d(3),d(4),d(5),d(6),d(7),d(8),d(9))).toDF()
//Load the data and convert it to a DataFrame
val StockInfoDF= sc.textFile("/data/Stock.txt").map(_.split(",")).map(s => StockInfo(s(0), s(1),s(2))).toDF()
//Load the data and convert it to a DataFrame
val StockDetailInfoDF = sc.textFile("/data/StockDetail.txt").map(_.split(",")).map(s => StockDetailInfo(s(0), s(1).trim.toInt,s(2),s(3).trim.toInt,s(4).trim.toDouble,s(5).trim.toDouble)).toDF()
//Register as temp tables
DateInfoDF.registerTempTable("tblDate")
StockInfoDF.registerTempTable("tblStock")
StockDetailInfoDF.registerTempTable("tblStockDetail")
//Run the SQL
//Number of orders and total sales per year, across all orders
//Join the three tables; count(distinct a.ordernumber) gives the order count, sum(b.amount) the total sales. A DataFrame-API sketch of the same query follows the SQL.
sqlContext.sql("select c.theyear,count(distinct a.ordernumber),sum(b.amount) from tblStock a join tblStockDetail b on a.ordernumber=b.ordernumber join tblDate c on a.dateid=c.dateid group by c.theyear order by c.theyear").collect().foreach(println)
Case study: the sales amount of the largest single order per year:
sqlContext.sql("select c.theyear,max(d.sumofamount) from tblDate c join (select a.dateid,a.ordernumber,sum(b.amount) as sumofamount from tblStock a join tblStockDetail b on a.ordernumber=b.ordernumber group by a.dateid,a.ordernumber ) d on c.dateid=d.dateid group by c.theyear sort by c.theyear").collect().foreach(println)
Spark Streaming
Reference: https://yq.aliyun.com/articles/60316?spm=5176.8251999.569296.76
Word count
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
/**
* Created by yuyin on 17/1/5.
* Argument: /Users/yuyin/Downloads/software/spark/streaming
* After starting, run in the streaming directory: echo "A B C D" >> test12.txt; echo "A B" >> test12.txt
*/
object SparkStreaming {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[4]")
//Process a batch every second
val ssc = new StreamingContext(sparkConf, Seconds(1))
//Watch the local ~/streaming directory
val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))
val wordMap = words.map(x => (x, 1))
val wordCounts=wordMap.reduceByKey(_ + _)
val filteredWordCounts=wordCounts.filter(_._2>1)
val numOfCount=filteredWordCounts.count()
val countByValue=words.countByValue()
val union=words.union(words)
val transform=words.transform(x=>x.map(x=>(x,1)))
//Print the raw lines
lines.print()
// A B C D
// A B
//Print the flatMap result
words.print()
// A
// B
// C
// D
// A
// B
//Print the map result
wordMap.print()
// (A,1)
// (B,1)
// (C,1)
// (D,1)
// (A,1)
// (B,1)
//Print the reduceByKey result
wordCounts.print()
// (D,1)
// (A,2)
// (B,2)
// (C,1)
//Print the filter result
filteredWordCounts.print()
// (A,2)
// (B,2)
//Print the count result
numOfCount.print()
// 2
//Print the countByValue result
countByValue.print()
// (D,1)
// (A,2)
// (B,2)
// (C,1)
//Print the union result
union.print()
// A
// B
// C
// D
// A
// B
// A
// B
// C
// D
// ...
//Print the transform result
transform.print()
// (A,1)
// (B,1)
// (C,1)
// (D,1)
// (A,1)
// (B,1)
ssc.start()
ssc.awaitTermination()
}
}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming._
/**
* Created by yuyin on 17/1/5.
* Arguments: localhost 9999
* Start a netcat server with: nc -lk 9999
* Then type e.g. hello / world
*/
object SparkStreamingWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
System.exit(1)
}
//Function literal: add the current batch's values to the previous state
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.sum
val previousCount = state.getOrElse(0)
Some(currentCount + previousCount)
}
//Input is (K, V, S); the return type is (K, S)
//V holds the values to be summed, S the previous state
val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
}
val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[4]")
//Process a batch every second
val ssc = new StreamingContext(sparkConf, Seconds(1))
//Use the current directory for checkpoints; checkpointing in Spark Streaming is covered later
ssc.checkpoint(".")
//Initial state RDD
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
//Use a socket as the input source; here localhost, port 9999
val lines = ssc.socketTextStream(args(0), args(1).toInt)
//flatMap
val words = lines.flatMap(_.split(" "))
//map
val wordDstream = words.map(x => (x, 1))
//Apply updateStateByKey
val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)
stateDstream.print()
ssc.start()
ssc.awaitTermination()
}
}
Spark Streaming Windows: DStream Window Operations
Each time the window slides, the RDDs falling into it are processed together, producing a windowed DStream. Window operations take two parameters:
(1) window length: the duration of the window (3 in the referenced figure)
(2) sliding interval: how often the window operation fires (2 in the referenced figure)
Both must be integer multiples of the source DStream's batch interval; a basic window() sketch follows the reference below
Reference: https://yq.aliyun.com/articles/60316?spm=5176.8251999.569296.76
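The plain window() transformation illustrates the two parameters directly (a sketch, assuming a words DStream built on a 5-second batch interval as in the examples below):
//A 30-second window sliding every 10 seconds; both are multiples of the 5-second batch interval
val windowedWords = words.window(Seconds(30), Seconds(10))
windowedWords.print()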
reduceByKeyAndWindow
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
/**
* Created by yuyin on 17/1/9.
* 1. Arguments: localhost 9999 30 10 (a 30-second window sliding every 10 seconds)
* 2. Start a netcat server: nc -lk 9999
* Then paste text such as: Spark is a fast and general cluster computing system for Big Data. It provides
*/
object WindowWordCount {
def main(args: Array[String]) {
//Arguments: localhost 9999 30 10
if (args.length != 4) {
System.err.println("Usage: WindowWorldCount <hostname> <port> <windowDuration> <slideDuration>")
System.exit(1)
}
// StreamingExamples.setStreamingLogLevels()
val conf = new SparkConf().setAppName("WindowWordCount").setMaster("local[4]")
val sc = new SparkContext(conf)
// Create the StreamingContext with a 5-second batch interval
val ssc = new StreamingContext(sc, Seconds(5))
//Socket as the data source
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
// Window operation: count the words within each window
val wordCounts = words.map(x => (x , 1)).reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(args(2).toInt), Seconds(args(3).toInt))
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Using countByWindow
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
/**
* Created by yuyin on 17/1/9.
* 1. Arguments: localhost 9999 30 10 (a 30-second window sliding every 10 seconds)
* 2. Start a netcat server: nc -lk 9999
* Then paste text such as: Spark is a fast and general cluster computing system for Big Data. It provides
*/
object WindowWordCount2 {
def main(args: Array[String]) {
if (args.length != 4) {
System.err.println("Usage: WindowWorldCount <hostname> <port> <windowDuration> <slideDuration>")
System.exit(1)
}
// StreamingExamples.setStreamingLogLevels()
val conf = new SparkConf().setAppName("WindowWordCount").setMaster("local[2]")
val sc = new SparkContext(conf)
// Create the StreamingContext
val ssc = new StreamingContext(sc, Seconds(5))
// Use the current directory for checkpoints
ssc.checkpoint(".")
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
//countByWindow counts the number of elements in the DStream over a sliding window.
val countByWindow=words.countByWindow(Seconds(args(2).toInt), Seconds(args(3).toInt))
countByWindow.print()
ssc.start()
ssc.awaitTermination()
}
}
Using reduceByWindow
reduceByWindow aggregates the elements of the source DStream over a sliding window, returning a new DStream of single elements.
//reduceByWindow aggregates over a sliding window; the inverse function enables incremental updates
val reduceByWindow=words.map(x=>1).reduceByWindow(_+_, _-_, Seconds(args(2).toInt), Seconds(args(3).toInt))
The two calls below produce the same result but differ in efficiency; the second is more efficient:
//With a 5-second window computed every second, this recomputes the WordCount of each of the
//past 5 seconds and sums them for every window. This is the cumulative approach.
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1))
//To get the window ending at t+4, take the counts of the window ending at t+3, add the counts
//for (t+3, t+4], and subtract those for (t-2, t-1]. Reusing the middle three seconds' counts makes
//this more efficient. This is the incremental approach (it requires a checkpoint directory).
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(1))
Combining Spark SQL and DataFrames with Spark Streaming
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
/**
* Created by yuyin on 17/1/9.
* 1. Arguments: localhost 9999
* 2. Start a netcat server: nc -lk 9999
*/
object SqlNetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
}
// StreamingExamples.setStreamingLogLevels()
// Create the context with a 2 second batch size
val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create a socket stream on target ip:port and count the
// words in input stream of \n delimited text (eg. generated by 'nc')
// Note that no duplication in storage level only for running locally.
// Replication necessary in distributed scenario for fault tolerance.
//Socket as the data source
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
//words DStream
val words = lines.flatMap(_.split(" "))
// Convert RDDs of the words DStream to DataFrame and run SQL query
//foreachRDD iterates over the RDDs in the DStream
words.foreachRDD((rdd: RDD[String], time: Time) => {
// Get the singleton instance of SQLContext
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
// Convert RDD[String] to RDD[case class] to DataFrame
val wordsDataFrame = rdd.map(w => Record(w)).toDF()
// Register as table
wordsDataFrame.registerTempTable("words")
// Do word count on table using SQL and print it
val wordCountsDataFrame =
sqlContext.sql("select word, count(*) as total from words group by word")
println(s"========= $time =========")
wordCountsDataFrame.show()
})
ssc.start()
ssc.awaitTermination()
}
}
/** Case class for converting RDD to DataFrame */
case class Record(word: String)
/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {
@transient private var instance: SQLContext = _
def getInstance(sparkContext: SparkContext): SQLContext = {
if (instance == null) {
instance = new SQLContext(sparkContext)
}
instance
}
}
Streaming Caching and Checkpointing
Spark Streaming caching
A DStream is a sequence of RDDs and, like any RDD, its streaming data can be persisted in memory with the same persist method; calling it persists all of the DStream's RDDs. This is especially useful for DStreams that are recomputed many times or whose data is reused; a short sketch follows the link below.
Storage-level parameters:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
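A minimal sketch, assuming a lines DStream as in the earlier examples:
import org.apache.spark.storage.StorageLevel
//Persist every RDD generated by this DStream; windowed and stateful
//DStreams are already persisted automatically with a default level
lines.persist(StorageLevel.MEMORY_ONLY_SER)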
Checkpointing
Two kinds of data can be checkpointed:
(1) Metadata checkpointing
Saves the streaming computation's metadata to fault-tolerant storage such as HDFS; it allows recovery when the node running the application driver fails. The metadata includes:
Configuration - the configuration used to create the streaming application
DStream operations - the DStream operations defined in the application
Incomplete batches - jobs queued but not yet finished
(2) Data checkpointing
Saves generated RDDs to reliable external storage. For stateful transformations that combine data across multiple batches this is essential: their RDDs depend on the RDDs of earlier batches, so the dependency chain keeps growing over time. Checkpointing cuts the chain by periodically saving intermediate RDDs to reliable storage, letting recovery resume directly from the checkpoint.
In short, metadata checkpointing is mainly for recovering from driver failures, while data checkpointing is for stateful transformations.
Checkpointing is enabled via:
//checkpointDirectory is the directory where checkpoint files are saved
streamingContext.checkpoint(checkpointDirectory)
Example
On first start, the checkpoint directory is created.
Stop the job manually; on restart it recovers from the checkpoint directory.
import java.io.File
import java.nio.charset.Charset
import com.google.common.io.Files
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}
/**
* Counts words in text encoded with UTF8 received from the network every second.
*
* Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory> <output-file>
* <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
* data. <checkpoint-directory> directory to HDFS-compatible file system which checkpoint data
* <output-file> file to which the word counts will be appended
*
* <checkpoint-directory> and <output-file> must be absolute paths
*
* To run this on your local machine, you need to first run a Netcat server
*
* `$ nc -lk 9999`
*
* and run the example as
*
* `$ ./bin/run-example org.apache.spark.examples.streaming.RecoverableNetworkWordCount \
* localhost 9999 ~/checkpoint/ ~/out`
*
* If the directory ~/checkpoint/ does not exist (e.g. running for the first time), it will create
* a new StreamingContext (will print "Creating new context" to the console). Otherwise, if
* checkpoint data exists in ~/checkpoint/, then it will create StreamingContext from
* the checkpoint data.
*
* Refer to the online documentation for more details.
*/
/**
* Created by yuyin on 17/1/9.
* 1传入的参数为localhost 9999 /Users/yuyin/Downloads/software/scala/checkpoint/ /Users/yuyin/Downloads/software/scala/out
* 2启动netcat server nc -lk 9999
*/
object RecoverableNetworkWordCount {
def createContext(ip: String, port: Int, outputPath: String, checkpointDirectory: String)
: StreamingContext = {
//Printed only when the context is first created; on recovery from the checkpoint this line does not run
println("Creating new context")
val outputFile = new File(outputPath)
if (outputFile.exists()) outputFile.delete()
val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount").setMaster("local[4]")
// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint(checkpointDirectory)
//Socket as the data source
val lines = ssc.socketTextStream(ip, port)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
val counts = "Counts at time " + time + " " + rdd.collect().mkString("[", ", ", "]")
println(counts)
println("Appending to " + outputFile.getAbsolutePath)
Files.append(counts + "\n", outputFile, Charset.defaultCharset())
})
ssc
}
//Extractor that converts a String to an Int
private object IntParam {
def unapply(str: String): Option[Int] = {
try {
Some(str.toInt)
} catch {
case e: NumberFormatException => None
}
}
}
def main(args: Array[String]) {
if (args.length != 4) {
System.err.println("You arguments were " + args.mkString("[", ", ", "]"))
System.err.println(
"""
|Usage: RecoverableNetworkWordCount <hostname> <port> <checkpoint-directory>
| <output-file>. <hostname> and <port> describe the TCP server that Spark
| Streaming would connect to receive data. <checkpoint-directory> directory to
| HDFS-compatible file system which checkpoint data <output-file> file to which the
| word counts will be appended
|
|In local mode, <master> should be 'local[n]' with n > 1
|Both <checkpoint-directory> and <output-file> must be absolute paths
""".stripMargin
)
System.exit(1)
}
val Array(ip, IntParam(port), checkpointDirectory, outputPath) = args
//getOrCreate recreates the StreamingContext from the checkpoint, or creates a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory,
() => {
createContext(ip, port, outputPath, checkpointDirectory)
})
ssc.start()
ssc.awaitTermination()
}
}
Spark Streaming with Kafka
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
//Arguments: sparkmaster:2181 test-consumer-group kafkatopictest 1; word-counts messages received from Kafka
object KafkaWordCount {
def main(args: Array[String]) {
if (args.length < 4) {
System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
System.exit(1)
}
// StreamingExamples.setStreamingLogLevels()
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
//Create the ReceiverInputDStream
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L))
.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Spark MLlib
Dense and sparse vectors and matrices
import org.apache.spark.mllib.linalg.{Vector, Vectors}
//Dense vector: zero values are stored as well
scala> val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
dv: org.apache.spark.mllib.linalg.Vector = [1.0,0.0,3.0]
// Create a sparse vector by giving its size, the indices, and the non-zero values, as arrays
scala> val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
sv1: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])
// Create a sparse vector by giving its size and the (index, value) pairs, as a sequence
scala> val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
sv2: org.apache.spark.mllib.linalg.Vector = (3,[0,2],[1.0,3.0])
//Dense matrix storage
scala> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
//Create a dense matrix (values are column-major)
scala> val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
dm: org.apache.spark.mllib.linalg.Matrix =
1.0 2.0
3.0 4.0
5.0 6.0
//For the following matrix:
1.0 0.0 4.0
0.0 3.0 5.0
2.0 0.0 6.0
Stored as a sparse matrix, the information kept is:
stored values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
row index of each stored value: rowIndices = [0, 2, 1, 0, 1, 2]
column start offsets: colPointers = [0, 2, 3, 6]
This is CSC (compressed sparse column) storage: rowIndices gives each value's row within its column, while colPointers gives each column's starting offset into the value array (column one starts at offset 0 with 1.0, column two at offset 2 with 3.0, column three at offset 3 with 4.0), and the final 6 is the total number of stored elements
http://www.tuicool.com/articles/A3emmqi?spm=5176.100239.blogcont60351.3.QnIY01
scala> val sparseMatrix= Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
sparseMatrix: org.apache.spark.mllib.linalg.Matrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0
Labeled points (feature vectors with class labels)
scala> import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LabeledPoint
// LabeledPoint's first argument is the class label, the second the feature vector
//Below is the dense-vector form
scala> val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
pos: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[1.0,0.0,3.0])
// LabeledPoint in its sparse-vector form
scala> val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
neg: org.apache.spark.mllib.regression.LabeledPoint = (0.0,(3,[0,2],[1.0,3.0]))
In practice the sparse form is the most common, stored and loaded in LIBSVM format: label index1:value1 index2:value2 ...
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "/data/sample_data.txt")
Distributed matrices: RowMatrix and CoordinateMatrix
package cn.ml.datastruct
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
object RowMatrixDemo extends App {
val sparkConf = new SparkConf().setAppName("RowMatrixDemo").setMaster("spark://sparkmaster:7077")
val sc = new SparkContext(sparkConf)
// Create an RDD[Vector]
val rdd1= sc.parallelize(
Array(
Array(1.0,2.0,3.0,4.0),
Array(2.0,3.0,4.0,5.0),
Array(3.0,4.0,5.0,6.0)
)
).map(f => Vectors.dense(f))
//Create the RowMatrix
val rowMatirx = new RowMatrix(rdd1)
//Compute similarities between columns; returns a CoordinateMatrix, which stores values as
//case class MatrixEntry(i: Long, j: Long, value: Double)
var coordinateMatrix:CoordinateMatrix= rowMatirx.columnSimilarities()
//Number of rows and columns of the result
println(coordinateMatrix.numCols())
println(coordinateMatrix.numRows())
//Inspect the result: the column-to-column similarities
//Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry]
//= Array(MatrixEntry(2,3,0.9992204753914715),
//MatrixEntry(0,1,0.9925833339709303),
//MatrixEntry(1,2,0.9979288897338914),
//MatrixEntry(0,3,0.9746318461970762),
//MatrixEntry(1,3,0.9946115458726394),
//MatrixEntry(0,2,0.9827076298239907))
coordinateMatrix.entries.collect().foreach(println)
//Convert to a BlockMatrix (detailed in a later section)
coordinateMatrix.toBlockMatrix()
//Convert to an IndexedRowMatrix (detailed in a later section)
coordinateMatrix.toIndexedRowMatrix()
//Convert back to a RowMatrix
coordinateMatrix.toRowMatrix()
//Column summary statistics
var mss:MultivariateStatisticalSummary=rowMatirx.computeColumnSummaryStatistics()
//Mean of each column, org.apache.spark.mllib.linalg.Vector = [2.0,3.0,4.0,5.0]
mss.mean
// Maximum of each column: org.apache.spark.mllib.linalg.Vector = [3.0,4.0,5.0,6.0]
mss.max
// Minimum of each column: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0]
mss.min
//Non-zero count per column: org.apache.spark.mllib.linalg.Vector = [3.0,3.0,3.0,3.0]
mss.numNonzeros
//L1 norm of each column, ||x||1 = sum(abs(xi));
//org.apache.spark.mllib.linalg.Vector = [6.0,9.0,12.0,15.0]
mss.normL1
//L2 norm of each column, ||x||2 = sqrt(sum(xi.^2));
// org.apache.spark.mllib.linalg.Vector = [3.7416573867739413,5.385164807134504,7.0710678118654755,8.774964387392123]
mss.normL2
//Variance of each column
//org.apache.spark.mllib.linalg.Vector = [1.0,1.0,1.0,1.0]
mss.variance
//Covariance matrix
//covariance: org.apache.spark.mllib.linalg.Matrix =
//1.0 1.0 1.0 1.0
//1.0 1.0 1.0 1.0
//1.0 1.0 1.0 1.0
//1.0 1.0 1.0 1.0
var covariance:Matrix=rowMatirx.computeCovariance()
//Gramian matrix rowMatirx^T * rowMatirx (T denotes transpose)
//gramianMatrix: org.apache.spark.mllib.linalg.Matrix =
//14.0 20.0 26.0 32.0
//20.0 29.0 38.0 47.0
//26.0 38.0 50.0 62.0
//32.0 47.0 62.0 77.0
var gramianMatrix:Matrix=rowMatirx.computeGramianMatrix()
//Principal component analysis; the argument is the number of columns to return, i.e. the number of principal components
//PCA is a classic dimensionality-reduction algorithm
//principalComponents: org.apache.spark.mllib.linalg.Matrix =
//-0.5000000000000002 0.8660254037844388
//-0.5000000000000002 -0.28867513459481275
//-0.5000000000000002 -0.28867513459481287
//-0.5000000000000002 -0.28867513459481287
var principalComponents=rowMatirx.computePrincipalComponents(2)
/**
* Singular value decomposition of a matrix A (m x n): computes U, S, V such that
* A ~= U * S * V', where S holds the top k singular values and U, V the corresponding singular vectors
*/
// svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] =
//SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@688884e,[13.011193721236575,0.8419251442105343,7.793650306633694E-8],-0.2830233037672786 -0.7873358937103356 -0.5230588083704528
//-0.4132328277901395 -0.3594977469144485 0.5762839813994667
//-0.5434423518130005 0.06834039988143598 0.4166084623124157
//-0.6736518758358616 0.4961785466773299 -0.4698336353414313 )
var svd:SingularValueDecomposition[RowMatrix, Matrix]=rowMatirx.computeSVD(3,true)
//Matrix multiplication
var multiplyMatrix:RowMatrix=rowMatirx.multiply(Matrices.dense(4, 1, Array(1.0,2.0,3.0,4.0)))
}
IndexedRowMatrix
A RowMatrix with row indices: each IndexedRow pairs an index with the vector stored at that row, as sketched below.
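A minimal IndexedRowMatrix sketch (assuming the same SparkContext sc as in the surrounding examples):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
//Each row carries an explicit Long index alongside its vector
val indexedRows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0)),
  IndexedRow(1L, Vectors.dense(2.0, 3.0, 4.0)),
  IndexedRow(3L, Vectors.dense(3.0, 4.0, 5.0)) //indices may have gaps
))
val indexedRowMatrix = new IndexedRowMatrix(indexedRows)
//Dimensions are inferred from the data (numRows = max index + 1)
println(indexedRowMatrix.numRows()) // 4
println(indexedRowMatrix.numCols()) // 3
//Dropping the indices yields a plain RowMatrix
val rowMatrixFromIndexed = indexedRowMatrix.toRowMatrix()
//Conversions to the other distributed matrix types
indexedRowMatrix.toCoordinateMatrix()
indexedRowMatrix.toBlockMatrix()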
Using BlockMatrix
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.SparkConf
/**
* Created by yuyin on 17/1/9.
*/
object BlockMatrixDemo extends App{
val sparkConf = new SparkConf().setAppName("BlockMatrixDemo").setMaster("spark://sparkmaster:7077") //这里指在本地运行,2个线程
val sc = new SparkContext(sparkConf)
implicit def double2long(x:Double)=x.toLong
val rdd1= sc.parallelize(
Array(
Array(1.0,20.0,30.0,40.0),
Array(2.0,50.0,60.0,70.0),
Array(3.0,80.0,90.0,100.0)
)
).map(f => IndexedRow(f.take(1)(0),Vectors.dense(f.drop(1))))
val indexRowMatrix = new IndexedRowMatrix(rdd1)
//Convert the IndexedRowMatrix to a BlockMatrix, giving the rows/columns per block
val blockMatrix:BlockMatrix=indexRowMatrix.toBlockMatrix(2, 2)
//Output of the println below:
//Index:(0,0)MatrixContent:2 x 2 CSCMatrix
//(1,0) 20.0
//(1,1) 30.0
//Index:(1,1)MatrixContent:2 x 1 CSCMatrix
//(0,0) 70.0
//(1,0) 100.0
//Index:(1,0)MatrixContent:2 x 2 CSCMatrix
//(0,0) 50.0
//(1,0) 80.0
//(0,1) 60.0
//(1,1) 90.0
//Index:(0,1)MatrixContent:2 x 1 CSCMatrix
//(1,0) 40.0
//As the output shows, each block is stored as a sparse matrix in CSC format
blockMatrix.blocks.foreach(f=>println("Index:"+f._1+"MatrixContent:"+f._2))
//Convert to a local matrix
//0.0 0.0 0.0
//20.0 30.0 40.0
//50.0 60.0 70.0
//80.0 90.0 100.0
//As the result shows, when the block sizes given to indexRowMatrix.toBlockMatrix(2, 2)
//do not match the actual matrix dimensions, the blocks are zero-padded accordingly
blockMatrix.toLocalMatrix()
//Block matrix addition
blockMatrix.add(blockMatrix)
//Block matrix multiplication: blockMatrix * blockMatrix^T (T denotes transpose)
blockMatrix.multiply(blockMatrix.transpose)
//Convert to a CoordinateMatrix
blockMatrix.toCoordinateMatrix()
//Convert to an IndexedRowMatrix
blockMatrix.toIndexedRowMatrix()
//Validate the block matrix
blockMatrix.validate()
}
The org.apache.spark.mllib.stat package and subpackages: statistics basics
http://spark.apache.org/docs/latest/mllib-statistics.html#kernel-density-estimation
Column-wise summary statistics
e.g. each column's maximum, minimum, mean, and other statistics
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
object StatisticsDemo extends App {
val sparkConf = new SparkConf().setAppName("StatisticsDemo").setMaster("spark://sparkmaster:7077")
val sc = new SparkContext(sparkConf)
val rdd1= sc.parallelize(
Array(
Array(1.0,2.0,3.0,4.0),
Array(2.0,3.0,4.0,5.0),
Array(3.0,4.0,5.0,6.0)
)
).map(f => Vectors.dense(f))
//MultivariateStatisticalSummary appeared in the first section, obtained via
// var mss:MultivariateStatisticalSummary=rowMatirx.computeColumnSummaryStatistics()
// Here the same statistics come from the Statistics object; internally the two are the same and
// both ultimately return a MultivariateOnlineSummarizer instance (covered in the next subsection).
//The source of Statistics.colStats is:
// def colStats(X: RDD[Vector]): MultivariateStatisticalSummary = {
//   new RowMatrix(X).computeColumnSummaryStatistics()
//}
//So Statistics.colStats simply calls RowMatrix's computeColumnSummaryStatistics
val mss:MultivariateStatisticalSummary=Statistics.colStats(rdd1)
//Hence the calls below return exactly the same results as calling
//computeColumnSummaryStatistics in the first section
mss.max
mss.min
mss.normL1
//plus normL2 and the other statistics
}
Kernel density estimation
Spark implements only the Gaussian kernel
import org.apache.spark.mllib.stat.KernelDensity
val sample = sc.parallelize(Seq(0.0, 1.0, 4.0, 4.0))
val kernelDensity=new KernelDensity()
.setSample(sample) //the sample to estimate the density from
.setBandwidth(3.0) //the bandwidth; for the Gaussian kernel this is the standard deviation
//Estimate the probability density at the given points
//densities: Array[Double] =
//Array(0.07464879256673691, 0.1113106036883375, 0.08485447240456075)
val densities = kernelDensity.estimate(Array(-1.0, 2.0, 5.0))
Hypothesis testing
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
val land1 = Vectors.dense(1000.0, 1856.0)
val land2 = Vectors.dense(400.0, 560.0)
//Pearson's chi-squared goodness-of-fit test of land1 against the expected frequencies land2
val c1 = Statistics.chiSqTest(land1, land2)
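The returned ChiSqTestResult carries the statistic, degrees of freedom, p-value, and method name; printing the result gives a formatted summary:
println(c1.statistic)
println(c1.degreesOfFreedom)
println(c1.pValue)
println(c1.method)
println(c1) //formatted summary including the null hypothesis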
Correlation analysis
Spark implements two correlation methods: Pearson and Spearman; a Spearman sketch follows the code below
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.{Matrix, Vector}
object CorrelationDemo extends App {
val sparkConf = new SparkConf().setAppName("StatisticsDemo").setMaster("spark://sparkmaster:7077")
val sc = new SparkContext(sparkConf)
val rdd1:RDD[Double] = sc.parallelize(Array(11.0, 21.0, 13.0, 14.0))
val rdd2:RDD[Double] = sc.parallelize(Array(11.0, 20.0, 13.0, 16.0))
//Correlation between two RDDs
//Returns: correlation: Double = 0.959034501397483
//Range [-1, 1]; the closer to 1, the stronger the correlation
val correlation:Double = Statistics.corr(rdd1, rdd2, "pearson")
val rdd3:RDD[Vector]= sc.parallelize(
Array(
Array(1.0,2.0,3.0,4.0),
Array(2.0,3.0,4.0,5.0),
Array(3.0,4.0,5.0,6.0)
)
).map(f => Vectors.dense(f))
//correlation3: org.apache.spark.mllib.linalg.Matrix =
//1.0 1.0 1.0 1.0
//1.0 1.0 1.0 1.0
//1.0 1.0 1.0 1.0
//1.0 1.0 1.0 1.0
val correlation3:Matrix = Statistics.corr(rdd3, "pearson")
}
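For Spearman rank correlation, pass "spearman" as the method name (a quick sketch reusing rdd1 and rdd2 from above):
val spearman: Double = Statistics.corr(rdd1, rdd2, "spearman")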
Stratified sampling
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkConf
object StratifiedSampleDemo extends App {
val sparkConf = new SparkConf().setAppName("StatisticsDemo").setMaster("spark://sparkmaster:7077")
val sc = new SparkContext(sparkConf)
//Read the README.md file from HDFS
val textFile = sc.textFile("/README.md")
//Word count, returning (K,V) totals
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
//Sample the key "Spark" with a fraction of 0.5
val fractions: Map[String, Double] = Map("Spark"->0.5)
//Sample with sampleByKey
val approxSample = wordCounts.sampleByKey(false, fractions)
//sampleByKeyExact is more expensive than sampleByKey, but the sample size
//is closer to the expected size, with 99.99% confidence
val exactSample = wordCounts.sampleByKeyExact(false, fractions)
}
A generic example over arbitrary key-value pairs:
// an RDD[(K, V)] of any key value pairs
val data = sc.parallelize(
Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')))
// specify the exact fraction desired from each key
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)
// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)
Random data generation
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext
scala> import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.mllib.random.RandomRDDs._
//Generate an RDD of 100 values from the standard normal N(0,1), in 10 partitions
scala> val u = normalRDD(sc, 100L, 10)
u: org.apache.spark.rdd.RDD[Double] = RandomRDD[26] at RDD at RandomRDD.scala:38
//Transform so the values follow N(1,4)
scala> val v = u.map(x => 1.0 + 2.0 * x)
v: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[27] at map at <console>:27
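RandomRDDs offers other distributions as well, for example uniform and Poisson (a quick sketch):
//100 values uniform on [0.0, 1.0], in 10 partitions
scala> val uni = uniformRDD(sc, 100L, 10)
//100 values from a Poisson distribution with mean 2.0, in 10 partitions
scala> val poi = poissonRDD(sc, 2.0, 100L, 10)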
Spark MLlib algorithms
See my other article.