Apache Spark Day2
RDD Operations
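RDDs support two types of operations: transformations, which turn an existing RDD into a new RDD, and actions, which usually return a result to the Driver once they finish. All transformations in Spark are lazy: they do not execute immediately but only record the transformation logic applied to the current RDD. The transformations are actually computed only when an action requires a result to be returned to the Driver program. This design lets Spark run more efficiently.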
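The laziness described above can be observed with a small local program. This is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name LazyEvaluationSketch and the accumulator name mapCalls are hypothetical and exist only to observe when the map function runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls before the action, result of the action, map calls after the action).
  def run(): (Long, Int, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical counter, used only to see when the map function executes.
      val calls = sc.longAccumulator("mapCalls")

      // Transformation: Spark only records it, nothing is computed yet.
      val doubled = sc.parallelize(Seq(1, 2, 3)).map { n => calls.add(1L); n * 2 }
      val before = calls.value.longValue()

      // Action: triggers the job, so the map function finally runs.
      val total = doubled.reduce(_ + _)
      val after = calls.value.longValue()

      (before, total, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, total, after) = run()
    println(s"map calls before action: $before, total: $total, map calls after action: $after")
  }
}
```

In a local run without task retries, the counter should still be zero before reduce is called and equal to the number of elements afterwards. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this counting trick is for illustration only.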
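By default, each transformed RDD may be recomputed every time an action is run on it. However, you can use the persist (or cache) method to keep an RDD in memory, in which case Spark keeps the elements around on the cluster so that the next query can access them faster.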
scala> var rdd1=sc.textFile("hdfs:///words/src").map(line=>line.split(" ").length)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24
scala> rdd1.cache
res54: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24
scala> rdd1.reduce(_+_)
res55: Int = 15
scala> rdd1.reduce(_+_)
res56: Int = 15
Spark also supports persisting RDDs on disk or replicating them across multiple nodes. For example, calling persist(StorageLevel.DISK_ONLY_2) stores the RDD on disk and keeps 2 copies.
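A storage level is passed to persist as shown in this minimal sketch. It assumes Spark is on the classpath and uses a local master with in-memory input in place of the HDFS file; the object name PersistSketch is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Runs the same action twice on a persisted RDD and returns both results.
  def run(): (Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = new SparkContext(conf)
    try {
      val wordsPerLine = sc.parallelize(Seq("this is a demo", "good good study"))
        .map(line => line.split(" ").length)

      // Ask Spark to keep the computed partitions on disk, with 2 replicas.
      wordsPerLine.persist(StorageLevel.DISK_ONLY_2)

      val first = wordsPerLine.reduce(_ + _)  // computes the partitions and stores them
      val second = wordsPerLine.reduce(_ + _) // can read the stored partitions instead of recomputing

      // Release the stored blocks once they are no longer needed.
      wordsPerLine.unpersist()
      (first, second)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

On a single local JVM there is no second node to hold the replica, so Spark may log a replication warning; the replica only matters on a real cluster.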
Transformations
Reference: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
√map(func)
Return a new distributed dataset formed by passing each element of the source through a function func.
Converts an RDD[U] into an RDD[T]. The user supplies an anonymous function func: U => T for the conversion.
scala> var rdd:RDD[String]=sc.makeRDD(List("a","b","c","a"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[120] at makeRDD at <console>:25
scala> val mapRDD:RDD[(String,Int)] = rdd.map(w => (w, 1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[121] at map at <console>:26
√filter(func)
Return a new dataset formed by selecting those elements of the source on which func returns true.
Filters the elements of an RDD[U] and produces a new RDD[U]. The user supplies func: U => Boolean, and only the elements for which it returns true are kept.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[122] at makeRDD at <console>:25
scala> val mapRDD:RDD[Int]=rdd.filter(num=> num %2 == 0)
mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[123] at filter at <console>:26
scala> mapRDD.collect
res63: Array[Int] = Array(2, 4)
√flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
Like map, it converts an RDD[U] into an RDD[T], but the user supplies a function func: U => Seq[T].
scala> var rdd:RDD[String]=sc.makeRDD(List("this is","good good"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[124] at makeRDD at <console>:25
scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap(line=> for(i<- line.split("\\s+")) yield (i,1))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[125] at flatMap at <console>:26
scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap( line=> line.split("\\s+").map((_,1)))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[126] at flatMap at <console>:26
scala> flatMapRDD.collect
res64: Array[(String, Int)] = Array((this,1), (is,1), (good,1), (good,1))
√mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
The input to the function is the full data of one partition rather than a single element, so the user supplies a per-partition conversion function.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at makeRDD at <console>:25
scala> var mapPartitionsRDD=rdd.mapPartitions(values => values.map(n=>(n,n%2==0)))
mapPartitionsRDD: org.apache.spark.rdd.RDD[(Int, Boolean)] = MapPartitionsRDD[129] at mapPartitions at <console>:26
scala> mapPartitionsRDD.collect
res70: Array[(Int, Boolean)] = Array((1,false), (2,true), (3,false), (4,true), (5,false))
√mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
The index tells the function which partition the elements it receives belong to.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[139] at makeRDD at <console>:25
scala> var mapPartitionsWithIndexRDD=rdd.mapPartitionsWithIndex((p,values) => values.map(n=>(n,p)))
mapPartitionsWithIndexRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[140] at mapPartitionsWithIndex at <console>:26
scala> mapPartitionsWithIndexRDD.collect
res77: Array[(Int, Int)] = Array((1,0), (2,0), (3,0), (4,1), (5,1), (6,1))
sample(withReplacement, fraction, seed)
Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
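Draws a sample from the RDD. withReplacement controls whether an element may be sampled more than once, fraction controls the approximate share of the data to sample, and seed seeds the random number generator used during sampling.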
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[150] at makeRDD at <console>:25
scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,1L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[151] at sample at <console>:26
scala> simpleRDD.collect
res91: Array[Int] = Array(1, 5, 6)
A different seed changes the sampling result!
union(otherDataset)
Return a new dataset that contains the union of the elements in the source dataset and the argument.
Merges the elements of two RDDs of the same type. Duplicates are kept.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at <console>:25
scala> rdd.union(rdd2).collect
res95: Array[Int] = Array(1, 2, 3, 4, 5, 6, 6, 7)
intersection(otherDataset)
Return a new RDD that contains the intersection of elements in the source dataset and the argument.
Computes the intersection of the elements of two RDDs of the same type.
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at <console>:25
scala> rdd.intersection(rdd2).collect
res100: Array[Int] = Array(6)
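distinct([numPartitions])
Return a new dataset that contains the distinct elements of the source dataset.
Removes duplicate elements from the RDD. numPartitions is optional and sets the number of partitions of the result. It is typically passed when deduplication shrinks the data volume sharply, so that the partition count can be reduced as well.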
scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at <console>:25
scala> rdd.distinct(3).collect
res106: Array[Int] = Array(6, 3, 4, 1, 5, 2)
√join(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
join itself performs an inner join.
scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at makeRDD at <console>:25
scala> case class OrderItem(name:String,price:Double,count:Int)
defined class OrderItem
scala> var orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[206] at makeRDD at <console>:27
scala> userRDD.join(orderItemRDD).collect
res107: Array[(Int, (String, OrderItem))] = Array((1,(zhangsan,OrderItem(apple,4.5,2))))
scala> userRDD.leftOuterJoin(orderItemRDD).collect
res108: Array[(Int, (String, Option[OrderItem]))] = Array((1,(zhangsan,Some(OrderItem(apple,4.5,2)))), (2,(lisi,None)))
cogroup(otherDataset, [numPartitions]) - for awareness
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at makeRDD at <console>:25
scala> var orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2)),(1,OrderItem("pear",1.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[215] at makeRDD at <console>:27
scala> userRDD.cogroup(orderItemRDD).collect
res110: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,(CompactBuffer(zhangsan),CompactBuffer(OrderItem(apple,4.5,2), OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))
scala> userRDD.groupWith(orderItemRDD).collect
res119: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,(CompactBuffer(zhangsan),CompactBuffer(OrderItem(apple,4.5,2), OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))
cartesian(otherDataset) - for awareness
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
Computes the Cartesian product of the two datasets.
scala> var rdd1:RDD[Int]=sc.makeRDD(List(1,2,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[238] at makeRDD at <console>:25
scala> var rdd2:RDD[String]=sc.makeRDD(List("a","b","c"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[239] at makeRDD at <console>:25
scala> rdd1.cartesian(rdd2).collect
res120: Array[(Int, String)] = Array((1,a), (1,b), (1,c), (2,a), (2,b), (2,c), (4,a), (4,b), (4,c))
coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
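After a large share of the data has been filtered out, coalesce can be used to shrink the number of partitions of the RDD. With its default settings it can only decrease the partition count, not increase it: asking for more partitions leaves the count unchanged.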
scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at <console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).coalesce(3).getNumPartitions
res127: Int = 3
scala> rdd1.filter(n=> n%2 == 0).coalesce(12).getNumPartitions
res128: Int = 6
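The limitation seen above applies to coalesce without a shuffle. RDD.coalesce also accepts a shuffle flag, and with shuffle = true it can raise the partition count as well. This is a minimal sketch, assuming Spark is on the classpath and a local master; the object name CoalesceSketch is hypothetical, and the starting partition count is fixed at 6 explicitly to mirror the transcript.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns the partition counts after coalesce(3), coalesce(12) and coalesce(12, shuffle = true).
  def run(): (Int, Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch")
    val sc = new SparkContext(conf)
    try {
      // Start from 6 partitions, as in the transcript above.
      val evens = sc.parallelize(0 to 100, 6).filter(n => n % 2 == 0)

      val narrowed = evens.coalesce(3).getNumPartitions                  // fewer partitions, no shuffle
      val unchanged = evens.coalesce(12).getNumPartitions                // cannot grow without a shuffle
      val widened = evens.coalesce(12, shuffle = true).getNumPartitions  // a shuffle allows growth

      (narrowed, unchanged, widened)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The shuffle costs network traffic, so the flag is a trade-off: leave it off when only merging partitions, and turn it on when more parallelism or better balance is worth the data movement.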
repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
Similar to coalesce, but this operator can either increase or decrease the number of partitions of the RDD.
scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at <console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).repartition(12).getNumPartitions
res130: Int = 12
scala> rdd1.filter(n=> n%2 == 0).repartition(3).getNumPartitions
res131: Int = 3
repartitionAndSortWithinPartitions(partitioner) - for awareness
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
The partitioner is supplied by the caller, and the RDD must consist of key-value pairs because the records are sorted by key.
scala> case class User(name:String,deptNo:Int)
defined class User
import org.apache.spark.Partitioner
var empRDD:RDD[User]= sc.parallelize(List(User("张三",1),User("lisi",2),User("wangwu",1)))
empRDD.map(t => (t.deptNo, t.name)).repartitionAndSortWithinPartitions(new Partitioner {
  override def numPartitions: Int = 4
  override def getPartition(key: Any): Int = {
    // Clear the sign bit first, then take the remainder; % binds tighter than &, so the parentheses are required
    (key.hashCode() & Integer.MAX_VALUE) % numPartitions
  }
}).mapPartitionsWithIndex((p,values)=> {
  //println(p+"\t"+values.mkString("|"))
  values.map(v=>(p,v))
}).collect()
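The partition index computed by a hash-style partitioner depends on Scala's operator precedence: % binds more tightly than &, so writing hash & Integer.MAX_VALUE % n without parentheses masks the hash with Integer.MAX_VALUE % n instead of clearing the sign bit. The following plain-Scala sketch (no Spark needed) contrasts the two forms; the object and function names are hypothetical.

```scala
object PartitionIndexSketch {
  // Intended form: clear the sign bit, then take the remainder.
  // The result always lies in [0, numPartitions).
  def partitionFor(key: Any, numPartitions: Int): Int =
    (key.hashCode() & Integer.MAX_VALUE) % numPartitions

  // Form without parentheses: parsed as hash & (Integer.MAX_VALUE % numPartitions).
  def unparenthesized(key: Any, numPartitions: Int): Int =
    key.hashCode() & Integer.MAX_VALUE % numPartitions

  def main(args: Array[String]): Unit = {
    // With 3 partitions the mask is Integer.MAX_VALUE % 3 == 1,
    // so the unparenthesized form can only ever produce 0 or 1.
    val used = (0 until 100).map(k => unparenthesized(k, 3)).toSet
    println(s"partitions reachable without parentheses: $used")
    println(s"partitionFor(5, 3) = ${partitionFor(5, 3)}")
    println(s"partitionFor(-7, 3) = ${partitionFor(-7, 3)}")
  }
}
```

With 4 partitions the unparenthesized form happens to stay in range, because Integer.MAX_VALUE % 4 is 3 and a bit mask of 3 selects the two lowest bits. For partition counts that are not a power of two it leaves some partitions unused, which is why the parenthesized form is the safer default.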
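Question
1. If two very large files need to be joined, what optimization strategies are available?
√xxxByKey operators (must know)
Spark provides a family of xxxByKey operators dedicated to datasets of type RDD[(K,V)].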
- groupByKey([numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
This resembles the MapReduce computation model: it converts an RDD[(K, V)] into an RDD[(K, Iterable[V])].
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupByKey.collect
res3: Array[(String, Iterable[Int])] = Array((this,CompactBuffer(1)), (is,CompactBuffer(1)), (good,CompactBuffer(1, 1)))
- groupBy(f:(k,v)=> T)
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1)
res5: org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] = ShuffledRDD[18] at groupBy at <console>:26
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1).map(t=>(t._1,t._2.size)).collect
res6: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
- reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
res8: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
- aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral “zero” value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).collect
res9: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
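The word count above uses the same type for input values and results, so it does not show why aggregateByKey takes two functions. The following minimal sketch computes a per-key average, where the aggregated type (sum, count) differs from the input type Int. It assumes Spark is on the classpath and a local master; the object name and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeySketch {
  // Returns the average value per key.
  def run(): Map[String, Double] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("aggregate-by-key-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data spread over 2 partitions.
      val scores = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 4)), 2)

      val sumAndCount = scores.aggregateByKey((0, 0))(
        // seqOp: fold one input value into the accumulator inside a partition
        (acc, v) => (acc._1 + v, acc._2 + 1),
        // combOp: merge the accumulators produced by different partitions
        (x, y) => (x._1 + y._1, x._2 + y._2)
      )

      sumAndCount
        .mapValues { case (sum, count) => sum.toDouble / count }
        .collect()
        .toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The zero value (0, 0) must be neutral for both functions, because Spark may use it once per key in every partition.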
- sortByKey([ascending], [numPartitions])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortByKey(true).collect
res13: Array[(String, Int)] = Array((good,2), (is,1), (this,1))
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortByKey(false).collect
res14: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
- sortBy(T=>U,ascending,[numPartitions])
scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortBy(_._2,false).collect
res18: Array[(String, Int)] = Array((good,2), (this,1), (is,1))
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).sortBy(t=>t._2,true).collect
res19: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
Actions
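Every Spark job has exactly one action operator, which triggers the execution of the job. An action either writes the data of the RDD to an external system or passes it to the Driver program.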
reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
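This operator aggregates the results on the remote compute nodes and returns the final value to the Driver. The example below counts the words in the file.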
scala> sc.textFile("hdfs:///words/src").map(_.split("\\s+").length).reduce(_+_)
res56: Int = 13
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
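Transfers the data of the distributed RDD to the Driver. Use collect only in test environments or when the RDD is very small; otherwise the Driver may run out of memory because the data is too large.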
scala> sc.textFile("hdfs:///words/src").collect
res58: Array[String] = Array(this is a demo, good good study, day day up, come on baby)
It is generally used for testing, turning the distributed RDD into a local Array.
√foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
scala> sc.textFile("file:///root/t_word").foreach(line=>println(line))
count()
Return the number of elements in the dataset.
scala> sc.textFile("file:///root/t_word").count()
res7: Long = 5
first()|take(n)
Return the first element of the dataset (similar to take(1)). take(n) returns an array with the first n elements of the dataset.
scala> sc.textFile("file:///root/t_word").first
res9: String = this is a demo
scala> sc.textFile("file:///root/t_word").take(1)
res10: Array[String] = Array(this is a demo)
scala> sc.textFile("file:///root/t_word").take(2)
res11: Array[String] = Array(this is a demo, hello spark)
takeSample(withReplacement, num, [seed])
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
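Randomly samples num elements from the RDD and returns them to the Driver program as an array. This is the key difference from the sample transformation, which returns a new RDD.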
scala> sc.textFile("file:///root/t_word").takeSample(false,2)
res20: Array[String] = Array("good good study ", hello spark)
takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural order or a custom comparator.
Returns the first n elements of the RDD in sorted order; the user can supply the comparison rule.
scala> case class User(name:String,deptNo:Int,salary:Double)
defined class User
scala> var userRDD=sc.parallelize(List(User("zs",1,1000.0),User("ls",2,1500.0),User("ww",2,1000.0)))
userRDD: org.apache.spark.rdd.RDD[User] = ParallelCollectionRDD[51] at parallelize at <console>:26
scala> userRDD.takeOrdered
def takeOrdered(num: Int)(implicit ord: Ordering[User]): Array[User]
scala> userRDD.takeOrdered(3)
<console>:26: error: No implicit Ordering defined for User.
userRDD.takeOrdered(3)
scala> implicit var userOrder=new Ordering[User]{
| override def compare(x: User, y: User): Int = {
| if(x.deptNo!=y.deptNo){
| x.deptNo.compareTo(y.deptNo)
| }else{
| x.salary.compareTo(y.salary) * -1
| }
| }
| }
userOrder: Ordering[User] = $anon$1@7066f4bc
scala> userRDD.takeOrdered(3)
res23: Array[User] = Array(User(zs,1,1000.0), User(ls,2,1500.0), User(ww,2,1000.0))
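The hand-written compare method above can also be expressed with Ordering.by and a tuple key. The sketch below is plain Scala (no Spark needed): it models takeOrdered(n) on a local Seq as sort-then-take, which gives the same elements as the RDD version for the same ordering. The names TakeOrderedSketch and takeOrderedLocal are hypothetical.

```scala
object TakeOrderedSketch {
  case class User(name: String, deptNo: Int, salary: Double)

  // Department ascending, then salary descending (negated so larger salaries come first).
  implicit val userOrdering: Ordering[User] =
    Ordering.by((u: User) => (u.deptNo, -u.salary))

  // Local stand-in for RDD.takeOrdered(n): sort with the implicit ordering, keep the first n.
  def takeOrderedLocal(users: Seq[User], n: Int): Seq[User] =
    users.sorted.take(n)

  def main(args: Array[String]): Unit = {
    val users = Seq(User("ww", 2, 1000.0), User("ls", 2, 1500.0), User("zs", 1, 1000.0))
    println(takeOrderedLocal(users, 3))
  }
}
```

The tuple ordering relies on the default Ordering for Double, which Scala 2.13 may flag with a deprecation warning; the explicit compare method shown in the transcript avoids that at the cost of more code.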
√saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
scala> sc.textFile("file:///root/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).map(t=> t._1+"\t"+t._2).saveAsTextFile("hdfs:///demo/results02")
saveAsSequenceFile(path)
Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
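This method is only available on RDD[(K, V)], and both K and V must implement the Writable interface. In Scala, Spark provides implicit conversions so that types such as Int, Double, and String are converted to Writable automatically.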
scala> sc.textFile("file:///root/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).saveAsSequenceFile("hdfs:///demo/results03")
scala> sc.sequenceFile[String,Int]("hdfs:///demo/results03").collect
res29: Array[(String, Int)] = Array((a,1), (baby,1), (come,1), (day,2), (demo,1), (good,2), (hello,1), (is,1), (on,1), (spark,1), (study,1), (this,1), (up,1))
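Shared Variables
When a transformation uses a variable defined in the Driver, the variable is sent over the network to the compute nodes before the operator runs there. If a compute node then modifies its copy, the change is not visible to the variable defined on the Driver side.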
scala> var i:Int=0
i: Int = 0
scala> sc.textFile("file:///root/t_word").foreach(line=> i=i+1)
scala> print(i)
0
√Broadcast Variables
Question:
When a very large dataset is joined with a small one, can the join operator be used directly? If not, why not?
//100GB
var orderItems=List("001 apple 2 4.5","002 pear 1 2.0","001 瓜子 1 7.0")
//10MB
var users=List("001 zhangsan","002 lisi","003 王五")
var rdd1:RDD[(String,String)] =sc.makeRDD(orderItems).map(line=>(line.split(" ")(0),line))
var rdd2:RDD[(String,String)] =sc.makeRDD(users).map(line=>(line.split(" ")(0),line))
rdd1.join(rdd2).collect().foreach(println)
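The join triggers a shuffle, which moves the 100 GB of data between the compute nodes, so both the network cost and the memory cost of the join are high. Instead, the small dataset can be defined as a variable in the Driver and the join performed inside a map operation.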
scala> var users=List("001 zhangsan","002 lisi","003 王五").map(line=>line.split(" ")).map(ts=>ts(0)->ts(1)).toMap
users: scala.collection.immutable.Map[String,String] = Map(001 -> zhangsan, 002 -> lisi, 003 -> 王五)
scala> var orderItems=List("001 apple 2 4.5","002 pear 1 2.0","001 瓜子 1 7.0")
orderItems: List[String] = List(001 apple 2 4.5, 002 pear 1 2.0, 001 瓜子 1 7.0)
scala> var rdd1:RDD[(String,String)] =sc.makeRDD(orderItems).map(line=>(line.split(" ")(0),line))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[89] at map at <console>:32
scala> rdd1.map(t=> t._2+"\t"+users.get(t._1).getOrElse("未知")).collect()
res33: Array[String] = Array(001 apple 2 4.5 zhangsan, 002 pear 1 2.0 lisi, 001 瓜子 1 7.0 zhangsan)
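This approach has a drawback, though: users is captured by the function passed to map, so a copy of it is serialized and shipped with every task. The value is not large, but the compute nodes receive it again and again. Spark introduced broadcast variables to avoid these unnecessary repeated copies.
Early in the run, Spark announces the variables to be broadcast to all compute nodes. Each node downloads a broadcast variable before computing with it and caches it, so other threads on that node can use the variable without downloading it again.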
//100GB
var orderItems=List("001 apple 2 4.5","002 pear 1 2.0","001 瓜子 1 7.0")
//10MB, declared as a Map
var users:Map[String,String]=List("001 zhangsan","002 lisi","003 王五").map(line=>line.split(" ")).map(ts=>ts(0)->ts(1)).toMap
//Declare the broadcast variable; read the broadcast value through its value property
val ub = sc.broadcast(users)
var rdd1:RDD[(String,String)] =sc.makeRDD(orderItems).map(line=>(line.split(" ")(0),line))
rdd1.map(t=> t._2+"\t"+ub.value.get(t._1).getOrElse("未知")).collect().foreach(println)
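Accumulators
Spark's Accumulator lets tasks on multiple nodes update a single shared variable. An Accumulator only supports adding, but it does allow many tasks to operate on the same variable in parallel. Tasks can only add to an Accumulator and cannot read its value; only the Driver program can read it.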
scala> val accum = sc.longAccumulator("mycount")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 1075, name: Some(mycount), value: 0)
scala> sc.parallelize(Array(1, 2, 3, 4),6).foreach(x => accum.add(x))
scala> accum.value
res36: Long = 10
Writing Data Out of Spark
Writing to HDFS
scala> sc.textFile("file:///root/t_word").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).saveAsSequenceFile("hdfs:///demo/results03")
The saveAsXxx operators all write results to HDFS or the local file system. To write results to a third-party database, use the foreach operator that Spark provides.
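√Writing with foreach
Scenario 1: a connection is opened and closed for every record, so writing is very inefficient (but it does run successfully)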
sc.textFile("file:///root/t_word")
  .flatMap(_.split(" "))
  .map((_,1))
  .reduceByKey(_+_)
  .sortBy(_._1,true,3)
  .foreach(tuple=>{ // database
    //1. open a connection
    //2. insert the record
    //3. close the connection
  })
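Scenario 2: incorrect, because a connection pool cannot be serialized (fails at runtime)
//1. define the Connection
var conn=... // defined in the Driver
sc.textFile("file:///root/t_word")
  .flatMap(_.split(" "))
  .map((_,1))
  .reduceByKey(_+_)
  .sortBy(_._1,true,3)
  .foreach(tuple=>{ // database
    //2. insert the record
  })
//3. close the connection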
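Scenario 3: one connection per partition? Decent, but not optimal: one JVM may process several partitions, which means that JVM creates several connections and wastes resources. A singleton object?
sc.textFile("file:///root/t_word")
  .flatMap(_.split(" "))
  .map((_,1))
  .reduceByKey(_+_)
  .sortBy(_._1,true,3)
  .foreachPartition(values=>{
    // open a connection
    // write the data of this partition
    // close the connection
  })
Put the connection-creation code in a singleton object. Even if a compute node receives several partitions, a JVM singleton guarantees that the connection is created only once per JVM.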
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("SparkWordCountApplication")
val sc = new SparkContext(conf)
sc.textFile("hdfs://CentOS:9000/demo/words/")
  .flatMap(_.split(" "))
  .map((_,1))
  .reduceByKey(_+_)
  .sortBy(_._1,true,3)
  .foreachPartition(values=>{
    HbaseSink.writeToHbase("baizhi:t_word",values.toList)
  })
sc.stop()
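import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HConstants, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put}
import scala.collection.JavaConverters._

object HbaseSink {
  // Created lazily, at most once per JVM, on first use
  lazy val conn:Connection=createConnection()
  def createConnection(): Connection = {
    val hadoopConf = new Configuration()
    hadoopConf.set(HConstants.ZOOKEEPER_QUORUM,"CentOS")
    ConnectionFactory.createConnection(hadoopConf)
  }
  /**
   * @param tableName name of the target HBase table
   * @param values (word, count) pairs to write
   */
  def writeToHbase(tableName: String, values: List[(String, Int)]): Unit = {
    val bufferedMutator = conn.getBufferedMutator(TableName.valueOf(tableName))
    val puts: List[Put] = values.map(t => {
      val put = new Put(t._1.getBytes())
      // Store the count as its string form, without a trailing space
      put.addColumn("cf1".getBytes(), "count".getBytes(), t._2.toString.getBytes())
      put
    })
    // Write the whole batch
    bufferedMutator.mutate(puts.asJava)
    bufferedMutator.flush()
    bufferedMutator.close()
  }
  // Close the shared connection when the JVM exits
  sys.addShutdownHook({
    println("JVM is shutting down!")
    if(conn!=null){
      conn.close()
    }
  })
}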