Table of Contents

  • Transformation Operators
  • Basic Operators
  • 1. map(func)
  • 2. filter(func)
  • 3. flatMap
  • 4. Set Operations (union, intersection, distinct)
  • 5. Grouping (groupByKey, reduceByKey, cogroup)
  • 6. Sorting (sortBy, sortByKey)
  • Advanced Operators
  • 1. mapPartitionsWithIndex(func)
  • 2. aggregate
  • 3. aggregateByKey(zeroValue)(seqOp,combOp,[numTasks])
  • 4. Repartitioning: coalesce(numPartitions) and repartition(numPartitions)
  • 5. Other Advanced Operators


Transformation Operators

All transformations on an RDD are lazily evaluated; that is, they do not compute their results right away. Instead, they only record the transformations applied to the base dataset (for example, a file). The transformations are actually computed only when an action requires a result to be returned to the Driver. This design lets Spark run more efficiently.
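
A minimal sketch of this laziness, assuming a spark-shell session with an existing SparkContext sc (variable names are illustrative):

// The map transformation below is only recorded in the lineage; no job runs yet.
val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)   // nothing is computed here
doubled.collect                 // collect is an action: it triggers the actual computation
// expected: Array(2, 4, 6, 8, 10)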

The transformations are listed below (operator - meaning):

  • map(func) - Returns a new RDD formed by passing each element of the source through the function func
  • filter(func) - Filtering. Returns a new RDD formed by selecting those elements of the source for which func returns true
  • flatMap(func) - Flattening. Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element)
  • mapPartitions(func) - Similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]. Operates on each partition of the RDD (sketched after this list)
  • mapPartitionsWithIndex(func) - Similar to mapPartitions, but func also takes an integer representing the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]. Operates on each partition of the RDD and exposes the partition number
  • sample(withReplacement, fraction, seed) - Samples the data at the ratio given by fraction, with or without replacement, using seed for the random number generator (sketched after this list)
  • union(otherDataset) - Set operation. Returns a new RDD containing the union of the source RDD and the argument RDD
  • intersection(otherDataset) - Set operation. Returns a new RDD containing the intersection of the source RDD and the argument RDD
  • distinct([numTasks]) - Set operation. Returns a new RDD containing the distinct elements of the source RDD
  • groupByKey([numTasks]) - Grouping. When called on an RDD of (K,V) pairs, returns an RDD of (K, Iterable[V]) pairs. Relatively low level
  • reduceByKey(func, [numTasks]) - Grouping. When called on an RDD of (K,V) pairs, returns an RDD of (K,V) pairs where the values for each key are aggregated with the given reduce function. Unlike groupByKey, it combines values within each partition before the shuffle; the number of reduce tasks can be set through an optional second argument
  • aggregateByKey(zeroValue)(seqOp,combOp,[numTasks]) - Grouping. Aggregates the values of each key with a within-partition function (seqOp) and a cross-partition function (combOp). Commonly used
  • sortByKey([ascending], [numTasks]) - Sorting. When called on an RDD of (K,V) pairs where K implements Ordered, returns an RDD of (K,V) pairs sorted by key
  • sortBy(func,[ascending], [numTasks]) - Sorting. Similar to sortByKey, but more flexible
  • join(otherDataset, [numTasks]) - When called on RDDs of type (K,V) and (K,W), returns an RDD of (K,(V,W)) pairs with all pairs of elements for each key
  • cogroup(otherDataset, [numTasks]) - When called on RDDs of type (K,V) and (K,W), returns an RDD of type (K,(Iterable[V],Iterable[W]))
  • cartesian(otherDataset) - Cartesian product
  • pipe(command, [envVars]) - Pipes each partition of the RDD through an external command
  • coalesce(numPartitions) - Repartitioning (no shuffle by default)
  • repartition(numPartitions) - Repartitioning (always shuffles)
  • repartitionAndSortWithinPartitions(partitioner) - Repartitioning according to the given partitioner, sorting records by key within each partition
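
Two of the operators listed above, mapPartitions and sample, are not demonstrated in the sections below, so here is a minimal sketch of their use, assuming a spark-shell session (names and numbers are illustrative):

// mapPartitions processes a whole partition at a time: the function receives an
// Iterator over the partition and must return an Iterator.
val nums = sc.parallelize(1 to 9, 3)
val partialSums = nums.mapPartitions(iter => Iterator(iter.sum))  // one partial sum per partition
partialSums.collect   // expected: Array(6, 15, 24)

// sample draws a random subset: with or without replacement, a fraction, and an optional seed.
nums.sample(false, 0.5, 42).collect   // roughly half of the elements; the exact subset depends on the seed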

Basic Operators

1. map(func)

Returns a new RDD formed by passing each element of the source through the function func.

// Create an RDD of integers
scala> val rdd1 = sc.parallelize(List(5,6,7,8,9,1,2,3,100))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd1.partitions.length
res1: Int = 2

// Requirement: multiply each element by 2
scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:26

scala> rdd2.collect
res2: Array[Int] = Array(10, 12, 14, 16, 18, 2, 4, 6, 200)

2. filter(func)

Filtering. Returns a new RDD formed by selecting those elements for which func returns true.

// Requirement: keep the elements greater than 20
scala> val rdd3 = rdd2.filter(_>20)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[17] at filter at <console>:28

scala> rdd3.collect
res5: Array[Int] = Array(200)

3. flatMap

Flattening. Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).

// Create an RDD of strings
scala> val rdd4 = sc.parallelize(Array("a b c","d e f","x y z"))
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24

scala> val rdd5 = rdd4.flatMap(_.split(" "))
rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at flatMap at <console>:26

scala> rdd5.collect
res6: Array[String] = Array(a, b, c, d, e, f, x, y, z)

4. Set Operations (union, intersection, distinct)

  • union(otherDataset) - Returns a new RDD containing the union of the source RDD and the argument RDD
  • intersection(otherDataset) - Returns a new RDD containing the intersection of the source RDD and the argument RDD
  • distinct([numTasks]) - Returns a new RDD containing the distinct elements of the source RDD
// Set operations and deduplication (two RDDs are needed)
scala> val rdd6 = sc.parallelize(List(5,6,7,8,1,2,3,100))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> val rdd7 = sc.parallelize(List(1,2,3,4))
rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24

scala> val rdd8 = rdd6.union(rdd7)
rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[22] at union at <console>:28

scala> rdd8.collect
res7: Array[Int] = Array(5, 6, 7, 8, 1, 2, 3, 100, 1, 2, 3, 4)

scala> rdd8.distinct.collect
res8: Array[Int] = Array(100, 4, 8, 1, 5, 6, 2, 7, 3)

scala> val rdd9 = rdd6.intersection(rdd7)
rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at intersection at <console>:28

scala> rdd9.collect
res9: Array[Int] = Array(2, 1, 3)

5. Grouping (groupByKey, reduceByKey, cogroup)

  • groupByKey([numTasks]) - When called on an RDD of (K,V) pairs, returns an RDD of (K, Iterable[V]) pairs. Relatively low level.
  • reduceByKey(func, [numTasks]) - When called on an RDD of (K,V) pairs, returns an RDD of (K,V) pairs where the values for each key are aggregated with the given reduce function. Unlike groupByKey, it combines values within each partition before the shuffle; the number of reduce tasks can be set through an optional second argument.
  • cogroup(otherDataset, [numTasks]) - When called on RDDs of type (K,V) and (K,W), returns an RDD of type (K,(Iterable[V],Iterable[W]))
// Grouping operations: reduceByKey, groupByKey

scala> val rdd1 = sc.parallelize(List(("Tom",1000),("Jerry",3000),("Mery",2000)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("Jerry",1000),("Tom",3000),("Mike",2000)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[34] at union at <console>:28

scala> rdd3.collect
res10: Array[(String, Int)] = Array((Tom,1000), (Jerry,3000), (Mery,2000), (Jerry,1000), (Tom,3000), (Mike,2000))

scala> val rdd4 = rdd3.groupByKey
rdd4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[35] at groupByKey at <console>:30

scala> rdd4.collect
res11: Array[(String, Iterable[Int])] = 
Array(
(Tom,CompactBuffer(1000, 3000)), 
(Jerry,CompactBuffer(3000, 1000)), 
(Mike,CompactBuffer(2000)), 
(Mery,CompactBuffer(2000)))

scala> rdd3.reduceByKey(_+_).collect
res12: Array[(String, Int)] = Array((Tom,4000), (Jerry,4000), (Mike,2000), (Mery,2000))

// The cogroup operation

scala> val rdd1 = sc.parallelize(List(("Tom",1),("Tom",2),("Jerry",3),("Kitty",2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[37] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("Jerry",2),("Tom",1),("Andy",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24

scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[40] at cogroup at <console>:28

scala> rdd3.collect
res13: Array[(String, (Iterable[Int], Iterable[Int]))] = 
Array((Tom,(CompactBuffer(1, 2),CompactBuffer(1))), 
(Jerry,(CompactBuffer(3),CompactBuffer(2))), 
(Andy,(CompactBuffer(),CompactBuffer(2))), 
(Kitty,(CompactBuffer(2),CompactBuffer())))

// Difference between cogroup and groupByKey:
// Example 1:
groupByKey([numTasks]): called on an RDD of (K,V) pairs, returns an RDD of (K, Iterable[V]) pairs.
cogroup(otherDataset, [numTasks]): called on RDDs of type (K,V) and (K,W), returns an RDD of type (K,(Iterable[V],Iterable[W])).

scala> val rdd0 = sc.parallelize(Array((1,1), (1,2) , (1,3) , (2,1) , (2,2) , (2,3)), 3)
rdd0: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd1 = rdd0.groupByKey()
rdd1: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[1] at groupByKey at <console>:25

scala> rdd1.collect
res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1, 2, 3)), (2,CompactBuffer(1, 2, 3)))

scala> val rdd2 = rdd0.cogroup(rdd0)
rdd2: org.apache.spark.rdd.RDD[(Int, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[3] at cogroup at <console>:25

scala> rdd2.collect
res1: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(1, 2, 3),CompactBuffer(1, 2, 3))), (2,(CompactBuffer(1, 2, 3),CompactBuffer(1, 2, 3))))

// Example 2:
scala> b.collect
res3: Array[(Int, String)] = Array((1,b), (2,b), (1,b), (3,b))

scala> c.collect
res4: Array[(Int, String)] = Array((1,c), (2,c), (1,c), (3,c))

scala> b.cogroup(c).collect
res2: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b),CompactBuffer(c))))

scala> var rdd4=b union c
rdd4: org.apache.spark.rdd.RDD[(Int, String)] = UnionRDD[9] at union at <console>:27

scala> rdd4.collect
res6: Array[(Int, String)] = Array((1,b), (2,b), (1,b), (3,b), (1,c), (2,c), (1,c), (3,c))

scala> rdd4.groupByKey().collect
res5: Array[(Int, Iterable[String])] = Array((2,CompactBuffer(b, c)), (1,CompactBuffer(b, b, c, c)), (3,CompactBuffer(b, c)))
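
For comparison, join (from the transformation table above) keeps only the keys that appear in both RDDs and pairs their values, instead of collecting them into Iterables the way cogroup does. A minimal sketch, assuming the same spark-shell session (left and right are illustrative names):

val left  = sc.parallelize(List(("Tom", 1), ("Tom", 2), ("Jerry", 3), ("Kitty", 2)))
val right = sc.parallelize(List(("Jerry", 2), ("Tom", 1), ("Andy", 2)))

// join returns (K,(V,W)) only for keys present on both sides; Kitty and Andy are dropped.
left.join(right).collect
// roughly: Array((Tom,(1,1)), (Tom,(2,1)), (Jerry,(3,2)))

// cogroup keeps every key and groups the values per side.
left.cogroup(right).collect
// roughly: Array((Tom,(CompactBuffer(1, 2),CompactBuffer(1))), (Jerry,(CompactBuffer(3),CompactBuffer(2))),
//                (Andy,(CompactBuffer(),CompactBuffer(2))), (Kitty,(CompactBuffer(2),CompactBuffer())))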

6. Sorting (sortBy, sortByKey)

  • sortBy(func,[ascending], [numTasks]) - Similar to sortByKey, but more flexible
scala> val rdd2 = rdd1.map(_*2).sortBy(x=>x,true)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at sortBy at <console>:26

scala> rdd2.collect
res3: Array[Int] = Array(2, 4, 6, 10, 12, 14, 16, 18, 200)                      

scala> val rdd2 = rdd1.map(_*2).sortBy(x=>x,false)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[16] at sortBy at <console>:26

scala> rdd2.collect
res4: Array[Int] = Array(200, 18, 16, 14, 12, 10, 6, 4, 2)
  • sortByKey([ascending], [numTasks]) - When called on an RDD of (K,V) pairs where K implements Ordered, returns an RDD of (K,V) pairs sorted by key
// Requirement: sort by value.
// Note: sortByKey sorts by key.
// Approach: swap the key and the value.
// 1. Swap key and value, then call sortByKey.
// 2. Swap them back after the sort.

scala> val rdd1 = sc.parallelize(List(("Tom",1),("Andy",2),("Jerry",4),("Mike",5)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[42] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("Jerry",1),("Tom",2),("Mike",4),("Kitty",5)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[43] at parallelize at <console>:24

scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[44] at union at <console>:28

scala> rdd3.collect
res14: Array[(String, Int)] = Array((Tom,1), (Andy,2), (Jerry,4), (Mike,5), (Jerry,1), (Tom,2), (Mike,4), (Kitty,5))

scala> val rdd4 = rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[45] at reduceByKey at <console>:30

scala> rdd4.collect
res15: Array[(String, Int)] = Array((Tom,3), (Jerry,5), (Andy,2), (Mike,9), (Kitty,5))

scala> val rdd5 = rdd4.map(t => (t._2,t._1)).sortByKey(false).map(t=>(t._2,t._1))
rdd5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[50] at map at <console>:32

scala> rdd5.collect
res16: Array[(String, Int)] = Array((Mike,9), (Jerry,5), (Kitty,5), (Tom,3), (Andy,2))

// Execution breakdown:
// rdd4.map(t => (t._2,t._1))   --> (3,Tom), (5,Jerry), (2,Andy), (9,Mike), (5,Kitty)
// .sortByKey(false)  --> (9,Mike),(5,Jerry),(5,Kitty),(3,Tom),(2,Andy)
// .map(t=>(t._2,t._1))--> (Mike,9), (Jerry,5), (Kitty,5), (Tom,3), (Andy,2)
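
The same result can also be obtained without the double swap, because sortBy accepts an arbitrary key-extraction function. A small sketch under the same session (rdd5b is an assumed name):

// Sort the (name, total) pairs by the value field directly, descending.
val rdd5b = rdd4.sortBy(_._2, false)
rdd5b.collect
// roughly: Array((Mike,9), (Jerry,5), (Kitty,5), (Tom,3), (Andy,2))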

Advanced Operators

1. mapPartitionsWithIndex(func)

Similar to mapPartitions, but func also takes an integer representing the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]. Operates on each partition of the RDD and exposes the partition number.

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U])
  • Operates on each partition of the RDD together with its index (called index here)
  • This operator lets you obtain the partition number
  • Define a function to process each partition:
    f takes two arguments: the first is the partition number, the second is the iterator over the elements in that partition. Iterator[U] is the result after processing.

Example:

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> def fun1(index:Int,iter:Iterator[Int]) : Iterator[String] = {
     | iter.toList.map(x => "[PartId: " + index + " , value = " + x + " ]").iterator
     | }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]

scala> rdd1.mapPartitionsWithIndex(fun1).collect
res3: Array[String] = Array(
[PartId: 0 , value = 1 ], 
[PartId: 0 , value = 2 ], 
[PartId: 0 , value = 3 ], 
[PartId: 1 , value = 4 ], 
[PartId: 1 , value = 5 ], 
[PartId: 1 , value = 6 ], 
[PartId: 2 , value = 7 ], 
[PartId: 2 , value = 8 ], 
[PartId: 2 , value = 9 ])

2. aggregate

Aggregates the data locally within each partition first, then aggregates the partial results globally.


Example:

(1) Take the maximum in each partition locally, then sum the maxima globally

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> def fun1(index:Int,iter:Iterator[Int]) : Iterator[String] = {
     | iter.toList.map( x => "[PartId: " + index + " , value = " + x + "]").iterator
     | }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]

scala> rdd1.mapPartitionsWithIndex(fun1).collect
res0: Array[String] = Array([PartId: 0 , value = 1], [PartId: 0 , value = 2], [PartId: 1 , value = 3], [PartId: 1 , value = 4], [PartId: 1 , value = 5])

scala> import scala.math._
import scala.math._

// Local step: maximum per partition; global step: sum the maxima. 2 + 5 = 7
scala> rdd1.aggregate(0)(max(_,_),_+_)
res2: Int = 7

// Change the initial value to 10: 10 + 10 + 10 = 30
scala> rdd1.aggregate(10)(max(_,_),_+_)
res3: Int = 30
 
// Analysis:
// The initial value 10 is added to every partition's computation.
// Local step: the maximum in each partition is 10.
// Global step: the initial value 10 participates again, so 10 + 10 + 10 = 30.

(2) Sum within each partition first, then sum globally

Summing the elements of an RDD:
Method 1: RDD.map (followed by an action such as reduce)
Method 2: use an aggregation operation

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> def fun1(index:Int,iter:Iterator[Int]) : Iterator[String] = {
     | iter.toList.map( x => "[PartId: " + index + " , value = " + x + "]").iterator
     | }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]

scala> rdd1.mapPartitionsWithIndex(fun1).collect
res0: Array[String] = Array([PartId: 0 , value = 1], [PartId: 0 , value = 2], [PartId: 1 , value = 3], [PartId: 1 , value = 4], [PartId: 1 , value = 5])
 
// Sum within each partition, then sum globally: 3 + 12 = 15
scala> rdd1.aggregate(0)(_+_,_+_)
res5: Int = 15
 
// 10+(10+1+2)+(10+3+4+5) = 10+13+22 = 45
scala> rdd1.aggregate(10)(_+_,_+_)
res6: Int = 45

(3) Concatenating strings

scala> val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:27

scala> def fun1(index:Int,iter:Iterator[String]) : Iterator[String] = {
     | iter.toList.map( x => "[PartId: " + index + " , value = " + x + "]").iterator
     | }
fun1: (index: Int, iter: Iterator[String])Iterator[String]

scala> rdd2.mapPartitionsWithIndex(fun1).collect
res8: Array[String] = Array(
[PartId: 0 , value = a], [PartId: 0 , value = b], [PartId: 0 , value = c], 
[PartId: 1 , value = d], [PartId: 1 , value = e], [PartId: 1 , value = f])
 
scala> rdd2.aggregate("")(_+_,_+_)
res10: String = abcdef

scala> rdd2.aggregate("")(_+_,_+_)
res11: String = defabc

// Note: whichever partition finishes first comes first in the result

scala> rdd2.aggregate("*")(_+_,_+_)
res13: String = **def*abc

scala> rdd2.aggregate("*")(_+_,_+_)
res14: String = **abc*def

(4) A more complex example

scala> val rdd3 = sc.parallelize(List("12","23","345","4567"),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:27

scala> rdd3.mapPartitionsWithIndex(fun1)
res15: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at mapPartitionsWithIndex at <console>:32

scala> rdd3.mapPartitionsWithIndex(fun1).collect
res16: Array[String] = Array(
[PartId: 0 , value = 12], [PartId: 0 , value = 23], 
[PartId: 1 , value = 345], [PartId: 1 , value = 4567])
 
scala> rdd3.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y)=>x+y)
res20: String = 24

scala> rdd3.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y)=>x+y)
res21: String = 42

Analysis:
First partition:
First comparison: "" vs "12", maximum of the lengths: 2. 2 -> "2".
Second comparison: "2" vs "23", maximum of the lengths: 2. 2 -> "2".

Second partition:
First comparison: "" vs "345", maximum of the lengths: 3. 3 -> "3".
Second comparison: "3" vs "4567", maximum of the lengths: 4. 4 -> "4".

scala> val rdd3 = sc.parallelize(List("12","23","345",""),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at <console>:27

scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res22: String = 01

scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res23: String = 01

scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res24: String = 01

scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res25: String = 10

Analysis:
Note the initial value "": its length is 0, then 0.toString becomes the string "0", whose length is 1.

First partition:
First comparison: "" vs "12", minimum of the lengths: 0. 0 -> "0".
Second comparison: "0" vs "23", minimum of the lengths: 1. 1 -> "1".

Second partition:
First comparison: "" vs "345", minimum of the lengths: 0. 0 -> "0".
Second comparison: "0" vs "", minimum of the lengths: 0. 0 -> "0".

scala> val rdd3 = sc.parallelize(List("12","23","","345"),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:27

scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res26: String = 11

Analysis:
First partition:
First comparison: "" vs "12", minimum of the lengths: 0. 0 -> "0".
Second comparison: "0" vs "23", minimum of the lengths: 1. 1 -> "1".

Second partition:
First comparison: "" vs "", minimum of the lengths: 0. 0 -> "0".
Second comparison: "0" vs "345", minimum of the lengths: 1. 1 -> "1".

3. aggregateByKey(zeroValue)(seqOp,combOp,[numTasks])

A grouping operator, commonly used. Works like aggregate but on <Key, Value> data: the values of each key are aggregated with a within-partition function (seqOp) and a cross-partition function (combOp).

Example:

scala> val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27

scala> def fun3(index:Int,iter:Iterator[(String,Int)]): Iterator[String] = {
     | iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
     | }
fun3: (index: Int, iter: Iterator[(String, Int)])Iterator[String]

scala> pairRDD.mapPartitionsWithIndex(fun3).collect
res27: Array[String] = Array(
[partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], 
[partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

// For each animal, take the maximum count within each partition, then sum those maxima
scala> pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect
res28: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
 
// Walkthrough:
// partition 0 maxima: (cat,5) (mouse,4)
// partition 1 maxima: (cat,12) (dog,12) (mouse,2)
// summed across partitions: (cat,17) (mouse,6) (dog,12)

// Sum the counts of each animal
scala> pairRDD.aggregateByKey(0)(_+_, _+_).collect
res71: Array[(String, Int)] = Array((dog,12), (cat,19), (mouse,6))
    
// This example can also be done with reduceByKey
scala> pairRDD.reduceByKey(_+_).collect
res73: Array[(String, Int)] = Array((dog,12), (cat,19), (mouse,6))

4. Repartitioning: coalesce(numPartitions) and repartition(numPartitions)

Both operators repartition an RDD. (The number of partitions can also be specified when the RDD is created.)

Difference:
(1) coalesce: does not shuffle by default.
(2) repartition: always shuffles the data (redistributes it over the network).

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:27

scala> val rdd2 = rdd1.repartition(3)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[16] at repartition at <console>:29

scala> rdd2.partitions.length
res29: Int = 3

scala> val rdd3 = rdd1.coalesce(3)
rdd3: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[17] at coalesce at <console>:29

scala> rdd3.partitions.length
res30: Int = 2

scala> val rdd3 = rdd1.coalesce(3,true)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at coalesce at <console>:29

scala> rdd3.partitions.length
res31: Int = 3

// The following two lines are equivalent:
val rdd2 = rdd1.repartition(3)
val rdd3 = rdd1.coalesce(3,true) // --> with false (the default when omitted), the RDD still has 2 partitions.
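
The transformation table at the top also lists repartitionAndSortWithinPartitions(partitioner), which fits this section. A minimal sketch, assuming a spark-shell session (the pair data and partition count are illustrative):

// repartitionAndSortWithinPartitions repartitions by the given partitioner and sorts
// the records by key inside each partition during the same shuffle.
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(List((3, "c"), (1, "a"), (4, "d"), (2, "b"), (1, "e")), 2)
val repartitioned = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
repartitioned.glom().collect   // each partition comes back sorted by key
// roughly: Array(Array((2,b), (4,d)), Array((1,a), (1,e), (3,c)))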

5. Other Advanced Operators

http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html