Table of Contents
- Transformation Operators
  - Basic Operators
    - 1. map(func)
    - 2. filter(func)
    - 3. flatMap
    - 4. Set operations (union, intersection, distinct)
    - 5. Grouping (groupByKey, reduceByKey, cogroup)
    - 6. Sorting (sortBy, sortByKey)
  - Advanced Operators
    - 1. mapPartitionsWithIndex(func)
    - 2. aggregate
    - 3. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
    - 4. Repartitioning: coalesce(numPartitions) and repartition(numPartitions)
    - 5. Other advanced operators
Transformation Operators
All transformations on an RDD are lazy: they do not compute their results right away. Instead, they only remember the transformations applied to the underlying dataset (for example, a file). The transformations are actually computed only when an action requires a result to be returned to the Driver. This design lets Spark run more efficiently.
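A minimal sketch of this lazy behaviour (assuming a spark-shell session with the usual SparkContext sc): the map call returns immediately without touching the data, and nothing is computed until the collect action runs.
val nums = sc.parallelize(List(1, 2, 3))
val doubled = nums.map(_ * 2)   // returns instantly: only the lineage is recorded, no job runs yet
doubled.collect                 // the action triggers the actual computation: Array(2, 4, 6)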
| Transformation | Meaning |
| --- | --- |
| map(func) | Returns a new RDD formed by passing each element of the source through the function func |
| filter(func) | Filter. Returns a new RDD formed by selecting those elements of the source on which func returns true |
| flatMap(func) | Flatten. Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element) |
| mapPartitions(func) | Similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]. Operates on each partition of the RDD |
| mapPartitionsWithIndex(func) | Similar to mapPartitions, but func takes an additional integer parameter representing the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]. Operates on each partition and exposes the partition number |
| sample(withReplacement, fraction, seed) | Samples a fraction of the data, with or without replacement, using the given random-number-generator seed |
| union(otherDataset) | Set operation. Returns a new RDD containing the union of the source RDD and the argument RDD |
| intersection(otherDataset) | Set operation. Returns a new RDD containing the intersection of the source RDD and the argument RDD |
| distinct([numTasks]) | Set operation. Returns a new RDD containing the distinct elements of the source RDD |
| groupByKey([numTasks]) | Grouping. Called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs. Relatively low level |
| reduceByKey(func, [numTasks]) | Grouping. Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function. Unlike groupByKey, it combines values locally within each partition before the shuffle. The number of reduce tasks can be set via the optional second argument |
| aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | Grouping. Like aggregate, but applied per key: seqOp aggregates within each partition and combOp merges across partitions. Commonly used |
| sortByKey([ascending], [numTasks]) | Sorting. Called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key |
| sortBy(func, [ascending], [numTasks]) | Sorting. Similar to sortByKey, but more flexible |
| join(otherDataset, [numTasks]) | Called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key |
| cogroup(otherDataset, [numTasks]) | Called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])) |
| cartesian(otherDataset) | Cartesian product |
| pipe(command, [envVars]) | Pipes each partition of the RDD through an external command |
| coalesce(numPartitions) | Reduces the number of partitions; does not shuffle by default |
| repartition(numPartitions) | Repartitions the RDD (always shuffles) |
| repartitionAndSortWithinPartitions(partitioner) | Repartitions the RDD according to the given partitioner and sorts records by key within each partition |
Basic Operators
1. map(func)
Returns a new RDD formed by passing each element of the source through the function func.
// Create an RDD of numbers
scala> val rdd1 = sc.parallelize(List(5,6,7,8,9,1,2,3,100))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd1.partitions.length
res1: Int = 2
// Requirement: multiply each element by 2
scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:26
scala> rdd2.collect
res2: Array[Int] = Array(10, 12, 14, 16, 18, 2, 4, 6, 200)
2. filter(func)
Filter. Returns a new RDD formed by selecting those elements of the source on which func returns true.
// Requirement: keep only the elements greater than 20
scala> val rdd3 = rdd2.filter(_>20)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[17] at filter at <console>:28
scala> rdd3.collect
res5: Array[Int] = Array(200)
3. flatMap
Flatten. Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).
// Create an RDD of strings
scala> val rdd4 = sc.parallelize(Array("a b c","d e f","x y z"))
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> val rdd5 = rdd4.flatMap(_.split(" "))
rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at flatMap at <console>:26
scala> rdd5.collect
res6: Array[String] = Array(a, b, c, d, e, f, x, y, z)
4. Set operations (union, intersection, distinct)
- union(otherDataset) - returns a new RDD containing the union of the source RDD and the argument RDD
- intersection(otherDataset) - returns a new RDD containing the intersection of the source RDD and the argument RDD
- distinct([numTasks]) - returns a new RDD containing the distinct elements of the source RDD
// Set operations and deduplication (two RDDs are needed)
scala> val rdd6 = sc.parallelize(List(5,6,7,8,1,2,3,100))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24
scala> val rdd7 = sc.parallelize(List(1,2,3,4))
rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
scala> val rdd8 = rdd6.union(rdd7)
rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[22] at union at <console>:28
scala> rdd8.collect
res7: Array[Int] = Array(5, 6, 7, 8, 1, 2, 3, 100, 1, 2, 3, 4)
scala> rdd8.distinct.collect
res8: Array[Int] = Array(100, 4, 8, 1, 5, 6, 2, 7, 3)
scala> val rdd9 = rdd6.intersection(rdd7)
rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at intersection at <console>:28
scala> rdd9.collect
res9: Array[Int] = Array(2, 1, 3)
5. Grouping (groupByKey, reduceByKey, cogroup)
- groupByKey([numTasks]) - called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs. Relatively low level
- reduceByKey(func, [numTasks]) - called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function. Unlike groupByKey, it combines values locally within each partition before the shuffle. The number of reduce tasks can be set via the optional second argument
- cogroup(otherDataset, [numTasks]) - called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W]))
// Grouping: reduceByKey, groupByKey
scala> val rdd1 = sc.parallelize(List(("Tom",1000),("Jerry",3000),("Mery",2000)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("Jerry",1000),("Tom",3000),("Mike",2000)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24
scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[34] at union at <console>:28
scala> rdd3.collect
res10: Array[(String, Int)] = Array((Tom,1000), (Jerry,3000), (Mery,2000), (Jerry,1000), (Tom,3000), (Mike,2000))
scala> val rdd4 = rdd3.groupByKey
rdd4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[35] at groupByKey at <console>:30
scala> rdd4.collect
res11: Array[(String, Iterable[Int])] =
Array(
(Tom,CompactBuffer(1000, 3000)),
(Jerry,CompactBuffer(3000, 1000)),
(Mike,CompactBuffer(2000)),
(Mery,CompactBuffer(2000)))
scala> rdd3.reduceByKey(_+_).collect
res12: Array[(String, Int)] = Array((Tom,4000), (Jerry,4000), (Mike,2000), (Mery,2000))
// cogroup
scala> val rdd1 = sc.parallelize(List(("Tom",1),("Tom",2),("Jerry",3),("Kitty",2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[37] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("Jerry",2),("Tom",1),("Andy",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24
scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[40] at cogroup at <console>:28
scala> rdd3.collect
res13: Array[(String, (Iterable[Int], Iterable[Int]))] =
Array((Tom,(CompactBuffer(1, 2),CompactBuffer(1))),
(Jerry,(CompactBuffer(3),CompactBuffer(2))),
(Andy,(CompactBuffer(),CompactBuffer(2))),
(Kitty,(CompactBuffer(2),CompactBuffer())))
// Difference between cogroup and groupByKey:
// Example 1:
groupByKey([numTasks]): called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs.
cogroup(otherDataset, [numTasks]): called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])).
scala> val rdd0 = sc.parallelize(Array((1,1), (1,2) , (1,3) , (2,1) , (2,2) , (2,3)), 3)
rdd0: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd1 = rdd0.groupByKey()
rdd1: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[1] at groupByKey at <console>:25
scala> rdd1.collect
res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1, 2, 3)), (2,CompactBuffer(1, 2, 3)))
scala> val rdd2 = rdd0.cogroup(rdd0)
rdd2: org.apache.spark.rdd.RDD[(Int, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[3] at cogroup at <console>:25
scala> rdd2.collect
res1: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(1, 2, 3),CompactBuffer(1, 2, 3))), (2,(CompactBuffer(1, 2, 3),CompactBuffer(1, 2, 3))))
// Example 2:
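// The RDDs b and c used below are not defined in the original transcript; judging from their
// collect output, they could have been created like this (an assumption for completeness):
val b = sc.parallelize(List((1, "b"), (2, "b"), (1, "b"), (3, "b")))
val c = sc.parallelize(List((1, "c"), (2, "c"), (1, "c"), (3, "c")))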
scala> b.collect
res3: Array[(Int, String)] = Array((1,b), (2,b), (1,b), (3,b))
scala> c.collect
res4: Array[(Int, String)] = Array((1,c), (2,c), (1,c), (3,c))
scala> b.cogroup(c).collect
res2: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b),CompactBuffer(c))))
scala> var rdd4=b union c
rdd4: org.apache.spark.rdd.RDD[(Int, String)] = UnionRDD[9] at union at <console>:27
scala> rdd4.collect
res6: Array[(Int, String)] = Array((1,b), (2,b), (1,b), (3,b), (1,c), (2,c), (1,c), (3,c))
scala> rdd4.groupByKey().collect
res5: Array[(Int, Iterable[String])] = Array((2,CompactBuffer(b, c)), (1,CompactBuffer(b, b, c, c)), (3,CompactBuffer(b, c)))
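The table above also lists join, which is closely related to cogroup but keeps only the keys present on both sides and flattens the values into pairs. A minimal sketch (the RDD names here are new and chosen for illustration; the result shown in the comment is the expected output, ordering may differ):
val employees = sc.parallelize(List(("Tom", 1), ("Tom", 2), ("Jerry", 3), ("Kitty", 2)))
val scores    = sc.parallelize(List(("Jerry", 2), ("Tom", 1), ("Andy", 2)))
// Inner join: only Tom and Jerry appear on both sides, and every (V, W) combination is produced
employees.join(scores).collect
// expected: Array((Tom,(1,1)), (Tom,(2,1)), (Jerry,(3,2)))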
6. Sorting (sortBy, sortByKey)
- sortBy(func, [ascending], [numTasks]) - similar to sortByKey, but more flexible
scala> val rdd2 = rdd1.map(_*2).sortBy(x=>x,true)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at sortBy at <console>:26
scala> rdd2.collect
res3: Array[Int] = Array(2, 4, 6, 10, 12, 14, 16, 18, 200)
scala> val rdd2 = rdd1.map(_*2).sortBy(x=>x,false)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[16] at sortBy at <console>:26
scala> rdd2.collect
res4: Array[Int] = Array(200, 18, 16, 14, 12, 10, 6, 4, 2)
- sortByKey([ascending], [numTasks]) - called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key
// Requirement: sort by value.
// Note: sortByKey sorts by key.
// Approach: swap the key and the value.
// 1. Swap key and value, then call sortByKey
// 2. Swap them back afterwards
scala> val rdd1 = sc.parallelize(List(("Tom",1),("Andy",2),("Jerry",4),("Mike",5)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[42] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("Jerry",1),("Tom",2),("Mike",4),("Kitty",5)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[43] at parallelize at <console>:24
scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[44] at union at <console>:28
scala> rdd3.collect
res14: Array[(String, Int)] = Array((Tom,1), (Andy,2), (Jerry,4), (Mike,5), (Jerry,1), (Tom,2), (Mike,4), (Kitty,5))
scala> val rdd4 = rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[45] at reduceByKey at <console>:30
scala> rdd4.collect
res15: Array[(String, Int)] = Array((Tom,3), (Jerry,5), (Andy,2), (Mike,9), (Kitty,5))
scala> val rdd5 = rdd4.map(t => (t._2,t._1)).sortByKey(false).map(t=>(t._2,t._1))
rdd5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[50] at map at <console>:32
scala> rdd5.collect
res16: Array[(String, Int)] = Array((Mike,9), (Jerry,5), (Kitty,5), (Tom,3), (Andy,2))
// Execution trace:
// rdd4.map(t => (t._2,t._1)) --> (3,Tom), (5,Jerry), (2,Andy), (9,Mike), (5,Kitty)
// .sortByKey(false) --> (9,Mike),(5,Jerry),(5,Kitty),(3,Tom),(2,Andy)
// .map(t=>(t._2,t._1))--> (Mike,9), (Jerry,5), (Kitty,5), (Tom,3), (Andy,2)
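Since sortBy takes an arbitrary key function, the same by-value sort can also be written without the double swap; a minimal sketch using the rdd4 above:
// Sort by the value component directly, in descending order
val rdd5b = rdd4.sortBy(_._2, false)
rdd5b.collect   // expected: Array((Mike,9), (Jerry,5), (Kitty,5), (Tom,3), (Andy,2)); ties may come out in either order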
Advanced Operators
1. mapPartitionsWithIndex(func)
Similar to mapPartitions, but func takes an additional integer parameter representing the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]. Operates on each partition of the RDD and exposes the partition number.
def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U])
- Operates on each partition of the RDD; the partition index is passed in as index
- This operator gives access to the partition number
- Define a function to process each partition
f takes two parameters: the first is the partition number, the second is the iterator over the partition's elements. Iterator[U] is the processed result.
Example:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> def fun1(index:Int,iter:Iterator[Int]) : Iterator[String] = {
| iter.toList.map(x => "[PartId: " + index + " , value = " + x + " ]").iterator
| }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]
scala> rdd1.mapPartitionsWithIndex(fun1).collect
res3: Array[String] = Array(
[PartId: 0 , value = 1 ],
[PartId: 0 , value = 2 ],
[PartId: 0 , value = 3 ],
[PartId: 1 , value = 4 ],
[PartId: 1 , value = 5 ],
[PartId: 1 , value = 6 ],
[PartId: 2 , value = 7 ],
[PartId: 2 , value = 8 ],
[PartId: 2 , value = 9 ])
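The plain mapPartitions listed in the table works the same way but without the partition index: its function sees a whole partition at once. A minimal sketch, assuming the same rdd1 (the numbers 1 to 9 in 3 partitions) as above:
// Produce one element per partition: the sum of that partition
val partSums = rdd1.mapPartitions(iter => Iterator(iter.sum))
partSums.collect   // expected: Array(6, 15, 24), given the partition layout shown above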
2. aggregate
Aggregates the data within each partition first (local aggregation), then aggregates the partial results across partitions (global aggregation).
Examples:
(1) Take the maximum within each partition, then sum the maxima globally
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> def fun1(index:Int,iter:Iterator[Int]) : Iterator[String] = {
| iter.toList.map( x => "[PartId: " + index + " , value = " + x + "]").iterator
| }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]
scala> rdd1.mapPartitionsWithIndex(fun1).collect
res0: Array[String] = Array([PartId: 0 , value = 1], [PartId: 0 , value = 2], [PartId: 1 , value = 3], [PartId: 1 , value = 4], [PartId: 1 , value = 5])
scala> import scala.math._
import scala.math._
// Take the maximum locally in each partition, then add globally: 2 + 5 = 7
scala> rdd1.aggregate(0)(max(_,_),_+_)
res2: Int = 7
// Change the initial value to 10: 10 + 10 + 10 = 30
scala> rdd1.aggregate(10)(max(_,_),_+_)
res3: Int = 30
// Analysis:
// The initial value is 10, which effectively adds a 10 to every partition.
// In the local step, the maximum of each partition therefore becomes 10.
// The initial value also takes part in the global step, so 10 + 10 + 10 = 30.
(2) Sum within each partition first, then sum globally
Summing the elements of an RDD:
Option 1: RDD.map
Option 2: use an aggregation operator
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> def fun1(index:Int,iter:Iterator[Int]) : Iterator[String] = {
| iter.toList.map( x => "[PartId: " + index + " , value = " + x + "]").iterator
| }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]
scala> rdd1.mapPartitionsWithIndex(fun1).collect
res0: Array[String] = Array([PartId: 0 , value = 1], [PartId: 0 , value = 2], [PartId: 1 , value = 3], [PartId: 1 , value = 4], [PartId: 1 , value = 5])
// Sum locally first, then sum globally: 3 + 12 = 15
scala> rdd1.aggregate(0)(_+_,_+_)
res5: Int = 15
// 10+(10+1+2)+(10+3+4+5) = 10+13+22 = 45
scala> rdd1.aggregate(10)(_+_,_+_)
res6: Int = 45
(3) String concatenation
scala> val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:27
scala> def fun1(index:Int,iter:Iterator[String]) : Iterator[String] = {
| iter.toList.map( x => "[PartId: " + index + " , value = " + x + "]").iterator
| }
fun1: (index: Int, iter: Iterator[String])Iterator[String]
scala> rdd2.mapPartitionsWithIndex(fun1).collect
res8: Array[String] = Array(
[PartId: 0 , value = a], [PartId: 0 , value = b], [PartId: 0 , value = c],
[PartId: 1 , value = d], [PartId: 1 , value = e], [PartId: 1 , value = f])
scala> rdd2.aggregate("")(_+_,_+_)
res10: String = abcdef
scala> rdd2.aggregate("")(_+_,_+_)
res11: String = defabc
// Note: whichever partition finishes first comes first in the result, so the order is not deterministic
scala> rdd2.aggregate("*")(_+_,_+_)
res13: String = **def*abc
scala> rdd2.aggregate("*")(_+_,_+_)
res14: String = **abc*def
(4) A more involved example
scala> val rdd3 = sc.parallelize(List("12","23","345","4567"),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:27
scala> rdd3.mapPartitionsWithIndex(fun1)
res15: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at mapPartitionsWithIndex at <console>:32
scala> rdd3.mapPartitionsWithIndex(fun1).collect
res16: Array[String] = Array(
[PartId: 0 , value = 12], [PartId: 0 , value = 23],
[PartId: 1 , value = 345], [PartId: 1 , value = 4567])
scala> rdd3.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y)=>x+y)
res20: String = 24
scala> rdd3.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y)=>x+y)
res21: String = 42
Analysis:
First partition:
First comparison: "" vs "12", maximum of the lengths is 2; 2.toString -> "2".
Second comparison: "2" vs "23", maximum of the lengths is 2; 2.toString -> "2".
Second partition:
First comparison: "" vs "345", maximum of the lengths is 3; 3.toString -> "3".
Second comparison: "3" vs "4567", maximum of the lengths is 4; 4.toString -> "4".
scala> val rdd3 = sc.parallelize(List("12","23","345",""),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[7] at parallelize at <console>:27
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res22: String = 01
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res23: String = 01
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res24: String = 01
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res25: String = 10
Analysis:
Note the initial value "", whose length is 0; 0.toString becomes the string "0", and the length of "0" is 1.
First partition:
First comparison: "" vs "12", minimum of the lengths is 0; 0.toString -> "0".
Second comparison: "0" vs "23", minimum of the lengths is 1; 1.toString -> "1".
Second partition:
First comparison: "" vs "345", minimum of the lengths is 0; 0.toString -> "0".
Second comparison: "0" vs "", minimum of the lengths is 0; 0.toString -> "0".
scala> val rdd3 = sc.parallelize(List("12","23","","345"),2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:27
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y)=>x+y)
res26: String = 11
Analysis:
First partition:
First comparison: "" vs "12", minimum of the lengths is 0; 0.toString -> "0".
Second comparison: "0" vs "23", minimum of the lengths is 1; 1.toString -> "1".
Second partition:
First comparison: "" vs "", minimum of the lengths is 0; 0.toString -> "0".
Second comparison: "0" vs "345", minimum of the lengths is 1; 1.toString -> "1".
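The per-partition behaviour in these traces can be reproduced locally with an ordinary foldLeft, since seqOp is folded over each partition starting from the zero value; a plain-Scala sketch (no Spark involved):
val seqOp = (x: String, y: String) => math.min(x.length, y.length).toString
// partition 0 holds "12" and "23"; folding from the zero value "" gives "0", then "1"
List("12", "23").foldLeft("")(seqOp)    // "1"
// partition 1 holds "" and "345"; folding gives "0", then "1"
List("", "345").foldLeft("")(seqOp)     // "1"
// combOp then concatenates the per-partition results, hence "11"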
3. aggregateByKey(zeroValue)(seqOp,combOp,[numTasks])
Grouping operation, commonly used. Similar to aggregate, but works on (Key, Value) pair RDDs: seqOp aggregates the values of each key within a partition, and combOp merges the partial results across partitions.
Example:
scala> val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27
scala> def fun3(index:Int,iter:Iterator[(String,Int)]): Iterator[String] = {
| iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
| }
fun3: (index: Int, iter: Iterator[(String, Int)])Iterator[String]
scala> pairRDD.mapPartitionsWithIndex(fun3).collect
res27: Array[String] = Array(
[partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)],
[partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
// Sum, over all partitions, the largest count of each animal within a partition
scala> pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect
res28: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
// Trace:
// partition 0: (cat,5) (mouse,4)
// partition 1: (cat,12) (dog,12) (mouse,2)
// combined: (cat,17) (mouse,6) (dog,12)
// Sum the total count of each animal
scala> pairRDD.aggregateByKey(0)(_+_, _+_).collect
res71: Array[(String, Int)] = Array((dog,12), (cat,19), (mouse,6))
// The same result can also be obtained with reduceByKey:
scala> pairRDD.reduceByKey(_+_).collect
res73: Array[(String, Int)] = Array((dog,12), (cat,19), (mouse,6))
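Unlike aggregate, the zeroValue here is applied once per key within each partition, not once per partition overall. A sketch illustrating this with the same pairRDD (the output in the comment is the expected result, reasoned out the same way as the aggregate(10) analysis above):
// zeroValue 100 takes part in the max for each key in each partition, so every per-partition
// maximum becomes 100, and combOp (_+_) then adds the per-partition results together
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
// expected: cat -> 100 + 100 = 200, mouse -> 100 + 100 = 200, dog -> 100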
4. Repartitioning: coalesce(numPartitions) and repartition(numPartitions)
Both operators repartition an RDD (the number of partitions can also be set when the RDD is created).
The difference:
(1) coalesce: does not shuffle by default.
(2) repartition: always shuffles the data (redistributes it over the network).
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:27
scala> val rdd2 = rdd1.repartition(3)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[16] at repartition at <console>:29
scala> rdd2.partitions.length
res29: Int = 3
scala> val rdd3 = rdd1.coalesce(3)
rdd3: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[17] at coalesce at <console>:29
scala> rdd3.partitions.length
res30: Int = 2
scala> val rdd3 = rdd1.coalesce(3,true)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at coalesce at <console>:29
scala> rdd3.partitions.length
res31: Int = 3
// The following two lines are equivalent:
val rdd2 = rdd1.repartition(3)
val rdd3 = rdd1.coalesce(3,true) // --> with false (the default when omitted), the RDD still has 2 partitions
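One way to see the difference is to inspect the partition contents with mapPartitionsWithIndex (a sketch, reusing the fun1 helper defined in the mapPartitionsWithIndex section):
// Without shuffle, the elements stay in the original 2 partitions:
rdd1.coalesce(3).mapPartitionsWithIndex(fun1).collect
// With shuffle = true (or repartition(3)), the elements are redistributed across 3 partitions:
rdd1.coalesce(3, true).mapPartitionsWithIndex(fun1).collect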
5. Other advanced operators
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html