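Preface

In the previous chapters we covered installing Spark and the basic operations of the Spark shell. This chapter focuses on Spark's basic operators.

For an authoritative introduction to Spark, see http://spark.apache.org/docs/latest .
This article adds some of my own interpretation on top of it.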


Fundamentals

RDD

RDD is short for Resilient Distributed Dataset. The Spark source code contains the following comment on the RDD class.

/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,
 * [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
 * pairs, such as `groupByKey` and `join`;
 * [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
 * Doubles; and
 * [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
 * can be saved as SequenceFiles.
 * All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
 * through implicit.
 *
 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 *
 * All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
 * to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
 * reading data from a new storage system) by overriding these functions. Please refer to the
 * <a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf">Spark paper</a>
 * for more details on RDD internals.
 */

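In short, an RDD can be summarized as follows:

* A list of partitions;
* A function that computes each partition (the function is applied to every partition);
* Dependencies between one RDD and other RDDs;
* Optionally, a partitioner for key-value RDDs (shuffle operations use HashPartitioner by default);
* Optionally, a list of preferred locations for computing each partition (data locality: move the computation to the data rather than the data to the computation).
* Note: a partition always lives on a single machine, while one machine may hold several partitions.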
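
The properties listed above can be inspected directly on an RDD. The following is only a sketch, meant to be pasted into spark-shell; the app name, the `local[2]` master and the sample data are my own assumptions, and `SparkContext.getOrCreate` simply reuses the shell's context when one already exists.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-properties").setMaster("local[2]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 3)
val summed = pairs.reduceByKey(new HashPartitioner(2), _ + _)

// 1. A list of partitions: the shuffle output has as many as the partitioner asks for.
val numPartitions = summed.partitions.length
// 2. The compute function is internal (RDD.compute); it is invoked by tasks, not by user code.
// 3. A list of dependencies on parent RDDs: here a single shuffle dependency.
val deps = summed.dependencies
val isShuffle = deps.head.isInstanceOf[ShuffleDependency[_, _, _]]
// 4. An optional partitioner: absent on the parallelized collection, present after the shuffle.
val partitionerBefore = pairs.partitioner
val partitionerAfter = summed.partitioner
// 5. Optional preferred locations: empty for an in-memory collection, block hosts for HDFS files.
val locations = pairs.preferredLocations(pairs.partitions(0))
```

Note that `pairs.partitioner` is `None`: a key-value RDD only gets a partitioner once an operation such as `reduceByKey` or `partitionBy` assigns one.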
Transformation & Action

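Spark operators fall into two kinds: Transformation and Action.

  • A Transformation is lazy: it is not executed immediately. It only runs once an Action is encountered. Applying a transformation merely records metadata (the lineage) about the operation.
  • An Action triggers a job: the driver schedules it, the tasks run on the executors, and the result is returned to the driver or written to external storage.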
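
To make the lazy behavior visible, here is a small sketch that counts how often the function passed to `map` is really called. The helper name `lazyDemo`, the app name and the `local[2]` master are assumptions of mine; the accumulator is created inside the function so that the closure only captures a local value.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Returns (calls seen before the action, result of the action, calls seen after it).
def lazyDemo(sc: SparkContext): (Long, Int, Long) = {
  val calls = sc.longAccumulator("map calls")
  // Transformation: only the lineage is recorded, the function does not run yet.
  val mapped = sc.parallelize(1 to 7, 2).map { x => calls.add(1); x * 10 }
  val before = calls.value.longValue()
  // Action: the job runs on the executors and the sum comes back to the driver.
  val total = mapped.reduce(_ + _)
  val after = calls.value.longValue()
  (before, total, after)
}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-demo").setMaster("local[2]"))
println(lazyDemo(sc))
```

On a local run without task retries this should print `(0,280,7)`. Accumulator updates made inside a transformation can be applied more than once if a task is re-executed, so treat the counter as a debugging aid only.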

Basic operators

Spark has roughly 80 operators, of which about 20 are in common use.

Basic operators (Transformation)

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new distributed dataset formed by passing each element of the source through a function func. |
| filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). |
| mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type `Iterator<T> => Iterator<U>` when running on an RDD of type T. |
| mapPartitionsWithIndex(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type `(Int, Iterator<T>) => Iterator<U>` when running on an RDD of type T. |
| sample(withReplacement, fraction, seed) | Sample the given fraction of the data, with or without replacement, using a given random number generator seed. |
| union(otherDataset) | Return a new dataset that contains the union of the elements in the source dataset and the argument. |
| intersection(otherDataset) | Return a new RDD that contains the intersection of elements in the source dataset and the argument. |
| distinct([numPartitions]) | Return a new dataset that contains the distinct elements of the source dataset. |
| groupByKey([numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of `(K, Iterable<V>)` pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks. |
| reduceByKey(func, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. |
| aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral “zero” value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. |
| sortByKey([ascending], [numPartitions]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument. |
| join(otherDataset, [numPartitions]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin. |
| cogroup(otherDataset, [numPartitions]) | When called on datasets of type (K, V) and (K, W), returns a dataset of `(K, (Iterable<V>, Iterable<W>))` tuples. This operation is also called groupWith. |
| cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). |
| pipe(command, [envVars]) | Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process’s stdin and lines output to its stdout are returned as an RDD of strings. |
| coalesce(numPartitions) | Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. |
| repartition(numPartitions) | Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network. |
| repartitionAndSortWithinPartitions(partitioner) | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery. |
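
Of the transformations described above, mapPartitions and mapPartitionsWithIndex are the ones whose function signature tends to confuse people, so here is a minimal sketch. The sample data, the app name and the `local[2]` master are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("map-partitions").setMaster("local[2]"))

val rdd = sc.parallelize(1 to 7, 3)

// mapPartitions: func runs once per partition and receives an iterator over its elements.
// Here each partition is reduced to a single number, its local sum.
val partitionSums = rdd.mapPartitions(it => Iterator(it.sum)).collect()

// mapPartitionsWithIndex: the same, but func also receives the partition index,
// which makes it easy to see where each element lives.
val tagged = rdd
  .mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x)))
  .collect()

println(partitionSums.mkString(", "))
println(tagged.mkString(", "))
```

mapPartitions is mostly worth it when there is a per-partition setup cost (for example opening one connection per partition instead of one per element).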

Basic operators (Action)

| Action | Meaning |
| --- | --- |
| reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. |
| collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. |
| count() | Return the number of elements in the dataset. |
| first() | Return the first element of the dataset (similar to take(1)). |
| take(n) | Return an array with the first n elements of the dataset. |
| takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed. |
| takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator. |
| saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. |
| saveAsSequenceFile(path) | (Java and Scala) Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). |
| saveAsObjectFile(path) | (Java and Scala) Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile(). |
| countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. |
| foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details. |
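
The note on foreach is easier to remember with an example. The sketch below sums the elements through a LongAccumulator, which is the supported way to get a side effect back to the driver; the helper name `sumWithAccumulator`, the app name and the `local[2]` master are my own assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sums the elements inside foreach via an accumulator and reads the result on the driver.
def sumWithAccumulator(sc: SparkContext): Long = {
  val acc = sc.longAccumulator("sum")
  // foreach runs on the executors; each task adds to its own copy of the accumulator,
  // and Spark merges the copies back on the driver when the tasks finish.
  sc.parallelize(1 to 5, 2).foreach(x => acc.add(x))
  acc.value.longValue()
}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("foreach-accumulator").setMaster("local[2]"))
println(sumWithAccumulator(sc))
```

Replacing the accumulator with a plain `var` that is incremented inside foreach is exactly the case the note warns about: every task works on its own serialized copy of the closure, so the driver's variable is not reliably updated.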


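Basic operations (Transformation)

Creating RDDs
  • Create an RDD from a file system supported by Hadoop (such as HDFS). The RDD does not hold the data to be computed; it only records metadata;
  • Create an RDD by parallelizing a Scala collection or array.
  • Creating an RDD (text file)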
scala> sc.textFile("hdfs://localhost:9000/wordcount/input").collect
res23: Array[String] = Array(hello 2019, cat, pitty, kitty, able, pitty, cat)
  • Creating an RDD (Scala collection)
scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[87] at parallelize at <console>:24

scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7)).collect
rdd1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7)

scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7).toList)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[89] at parallelize at <console>:24

scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7).toList).collect
rdd1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7)

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7)).collect
rdd1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7)
  • Creating an RDD (JavaRDD)
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FirstRddDemo {
	// Entry point
	public static void main(String[] args) {
		// Use "spark://localhost:7077" instead when the master runs on this machine.
		String masterUrl = "spark://192.168.100.71:7077";

		// The first step is to get the Java Spark context.
		SparkConf conf = new SparkConf().setAppName("Hello World!").setMaster(masterUrl);
		// JavaSparkContext is Closeable, so try-with-resources stops it on exit.
		try (JavaSparkContext sc = new JavaSparkContext(conf)) {
			List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
			JavaRDD<Integer> distData = sc.parallelize(data);

			// reduce is an action: it runs the job and returns the sum to the driver.
			int sum = distData.reduce((a, b) -> a + b);
			System.out.println("sum = " + sum);
		}
	}
}
  • map()
scala>  val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

# Nothing is computed yet (the operation is only recorded)
scala> val rdd2  = rdd1.map(_*10)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26
  • filter()
scala>  val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd2  = rdd1.map(_*10)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

# Nothing is computed yet (the operation is only recorded)
scala> val rdd3 = rdd2.filter(_ > 50)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:28
Partitions and specifying the number of partitions
  • partitions
scala>  val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

# Default number of partitions
scala> rdd1.partitions.length
res0: Int = 4

# Specify the number of partitions
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7),5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd1.partitions.length
res1: Int = 5
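
The partition count can also be changed after an RDD has been created. The following sketch contrasts coalesce and repartition and shows where the default count comes from; the sample data, the app name and the `local[2]` master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partition-count").setMaster("local[2]"))

// Without an explicit count, parallelize uses sc.defaultParallelism.
val defaultCount = sc.parallelize(1 to 100).getNumPartitions

val rdd = sc.parallelize(1 to 100, 8)
// coalesce merges existing partitions without a shuffle, so it can only reduce the count.
val fewer = rdd.coalesce(2)
// Asking coalesce for more partitions (without shuffle = true) leaves the count unchanged.
val unchanged = rdd.coalesce(16)
// repartition always shuffles and can go in either direction.
val more = rdd.repartition(16)

println(Seq(fewer, unchanged, more).map(_.getNumPartitions))
```

This should print `List(2, 8, 16)`. The default of 4 seen in the transcript above most likely reflects the number of cores available to that shell.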
Sorting
  • sortBy
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,10,9,8),5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at <console>:24

# Sort in numeric order
scala> rdd1.map(_*10).sortBy( x => x, true).collect
res8: Array[Int] = Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

# Sort by the string form of each element (lexicographic order); the elements themselves stay Int
scala> rdd1.map(_*10).sortBy( x => x+" ", true).collect
res9: Array[Int] = Array(10, 100, 20, 30, 40, 50, 60, 70, 80, 90)
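
The key function and the direction are the two knobs of sortBy. A short sketch of both, using the same numbers as the transcript; the app name and the `local[2]` master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sort-by").setMaster("local[2]"))

val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 10, 9, 8), 5).map(_ * 10)

// The first argument extracts the sort key; the elements themselves are not changed.
val numeric = rdd.sortBy(x => x).collect()
// A String key gives lexicographic order: "100" sorts before "20".
val lexical = rdd.sortBy(x => x.toString).collect()
// The second argument switches to descending order.
val descending = rdd.sortBy(x => x, ascending = false).collect()

println(lexical.mkString(", "))
```

The last line should print `10, 100, 20, 30, 40, 50, 60, 70, 80, 90`, matching the lexicographic result shown above.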
Filtering
  • filter
scala> rdd1.map(_*10).sortBy( x => x+" ", true).filter(_ > 10).collect
res10: Array[Int] = Array(100, 20, 30, 40, 50, 60, 70, 80, 90)
flatMap
  • flatMap()
scala> val rdd4 = sc.parallelize(Array("a b c", "d e f","g e f"))
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[42] at parallelize at <console>:24

scala> rdd4.flatMap(_.split(" ")).collect
res11: Array[String] = Array(a, b, c, d, e, f, g, e, f)

# Nested flatMap
scala> val rdd5 = sc.parallelize(List(List("a b c"), List("d e f"),List("g e f")))
rdd5: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[44] at parallelize at <console>:24

scala> rdd5.flatMap(_.flatMap(_.split(" "))).collect
res12: Array[String] = Array(a, b, c, d, e, f, g, e, f)
Intersection, union & difference
  • union
scala> val rdd7 = sc.parallelize(List(1,2,3,4,5))
rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at parallelize at <console>:24

scala>

scala> val rdd8 = sc.parallelize(List(2,3,4,5,6,7))
rdd8: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at parallelize at <console>:24

# Union (duplicates are kept, because union does not deduplicate)
scala> rdd7.union(rdd8).collect
res13: Array[Int] = Array(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7)


# Intersection
scala> rdd7.intersection(rdd8)
res14: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[54] at intersection at <console>:29

scala> rdd7.intersection(rdd8).collect
res15: Array[Int] = Array(4, 5, 2, 3)
  • join() & leftOuterJoin() & rightOuterJoin
# join: no matching keys
scala> val rdd9 = sc.parallelize(List(("a",1),("b",2),("c",3)))
rdd9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[61] at parallelize at <console>:24
scala> val rdd10 = sc.parallelize(List(("d",4),("e",5),("f",6)))
rdd10: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[62] at parallelize at <console>:24

scala> rdd9.join(rdd10).collect
res17: Array[(String, (Int, Int))] = Array()

# join: one matching pair
scala> val rdd9 = sc.parallelize(List(("a",1),("b",2),("c",3)))
rdd9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[61] at parallelize at <console>:24
scala> val rdd10 = sc.parallelize(List(("a",1),("d",4),("e",5),("f",6)))
rdd10: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[69] at parallelize at <console>:24

scala> rdd9.join(rdd10).collect
res18: Array[(String, (Int, Int))] = Array((a,(1,1)))

# join: two matching pairs for the same key
scala> val rdd9 = sc.parallelize(List(("a",1),("b",2),("c",3)))
rdd9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[61] at parallelize at <console>:24
scala> val rdd10 = sc.parallelize(List(("a",11),("a",111),("d",4),("e",5),("f",6)))
rdd10: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[73] at parallelize at <console>:24

scala> rdd9.join(rdd10).collect
res19: Array[(String, (Int, Int))] = Array((a,(1,11)), (a,(1,111)))

# leftOuterJoin
scala> rdd9.leftOuterJoin(rdd10).collect
res21: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(11))), (a,(1,Some(111))), (b,(2,None)), (c,(3,None)))

# rightOuterJoin
scala> rdd9.rightOuterJoin(rdd10).collect
res24: Array[(String, (Option[Int], Int))] = Array((d,(None,4)), (e,(None,5)), (a,(Some(1),11)), (a,(Some(1),111)), (f,(None,6)))
groupByKey
scala> val rdd3 = rdd9 union rdd10
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[86] at union at <console>:28

scala> rdd3.collect
res25: Array[(String, Int)] = Array((a,1), (b,2), (c,3), (a,11), (a,111), (d,4), (e,5), (f,6))
scala> rdd3.groupByKey().collect

res27: Array[(String, Iterable[Int])] = Array((a,CompactBuffer(1, 11, 111)), (b,CompactBuffer(2)), (c,CompactBuffer(3)), (d,CompactBuffer(4)), (e,CompactBuffer(5)), (f,CompactBuffer(6)))
# Sum of the values for each key

scala> rdd3.groupByKey().map(x => (x._1, x._2.sum )).collect
res30: Array[(String, Int)] = Array((a,123), (b,2), (c,3), (d,4), (e,5), (f,6))

# mapValues
scala> rdd3.groupByKey().mapValues(_.sum).collect
res33: Array[(String, Int)] = Array((a,123), (b,2), (c,3), (d,4), (e,5), (f,6))
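
As the operator table notes, grouping only to aggregate is better done with reduceByKey or aggregateByKey, because values are combined inside each partition before the shuffle. Below is a sketch on the same data as rdd3; the per-key average, the app name and the `local[2]` master are my own additions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregate-by-key").setMaster("local[2]"))

val data = sc.parallelize(Seq(
  ("a", 1), ("b", 2), ("c", 3), ("a", 11), ("a", 111), ("d", 4), ("e", 5), ("f", 6)), 2)

// Same result as groupByKey().mapValues(_.sum), without materializing every group.
val sums = data.reduceByKey(_ + _).collect().toMap

// aggregateByKey lets the result type differ from the value type: here (sum, count) per key.
val sumAndCount = data.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into the partition-local result
  (x, y) => (x._1 + y._1, x._2 + y._2))   // combOp: merge results from different partitions
val averages = sumAndCount.mapValues { case (s, n) => s.toDouble / n }.collect().toMap

println(sums("a"))
println(averages("a"))
```

For key a this should print 123 and 41.0.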
WordCount(groupByKey&reduceByKey)
scala> sc.textFile("hdfs://localhost:9000/wordcount/input").flatMap(_.split(" ")).map((_,1)).reduceByKey((x,y) => (x+y)).collect
res20: Array[(String, Int)] = Array((hello,1), (pitty,2), (able,1), (2019,1), (cat,2), (kitty,1))

scala> sc.textFile("hdfs://localhost:9000/wordcount/input").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res21: Array[(String, Int)] = Array((hello,1), (pitty,2), (able,1), (2019,1), (cat,2), (kitty,1))

scala> sc.textFile("hdfs://localhost:9000/wordcount/input").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_,1).collect
res22: Array[(String, Int)] = Array((hello,1), (pitty,2), (able,1), (2019,1), (kitty,1), (cat,2))


scala> sc.textFile("hdfs://localhost:9000/wordcount/input").flatMap(_.split(" ")).map((_,1)).groupByKey.map(t=>(t._1,t._2.sum)).collect
res42: Array[(String, Int)] = Array((hello,1), (pitty,2), (able,1), (2019,1), (cat,2), (kitty,1))

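---------------------
Author: 在风中的意志
Source: CSDN
Original post:
Copyright notice: this is an original article by the author; please include a link to the post when reposting.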
cogroup
scala> rdd10.collect
res35: Array[(String, Int)] = Array((a,11), (a,111), (d,4), (e,5), (f,6))

scala> rdd9.cogroup(rdd10)
res36: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[94] at cogroup at <console>:29

scala> rdd9.cogroup(rdd10).collect
res37: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((d,(CompactBuffer(),CompactBuffer(4))), (e,(CompactBuffer(),CompactBuffer(5))), (a,(CompactBuffer(1),CompactBuffer(11, 111))), (b,(CompactBuffer(2),CompactBuffer())), (f,(CompactBuffer(),CompactBuffer(6))), (c,(CompactBuffer(3),CompactBuffer())))

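# (a,(CompactBuffer(1),CompactBuffer(11, 111)))
# For key a, the first RDD contributes 1 and the second contributes 11 and 111
# Values are first grouped by key within each RDD, then the groups of the two RDDs are paired up by key
# More general than join, but it materializes every group (join itself is implemented on top of cogroup)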
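
To illustrate the remark that join builds on cogroup, the sketch below rebuilds an inner join by hand: cogroup first, then the cross product of the two value groups for each key. It mirrors the idea rather than quoting Spark's source; the app name and the `local[2]` master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("join-via-cogroup").setMaster("local[2]"))

val left = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("a", 11), ("a", 111), ("d", 4), ("e", 5), ("f", 6)))

// For every key, pair each left value with each right value.
// Keys missing on either side produce an empty cross product and therefore drop out.
val joinedViaCogroup = left.cogroup(right).flatMapValues { case (vs, ws) =>
  for (v <- vs; w <- ws) yield (v, w)
}

val manual = joinedViaCogroup.collect().toSet
val builtIn = left.join(right).collect().toSet
println(manual == builtIn)
```

The outer joins follow the same pattern, except that an empty group on one side is replaced by None instead of being dropped.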
Cartesian product - cartesian
scala> val rdd9 = sc.parallelize(List(("a",1),("b",2),("c",3)))
rdd9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[61] at parallelize at <console>:24
scala> val rdd10 = sc.parallelize(List(("a",11),("a",111),("d",4),("e",5),("f",6)))
rdd10: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[73] at parallelize at <console>:24

scala> rdd9.cartesian(rdd10)
res38: org.apache.spark.rdd.RDD[((String, Int), (String, Int))] = CartesianRDD[97] at cartesian at <console>:29

scala> rdd9.cartesian(rdd10).collect
res39: Array[((String, Int), (String, Int))] = Array(((a,1),(a,11)), ((a,1),(a,111)), ((a,1),(d,4)), ((a,1),(e,5)), ((a,1),(f,6)), ((b,2),(a,11)), ((b,2),(a,111)), ((b,2),(d,4)), ((b,2),(e,5)), ((b,2),(f,6)), ((c,3),(a,11)), ((c,3),(a,111)), ((c,3),(d,4)), ((c,3),(e,5)), ((c,3),(f,6)))

Basic operations (Action)

  • collect
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd1.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)
  • reduce
scala> rdd1.reduce(_+_)
res1: Int = 15
  • count
scala> rdd1.count
res2: Long = 5
  • top
# Sort and take the largest elements
scala> rdd1.top(2)
res3: Array[Int] = Array(5, 4)
  • take
# Take the first 2 elements
scala> rdd1.take(2)
res4: Array[Int] = Array(1, 2)
  • first
scala> rdd1.first
res5: Int = 1
  • takeOrdered
# A custom ordering can be supplied (ascending by default)
scala> rdd1.takeOrdered(3)
res6: Array[Int] = Array(1, 2, 3)
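
The ordering is passed to takeOrdered as a second argument list. A minimal sketch, with the app name and the `local[2]` master as assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup: reuse the running context, or start a local one with 2 threads.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("take-ordered").setMaster("local[2]"))

val rdd = sc.parallelize(List(1, 2, 3, 4, 5), 2)

// Natural (ascending) order: the three smallest elements.
val smallest = rdd.takeOrdered(3)
// Reversed ordering: the three largest elements, which is what top(3) returns as well.
val largest = rdd.takeOrdered(3)(Ordering[Int].reverse)
val viaTop = rdd.top(3)

println(smallest.mkString(", "))
println(largest.mkString(", "))
```

This should print `1, 2, 3` followed by `5, 4, 3`.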

Reference(Spark)

(Official)RDD Programming Guide

