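These are study notes based on the official Spark documentation; the source is linked at the end. Some passages are quoted directly from the guide rather than paraphrased. This is a first draft. All sample code is in Python.

Working through the guide covers what an RDD is and how RDDs are created; the operations available on RDDs, when they trigger a shuffle, and how a shuffle affects performance. It also reinforces the distributed programming model, chiefly how a distributed environment differs from a single machine (the closures section) and how broadcast variables and accumulators address that difference. Finally, it covers keeping datasets resident in memory and the basic steps for creating a Spark application.

Three terms recur throughout: function, method, and operation. A function is code written by the user; a method is something Spark itself provides, such as value for reading a broadcast variable; an operation is something performed on an RDD.

Taken as a whole, the guide comes down to this: create an RDD, run operations on it, then tune performance.

The terms task, stage, driver, and executor used below are introduced in other parts of the official documentation. Get familiar with them before reading this article.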

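1. Basic concepts

Every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster.

(1) Spark's main abstraction is the RDD (resilient distributed dataset): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

An RDD is first partitioned, and the partitions are then distributed to the nodes of the cluster for processing. How an RDD gets partitioned is the key point here.

RDDs are created either from a file in the Hadoop file system (or any other Hadoop-supported file system) or from an existing collection in the driver program, and are then transformed.

Users can persist an RDD in memory so that it can be reused efficiently later.

Finally, when a node holding part of an RDD fails, the RDD is recovered automatically.

(2) Spark's second main abstraction is shared variables, which can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.

Sometimes, however, a variable needs to be shared across all tasks, or between tasks and the driver program.

Spark supports two types of shared variables: broadcast variables, which cache a value on all nodes, and accumulators, which can only be added to.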


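2. Linking with Spark

(1) Version: Spark 1.5.2 works with Python 2.6+ or Python 3.4+. It uses the standard CPython interpreter and also works with PyPy 2.3+.

(2) Running: use the bin/spark-submit script, which loads Spark's Java/Scala libraries and submits the application to a cluster. You can also use bin/pyspark to launch an interactive Python shell.

(3) Data sources: to access data in HDFS, you need a build of PySpark that links to your version of HDFS.

(4) Spark classes: import the Spark classes your program needs, for example: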


   

from pyspark import SparkContext, SparkConf



PySpark requires the same minor version of Python in both the driver and the workers. It uses the default python found in PATH; you can choose a different one by setting PYSPARK_PYTHON:


        $ PYSPARK_PYTHON=python3.4 bin/pyspark


        $ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py


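3. Initializing Spark

A Spark program must first create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext, first build a SparkConf object that holds information about your application.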


 

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)



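The appName parameter is the name of your application, shown in the cluster UI. master is a Spark, Mesos, or YARN cluster URL, or the special string "local" to run in local mode. When running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.

4. Using the shell

In the PySpark shell, a special SparkContext is already created for you in the variable sc. Use the --py-files argument to add Python .zip, .egg, or .py files to the runtime path, where the shell session can import them. For example: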


$ ./bin/pyspark --master local[4] --py-files code.py


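5. RDDs in detail

(1) Parallelized Collections

Parallelized collections are created by calling SparkContext's parallelize method on an existing iterable or collection in the driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

For example, to create a parallelized collection holding the numbers 1 to 5: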


 

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)



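Once created, the distributed dataset (distData) can be operated on in parallel. For example, distData.reduce(lambda a, b: a + b) adds up all the elements.

An important parameter for parallel collections is the number of partitions to cut the dataset into. Spark runs one task for each partition.

Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark sets the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize, for example sc.parallelize(data, 10).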
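To make the partition count concrete, here is a plain-Python sketch (no Spark required) of what a call like sc.parallelize(data, 3).reduce(...) conceptually does: cut the data into slices, reduce each slice as its own task, then combine the partial results. The helper names split_into_partitions and parallel_reduce are made up for this illustration, and the slicing rule is an assumption rather than Spark's documented behavior.

```python
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Cut a list into num_partitions contiguous slices of near-equal size."""
    n = len(data)
    return [data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
            for i in range(num_partitions)]


def parallel_reduce(data, func, num_partitions):
    """Reduce each non-empty partition on its own, then combine the partials."""
    partitions = split_into_partitions(data, num_partitions)
    # One "task" per partition: each produces a single partial result.
    partials = [reduce(func, part) for part in partitions if part]
    # The driver combines the partial results into the final answer.
    return reduce(func, partials)


data = [1, 2, 3, 4, 5]
partitions = split_into_partitions(data, 3)
total = parallel_reduce(data, lambda a, b: a + b, 3)
print(partitions, total)
```

Because the partial results are combined in a second step, the function passed to reduce has to be commutative and associative for the answer to be independent of how the data was sliced.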


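(2) External Datasets

PySpark can create distributed datasets from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, Amazon S3, and so on.

Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Text file RDDs are created with SparkContext's textFile method. This method takes a URI for the file (a local path, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines, so each element of the RDD is one line of the text file.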



 

>>> distFile = sc.textFile("data.txt")



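Once created, the dataset can be acted on by operations such as reduce.

Some notes on reading files:

--- If using a path on the local file system, the file must also be accessible at the same path on the worker nodes. Either copy the file to all workers or use a network-mounted shared file system such as NFS.

--- All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards. For example:

textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

--- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS, according to the 1.5.2 guide). You can ask for more partitions than blocks by passing a larger value, but you cannot have fewer partitions than blocks. For example, a 640MB file in HDFS with the default block size gets at least 10 partitions.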
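The 640MB example is just a ceiling division, which the short sketch below spells out. The function name min_partitions is hypothetical, and the 64MB default is the figure quoted in these notes, not something the code looks up from a cluster.

```python
import math


def min_partitions(file_size_mb, block_size_mb=64):
    """Number of blocks a file occupies, i.e. the smallest partition count."""
    return int(math.ceil(file_size_mb / float(block_size_mb)))


default_partitions = min_partitions(640)
print(default_partitions)
```

A file that is only slightly larger than a multiple of the block size still needs one more block, so min_partitions(650) gives 11. To request more partitions than that, pass a larger value as the second argument of textFile.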


 

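Apart from text files, Spark's Python API also supports several other data formats:

--- SparkContext.wholeTextFiles reads a directory containing multiple small text files and returns each of them as a (filename, content) pair. This is in contrast with textFile, which returns one record per line in each file.

--- RDD.saveAsPickleFile and SparkContext.pickleFile support saving an RDD in a simple format consisting of pickled Python objects.

--- SequenceFile and Hadoop Input/Output Formats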


    Writable Support


PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts Writables to base Java types, and pickles the resulting Java objects using Pyrolite. When saving an RDD of key-value pairs to SequenceFile, PySpark does the reverse. It unpickles Python objects into Java objects and then converts them to Writables. The following Writables are automatically converted:


Writable Type       Python Type
Text                unicode str
IntWritable         int
FloatWritable       float
DoubleWritable      float
BooleanWritable     bool
BytesWritable       bytearray
NullWritable        None
MapWritable         dict



Arrays are not handled out-of-the-box. Users need to specify custom ArrayWritable subtypes when reading or writing. When writing, users also need to specify custom converters that convert arrays to custom ArrayWritable subtypes. When reading, the default converter will convert custom ArrayWritable subtypes to Java Object[], which then get pickled to Python tuples. To get Python array.array for arrays of primitive types, users need to specify custom converters.


    Saving and Loading SequenceFiles


Similarly to text files, SequenceFiles can be saved and loaded by specifying the path. The key and value classes can be specified, but for standard Writables this is not required.

>>> rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, "a" * x ))
>>> rdd.saveAsSequenceFile("path/to/file")
>>> sorted(sc.sequenceFile("path/to/file").collect())  # sc.sequenceFile reads back what saveAsSequenceFile wrote
[(1, u'a'), (2, u'aa'), (3, u'aaa')]



Saving and Loading Other Hadoop Input/Output Formats


PySpark can also read any Hadoop InputFormat or write any Hadoop OutputFormat, for both ‘new’ and ‘old’ Hadoop MapReduce APIs. If required, a Hadoop configuration can be passed in as a Python dict. Here is an example using the Elasticsearch ESInputFormat:


$ SPARK_CLASSPATH=/path/to/elasticsearch-hadoop.jar ./bin/pyspark


>>> conf = {"es.resource" : "index/type"}   # assume Elasticsearch is running on localhost defaults
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",\
"org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
>>> rdd.first() # the result is a MapWritable that is converted to a Python dict
(u'Elasticsearch ID',
{u'field1': True,
u'field2': u'Some Text',
u'field3': 12345})



Note that, if the InputFormat simply depends on a Hadoop configuration and/or input path, and the key and value classes can easily be converted according to the above table, then this approach should work well for such cases.


If you have custom serialized binary data (such as loading data from Cassandra / HBase), then you will first need to transform that data on the Scala/Java side to something which can be handled by Pyrolite’s pickler. A Converter trait is provided for this. Simply extend this trait and implement your transformation code in the convert method. Remember to ensure that this class, along with any dependencies required to access your InputFormat, are packaged into your Spark job jar and included on the PySpark classpath.


See the Python examples and the Converter examples for examples of using Cassandra / HBase InputFormat and OutputFormat with custom converters.



6. RDD Operations

 


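RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

All transformations in Spark are lazy: they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.

This design enables Spark to run more efficiently.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you can keep a transformed RDD in memory by calling its persist (or cache) method.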


6.1 Basics

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)



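The first line defines a base RDD from an external file. This dataset is not loaded into memory: lines is merely a pointer to the file.

The second line defines lineLengths as the result of a map transformation. lineLengths is not computed immediately either.

Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks and ships them to separate machines. Each machine runs its part of the map and a local reduction, and returns only its answer to the driver program.

If we wanted to use lineLengths again later, we could add the following statement before the reduce, which keeps lineLengths in memory after the first time it is computed: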


lineLengths.persist()
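The lazy behavior described above can be imitated in a few lines of plain Python. The LazyDataset class below is a toy stand-in invented for these notes, not part of PySpark: map only records the function, reduce triggers the work, and persist keeps the computed elements so that a second action does not recompute them.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: map() records work, reduce() runs it."""

    def __init__(self, source, funcs=()):
        self.source = source        # callable returning the base elements
        self.funcs = tuple(funcs)   # recorded map functions, not yet applied
        self.persisted = False
        self.cached = None          # filled by the first action after persist()

    def map(self, func):
        # Transformation: remember func and return a new dataset; nothing runs.
        return LazyDataset(self.source, self.funcs + (func,))

    def persist(self):
        self.persisted = True
        return self

    def _compute(self):
        if self.cached is not None:
            return self.cached
        items = list(self.source())
        for func in self.funcs:
            items = [func(x) for x in items]
        if self.persisted:
            self.cached = items
        return items

    def reduce(self, func):
        # Action: the recorded transformations are executed here.
        items = self._compute()
        result = items[0]
        for x in items[1:]:
            result = func(result, x)
        return result


calls = []  # records every invocation of the mapped function


def line_length(s):
    calls.append(s)
    return len(s)


def add(a, b):
    return a + b


lines = LazyDataset(lambda: ["spark", "rdd", "shuffle"])
line_lengths = lines.map(line_length)
calls_before_action = len(calls)           # map alone ran nothing

total_length = line_lengths.reduce(add)    # first action computes the map
line_lengths.reduce(add)                   # second action recomputes it
calls_without_persist = len(calls)

line_lengths.persist()
line_lengths.reduce(add)                   # computes once more and caches
line_lengths.reduce(add)                   # served from the cache
calls_with_persist = len(calls)
print(total_length, calls_before_action, calls_without_persist, calls_with_persist)
```

The call counter makes the cost visible: without persist every action pays for the map again, while after persist only the first action does.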



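6.2 Passing functions to Spark

Spark's API relies heavily on passing functions in the driver program to run on the cluster. There are three recommended ways to do this:

--- Lambda expressions, for simple functions that can be written as a single expression.

--- Local defs inside the function calling into Spark, for longer code.

--- Top-level functions in a module.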


(1) For example, to pass a longer function than can be supported using a lambda, consider the code below:


"""MyScript.py"""
if __name__ == "__main__":
    def myFunc(s):
        words = s.split(" ")
        return len(words)

    sc = SparkContext(...)
    sc.textFile("file.txt").map(myFunc)



(2) Using a method of a class


Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. For example, consider:


class MyClass(object):
    def func(self, s):
        return s
    def doStuff(self, rdd):
        return rdd.map(self.func)



Here, if we create a new MyClass and call doStuff on it, the map inside there references the func method of that MyClass instance, so the whole object needs to be sent to the cluster.

(3) Accessing fields of the outer object


  In a similar way, accessing fields of the outer object will reference the whole object:


class MyClass(object):
    def __init__(self):
        self.field = "Hello"
    def doStuff(self, rdd):
        return rdd.map(lambda s: self.field + s)



To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:


def doStuff(self, rdd):
    field = self.field
    return rdd.map(lambda s: field + s)
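The difference between the two versions can be checked in plain Python by inspecting what each lambda has captured. This sketch needs no Spark; the helper captured_values and the payload attribute are made up for the illustration, with payload standing in for whatever large state a real object might carry.

```python
class MyClass(object):
    def __init__(self):
        self.field = "Hello"
        self.payload = list(range(1000))  # stands in for a large attribute

    def func_using_self(self):
        # The lambda reads self.field, so it closes over the whole instance.
        return lambda s: self.field + s

    def func_using_local(self):
        # Copy the field first; the lambda then closes over just the string.
        field = self.field
        return lambda s: field + s


def captured_values(func):
    """Return the objects a function holds in its closure cells."""
    return [cell.cell_contents for cell in (func.__closure__ or ())]


obj = MyClass()
captured_self = captured_values(obj.func_using_self())
captured_local = captured_values(obj.func_using_local())
print(captured_self[0] is obj, captured_local)
```

Whatever sits in those closure cells has to travel with the function when it is serialized, which is why the local-variable version sends only the string and not the instance with its payload.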


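7. Understanding closures

One of the harder things about Spark is understanding the scope and life cycle of variables and methods when code is executed across a cluster.

RDD operations that modify variables outside of their scope are a frequent source of confusion.

The example below uses foreach() to increment a counter, but similar issues can occur with other operations as well.

7.1 Example

Consider the naive sum of RDD elements below, which behaves completely differently depending on whether execution happens within the same JVM.

A common case is running Spark in local mode first and then deploying the application to a cluster. (Keep the debugging environment consistent with the deployment environment, otherwise problems like this one can surface late.)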


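counter = 0
rdd = sc.parallelize(data)

# Wrong: Don't do this!!
# (a lambda cannot contain the statement counter += x, so a def is used)
def increment_counter(x):
    global counter
    counter += x
rdd.foreach(increment_counter)

print("Counter value: ", counter)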



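7.2 Local vs. cluster modes

The primary challenge is that the behavior of the code above is undefined.

In local mode with a single JVM, the guide states that the code sums the values within the RDD and stores the result in counter, because both the RDD and the counter variable live in the same memory space on the driver node.

In cluster mode, however, things are more complicated, and the code does not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which runs on an executor. Prior to execution, Spark computes the closure: the variables and methods that must be visible to the executor for it to perform its computations on the RDD.

This closure is serialized and sent to each executor. In local mode there is only one executor, so everything shares the same closure. In other modes this is not the case: the executors running on separate worker nodes each have their own copy of the closure. The issue is that the variables within the closure sent to each executor are now copies, so when counter is referenced within the foreach function, it is no longer the counter on the driver node. There is still a counter in the memory of the driver node, but it is no longer visible to the executors. The executors only see the copy from the serialized closure. Thus, the final value of counter is still zero, since all operations on counter were applied to the copy inside the serialized closure.

To get well-defined behavior in this kind of scenario, use an accumulator. Accumulators in Spark exist specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes.

In general, closures (constructs like loops or locally defined methods) should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures.

Some code that does this may work locally, but only by accident, and it will not behave as expected on a cluster. If some global aggregation is needed, use an accumulator instead.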
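The copy semantics can be imitated without a cluster. In the sketch below, copy.deepcopy stands in for serializing the closure and shipping it to an executor; the names run_on_executors and add_to_counter are invented for this illustration, and the final sum is only a rough model of how per-task results are merged back on the driver.

```python
import copy


def run_on_executors(partitions, task, closure_vars):
    """Run task over each partition with its own deep copy of closure_vars.

    Returns the copies so the caller can see what each "executor" ended with.
    """
    copies = []
    for part in partitions:
        local_vars = copy.deepcopy(closure_vars)  # models serialize and ship
        for x in part:
            task(local_vars, x)
        copies.append(local_vars)
    return copies


def add_to_counter(variables, x):
    variables["counter"] += x


driver_vars = {"counter": 0}
partitions = [[1, 2], [3, 4]]

executor_vars = run_on_executors(partitions, add_to_counter, driver_vars)
driver_counter = driver_vars["counter"]  # untouched: only the copies changed

# Accumulator-style fix: send each task's partial result back and merge it.
accumulated = sum(v["counter"] for v in executor_vars)
print(driver_counter, accumulated)
```

Each copy did its share of the work, but nothing flowed back to the driver until the partial results were merged explicitly, which is the job an accumulator performs.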



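7.3 Printing elements of an RDD

Another common idiom is printing the elements of an RDD with rdd.foreach(println) or rdd.map(println) (the guide writes these in Scala syntax). In local mode this produces the expected output. In cluster mode, however, the output goes to the stdout of the executors, so nothing shows up on the driver. To print all elements on the driver, first bring the RDD to the driver node with collect: rdd.collect().foreach(println), or in Python, for x in rdd.collect(): print(x). This can cause the driver to run out of memory (OOM).

The reason is that collect fetches the entire RDD to a single machine: data computed across a cluster of perhaps dozens or hundreds of machines is gathered onto one, which can easily overwhelm it.

If you only need to print a few elements of the RDD, a safer approach is to use take: rdd.take(100).foreach(println), or in Python, for x in rdd.take(100): print(x).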



8.  Working with Key-Value Pairs

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements by a key.


In Python, these operations work on RDDs containing built-in Python tuples such as (1, 2). Simply create such tuples and then call your desired operation.


For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:


lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)



We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as a list of objects.



9. Transformations & Actions



The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.


Transformation
    Meaning

map(func)
    Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func)
    Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func)
    Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func)
    Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func)
    Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

sample(withReplacement, fraction, seed)
    Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset)
    Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset)
    Return a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks])
    Return a new dataset that contains the distinct elements of the source dataset.

groupByKey([numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
    Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
    Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks])
    When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numTasks])
    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks])
    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset)
    When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars])
    Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions)
    Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions)
    Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner)
    Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.



Actions


The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.


Action
    Meaning

reduce(func)
    Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect()
    Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count()
    Return the number of elements in the dataset.

first()
    Return the first element of the dataset (similar to take(1)).

take(n)
    Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed])
    Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])
    Return the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path)
    Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala)
    Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala)
    Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey()
    Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func)
    Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
    Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.



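10. Shuffle operations

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for redistributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, which makes the shuffle a complex and costly operation.

Take reduceByKey as an example. reduceByKey generates a new RDD in which all values for a single key are combined into a tuple: the key, and the result of running a reduce function over all values associated with that key. The challenge is that the values for a single key do not necessarily reside on the same partition, or even the same machine, yet they must be brought together to compute the result.

In Spark, data is generally not distributed across partitions so as to be in the necessary place for a specific operation. During computation, a single task operates on a single partition. So, to organize all the data for a single reduceByKey reduce task, Spark needs to perform an all-to-all operation: it must read from all partitions to find all the values for all keys, and then bring the values for each key together across partitions to compute the final result. This is the shuffle. In other words, a shuffle is inherently a cross-partition operation.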
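The three phases described above (combine within each input partition, route by key, merge per key) can be sketched in plain Python. The function reduce_by_key below is a simplified model written for these notes, not Spark's implementation; in particular, routing with hash(key) modulo the number of output partitions is an assumption chosen for illustration.

```python
from collections import defaultdict


def reduce_by_key(partitions, func, num_reducers):
    """Model reduceByKey over a list of partitions of (key, value) pairs."""
    # Map side: combine the values for each key inside every input partition.
    map_outputs = []
    for part in partitions:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        map_outputs.append(combined)

    # Shuffle: route each key to the output partition chosen by its hash, so
    # all partial results for one key meet in the same place.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for combined in map_outputs:
        for key, value in combined.items():
            buckets[hash(key) % num_reducers][key].append(value)

    # Reduce side: merge the partial results that arrived for each key.
    result = []
    for bucket in buckets:
        out = {}
        for key, values in bucket.items():
            merged = values[0]
            for v in values[1:]:
                merged = func(merged, v)
            out[key] = merged
        result.append(out)
    return result


input_partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1), ("a", 1)],
]
output_partitions = reduce_by_key(input_partitions, lambda a, b: a + b, 2)

counts = {}
for part in output_partitions:
    counts.update(part)
print(counts)
```

Which output partition a given key lands in depends on the hash, so only the merged counts are stable from run to run. The point of the routing step is that every key ends up in exactly one output partition, no matter how many input partitions it appeared in.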


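Although the set of elements in each partition of newly shuffled data is deterministic, and so is the ordering of the partitions themselves, the ordering of the elements within a partition is not. If you need predictably ordered data after a shuffle, you can use:

--- mapPartitions to sort each partition, for example with sorted

--- repartitionAndSortWithinPartitions to sort partitions while repartitioning at the same time

--- sortBy to produce a globally ordered RDD

Operations that can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.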


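10.1 Performance impact

The shuffle is an expensive operation because it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks: map tasks to organize the data, and reduce tasks to aggregate it. The names come from MapReduce and do not directly relate to Spark's map and reduce operations.

Internally, results from individual map tasks are kept in memory. They are then sorted by target partition and written to a file. On the reduce side, tasks read the relevant sorted blocks.

Certain shuffle operations can consume significant amounts of heap memory, since they use in-memory data structures to organize records. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate them on the reduce side. When the data does not fit in memory, Spark spills it to disk, incurring the additional overhead of disk I/O and increased garbage collection.

The shuffle also generates a large number of intermediate files on disk. These files are preserved until the corresponding RDDs are no longer used. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is configured through spark.local.dir.

Shuffle behavior can be tuned through a variety of configuration parameters; see the Spark Configuration Guide.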


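11. RDD Persistence

Persisting (or caching) an RDD in memory lets later actions reuse it, which makes them much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

Use the persist() or cache() method to mark an RDD to be kept in memory.

In addition, each persisted RDD can be stored using a different storage level. A storage level is set by passing a StorageLevel object to persist(). The cache() method uses the default storage level, MEMORY_ONLY. The full set of storage levels is: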


   


Storage Level
    Meaning

MEMORY_ONLY
    Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK
    Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER
    Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER
    Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY
    Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
    Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental)
    Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory. If you plan to use Tachyon as the off heap store, Spark is compatible with Tachyon out-of-the-box. Please refer to this page for the suggested version pairings.



Spark also automatically persists some intermediate data in shuffle operations, even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle.

11.1 Which storage level to choose

Given the seven storage levels above, how do you pick one in practice?

The storage levels represent different trade-offs between memory usage and CPU efficiency. The recommended process for selecting one is:

--- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

--- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library.

--- Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

--- Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

--- In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:
      It allows multiple executors to share the same pool of memory in Tachyon.
      It significantly reduces garbage collection costs.
      Cached data is not lost if individual executors crash.


11.2 Removing data

Spark automatically monitors cache usage on each node and drops old data in least-recently-used (LRU) fashion. You can also remove an RDD manually with the RDD.unpersist method.
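For readers unfamiliar with LRU eviction, the sketch below shows the policy on its own in plain Python. The LRUCache class is a toy written for these notes: it counts entries rather than bytes, and its unpersist method only mirrors the name used above, so it should not be read as a description of Spark's internals.

```python
from collections import OrderedDict


class LRUCache(object):
    """Keep at most capacity entries, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self.entries:
            return None
        value = self.entries.pop(key)
        self.entries[key] = value     # re-insert as most recently used
        return value

    def put(self, key, value):
        if key in self.entries:
            self.entries.pop(key)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[key] = value

    def unpersist(self, key):
        # Manual removal, without waiting for eviction.
        self.entries.pop(key, None)


cache = LRUCache(2)
cache.put("rdd_a", [1, 2])
cache.put("rdd_b", [3, 4])
cache.get("rdd_a")               # rdd_a becomes the most recently used
cache.put("rdd_c", [5, 6])       # cache is full, so rdd_b is evicted
keys_after_eviction = list(cache.entries)
cache.unpersist("rdd_a")
keys_after_unpersist = list(cache.entries)
print(keys_after_eviction, keys_after_unpersist)
```

Reading rdd_a before inserting rdd_c is what saves it: eviction follows access order, not insertion order.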


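12. Shared variables

Three terms are used in this section:

function: code written by the user and passed to Spark.

method: something provided by Spark itself, such as value for reading a broadcast variable.

operation: something performed on an RDD.

Normally, when a function is passed to a Spark operation and executed on a remote node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and updates made on the remote machine are not propagated back to the driver program. Spark provides two types of shared variables: broadcast variables and accumulators.

12.1 Broadcast variables

Broadcast variables keep a read-only variable cached on each machine. A typical use is giving every node a copy of a large input dataset. Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages. Spark automatically broadcasts the common data needed by the tasks within each stage. Data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data.

A broadcast variable is created by calling SparkContext.broadcast(v). Its value is read through the value method. For example: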


>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>

>>> broadcastVar.value
[1, 2, 3]



    


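After the broadcast variable is created, it should be used instead of the original value v in any functions run on the cluster. In addition, the object v should not be modified after it is broadcast, to ensure that all nodes get the same value of the broadcast variable.

12.2 Accumulators

Accumulators are variables that are only "added" to. They can be used to implement counters or sums. Spark natively supports accumulators of numeric types, and programmers can add support for other types. If an accumulator is created with a name, it is displayed in Spark's UI, which is useful for understanding the progress of running stages. (As far as these notes can tell from the 1.5.2 guide, this display of named accumulators is not yet supported in Python.)

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, tasks cannot read its value. Only the driver program can read the accumulator's value, using its value method.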


>>> accum = sc.accumulator(0)
Accumulator<id=0, value=0>

>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

>>> accum.value
10



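As an exercise, run the example above on a cluster with the first line changed so that accum is a plain variable instead of an accumulator, and compare the result with the accumulator version: the two differ. That is distributed function execution at work. Always keep in mind that variables are shipped out to tasks and results are collected back.

You can create your own accumulator types by subclassing AccumulatorParam and implementing its two methods, zero and addInPlace. For example: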


class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, initialValue):
        return Vector.zeros(initialValue.size)

    def addInPlace(self, v1, v2):
        v1 += v2
        return v1

# Then, create an Accumulator of this type:
vecAccum = sc.accumulator(Vector(...), VectorAccumulatorParam())
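To see how zero and addInPlace fit together, here is a plain-Python model that does not import pyspark. ListAccumulatorParam is a hypothetical stand-in that uses lists instead of the Vector class above, and the accumulate function is an assumed, simplified picture of the flow: each task starts from the zero value, folds in its own updates, and the driver merges the per-task results.

```python
class ListAccumulatorParam(object):
    """Stand-in with the same two methods as the example: zero and addInPlace."""

    def zero(self, initial_value):
        # A neutral element with the same shape as the initial value.
        return [0] * len(initial_value)

    def addInPlace(self, v1, v2):
        # Element-wise addition, mutating and returning v1.
        for i, x in enumerate(v2):
            v1[i] += x
        return v1


def accumulate(param, initial_value, task_updates):
    """Fold per-task updates into one total using the param's two methods."""
    total = list(initial_value)
    for updates in task_updates:
        local = param.zero(initial_value)    # each task starts from zero
        for update in updates:
            local = param.addInPlace(local, update)
        total = param.addInPlace(total, local)  # driver merges task results
    return total


# Two tasks: the first adds two vectors, the second adds one.
task_updates = [[[1, 2], [3, 4]], [[5, 6]]]
result = accumulate(ListAccumulatorParam(), [0, 0], task_updates)
print(result)
```

Starting every task from the zero value rather than from the current total is what keeps the merge correct: the initial value is counted once, on the driver side only.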




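For accumulator updates performed inside actions, Spark guarantees that each task's update to the accumulator is applied only once: restarted tasks will not update the value.

In transformations, be aware that each task's update may be applied more than once if tasks or job stages are re-executed.

Accumulators do not change Spark's lazy evaluation model.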



accum = sc.accumulator(0)
def g(x):
    accum.add(x)
    return f(x)
data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.




Because no action has been run, the accumulator accum above is still 0.

Original link:

Source: Spark Programming Guide