Spark学习笔记之二

一.RDD的五个基本特征

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition,[[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value pairs, such as groupByKey and join;[[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of Doubles;[[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that can be saved as SequenceFiles.
All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit.
//任何正确类型的RDD 均可通过隐式转换自动获得所有操作。
Internally, each RDD is characterized by five main properties:

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

    All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions.

二.Spark基础知识

1.如果想往hdfs中写入一个文件,但是如果这个文件已经存在,则会报错。
2.触发Action时才会形成一个完整的DAG,然后需要提交到集群上执行。
3.任务在提交到集群之前,需要做一些准备工作,这些准备工作都是在Driver端进行的。准备工作包括:

  • 01.构建DAG(有向无环图)
  • 02.将DAG切分成1到多个stage
  • 03.任务分阶段执行,需要按顺序提交stage。因为后执行的stage需要依赖先执行stage的结果
  • 04.一个stage会生成多个task,然后提交到executor中。stage生成Task的数量跟该阶段RDD的分区数量一致。【为了充分并行化】

4.RDD之间存在着依赖关系,依赖关系有两种

  • 01,窄依赖:不存在shuffle操作
  • 02,宽依赖:存在shuffle

6.广播变量

  • 01.将Driver端的变量广播到属于自己的所有的Executor中。这样是为了减少worker节点中内存的消耗,和网络带宽的消耗[如果每个Executor进程都需要一个共享变量,那么一个worker中可能需要好几份变量值,这样就会增加网络带宽的消耗] 如果在Driver端创建一个变量,并且传入到RDD方法的函数中,方法中传入的函数是在Executor的Task中执行的,Driver会将这个在Driver中定义的变量方式给每一个Task。
  • 02.Spark提供了一个模板,可以继承这个模板,作为广播变量。
  • 03.示例如下:

7.若将数据collect到Driver端,再写到MySQL中,不是最佳选择,理由如下:
01,浪费网络带宽,
02,不能够并发地写到数据库
8.result.foreach(data =>{
//不能每写一条,就创建一个JDBC连接,导致消耗很大资源
//val conn : Connection = DriverManager.getConnection(“……”)
})
9.正确的操作应该是,一次拿出来一个分区,分区用迭代器引用。即每写一个分区,才会创建一个连接,这样会大大减少资源的消耗。
result.foreachPartition(part=>{
dataToMysql(part)
})

def dataToMysql(part:Iterator[(String,Int)]={
val conn : Connection = DriverManager.getConnection(“jdbc:mysql://localhost:3306/……charactorEncoding=utf8”,”root”,”“)//再传入用户名,密码
val prepareStatement = conn.prepareStatement(“Insert into access_log (province,counts)
values (?,?)”)

//写入数据
part.foreach(data =>{
    prepareStatement.setString(1,data._1)
    prepareStatement.setString(2,data._2)
    prepareStatement.executeUpdate()
})
prepareStatement.close()
conn.close()

}