Chapter 1: Spark Monitoring Overview

  • 1.1 Spark monitoring overview
  • 1.2 Configuration under $SPARK_HOME
  • 1.3 Local test with spark-shell

Chapter 2: Other Ways to Monitor

  • 2.1 REST API
  • 2.2 Using the REST API

Chapter 3: Shared Variables

  • 3.1 Broadcast Variables
  • 3.1.1 Ordinary join
  • 3.1.2 Broadcast join
  • 3.2 Accumulator

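Chapter 1: Spark Monitoring Overview

  1. Start spark-shell
  2. Run sc.parallelize(List(1,2,3,4)).count
  3. Look at the web UI: there is a single stage with 4 tasks. The things to watch are the launch time, the duration, and the GC time.

Scenario: when we run locally, the UI goes away as soon as we exit Spark. If a Spark job runs overnight, then whether it finishes or crashes, everything ends with it and no information is left behind.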

Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:

  • A list of scheduler stages and tasks
  • A summary of RDD sizes and memory usage
  • Environment information
  • Information about the running executors

  1. You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
  • The exact URL is printed when spark-shell starts.
  2. Note that this information is only available for the duration of the application by default.
  • In other words, once spark-shell is closed the UI can no longer be reached.
  3. To view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures Spark to log Spark events that encode the information displayed in the UI to persisted storage.
  • The events go to persistent storage, not memory, so they outlive the application.

The default behaviour is therefore not enough for production use, which is why dedicated monitoring is needed.

1.1 Spark monitoring overview

  1. It is still possible to construct the UI of an application through Spark's history server, provided that the application's event logs exist. You can start the history server by executing:
  • From $SPARK_HOME: ./sbin/start-history-server.sh
  2. This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
  3. When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represent an application's event log.
  4. The Spark jobs themselves must be configured to log events, and to log them to the same shared, writable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:
    Step 1: spark.eventLog.enabled true
    Step 2: spark.eventLog.dir hdfs://namenode/shared/spark-logs    // once logging is enabled, point it at the HDFS directory
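
The two client-side options can also be set per application in code instead of in a config file. The following is a minimal sketch, assuming Spark is on the classpath and that hdfs://namenode/shared/spark-logs already exists and is writable. The object name EventLogApp and the helper buildConf are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example app: enables event logging so the history server
// can rebuild the UI after the application has finished.
object EventLogApp {

  // Build a SparkConf carrying the two client-side options from the text.
  def buildConf(logDir: String): SparkConf =
    new SparkConf()
      .setAppName("EventLogApp")
      .setMaster("local[2]")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", logDir)

  def main(args: Array[String]): Unit = {
    // Assumed directory, taken from the example above; it must exist beforehand.
    val sc = new SparkContext(buildConf("hdfs://namenode/shared/spark-logs"))
    try {
      println(sc.parallelize(List(1, 2, 3, 4)).count())
    } finally {
      sc.stop() // stopping the context marks the application as completed
    }
  }
}
```

Stopping the context in a finally block matters here: an application that never stops its SparkContext stays listed as incomplete in the history server.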

Environment Variables

  1. SPARK_HISTORY_OPTS: spark.history.* configuration options for the history server (default: none).
  • Options whose names start with spark.history are passed to the history server through SPARK_HISTORY_OPTS.

1.2 Configuration under $SPARK_HOME

Spark History Server Configuration Options

property name                         default
spark.history.provider                org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory         file:/tmp/spark-events
spark.history.fs.update.interval      10s
spark.history.retainedApplications    50
spark.history.fs.cleaner.enabled      false
spark.history.fs.cleaner.interval     1d
spark.history.fs.cleaner.maxAge       7d

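Configure under $SPARK_HOME/conf:

  1. cd $SPARK_HOME/conf, then copy the template: cp spark-defaults.conf.template spark-defaults.conf; edit the copy: vi spark-defaults.conf
  2. (Screenshot not preserved. The settings to add are the spark.eventLog.* options from section 1.1, with spark.eventLog.dir pointing at the same directory used below.)

  3. cp spark-env.sh.template spark-env.sh; edit it and set SPARK_HISTORY_OPTS:
  • SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://hadoop002:9000/g6_directory"
  4. From $SPARK_HOME/sbin, run ./start-history-server.sh. The log directory /g6_directory must already exist on HDFS before the server is started.
  • Check the log output under $SPARK_HOME/logs to confirm that the server started normally.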

Open hadoop002:18080 in a browser.

If nothing has been logged yet, the page says "No completed applications found!" and adds the following hints:

  1. Did you specify the correct logging directory? Please verify your setting of spark.history.fs.logDirectory listed above and whether you have the permissions to access it.
  2. It is also possible that your application did not run to completion or did not stop the SparkContext.

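1.3 Local test with spark-shell

  1. Start spark-shell locally, run sc.parallelize(List(1,2,3,4)).count, then exit the shell so that the SparkContext is stopped.
  2. Open hadoop002:18080 and check whether the application is listed. Because it ran in local mode, its App ID starts with "local".
  3. All the run information is there; the pages are the same as the ones served on hadoop002:4040 while the application was running.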


Things to note:

  • Note that in all of these UIs, the tables are sortable by clicking their headers, making it easy to identify slow tasks, data skew, etc.
  • For example, on port 18080 open the tasks table and click the Duration header to sort; skewed tasks stand out immediately.

Note:

  1. The history server displays both completed and incomplete Spark jobs. If an application makes multiple attempts after failures, the failed attempts will be displayed, as well as any ongoing incomplete attempt or the final successful attempt.
  2. Incomplete applications are only updated intermittently. The time between updates is defined by the interval between checks for changed files (spark.history.fs.update.interval). On larger clusters, the update interval may be set to large values. The way to view a running application is actually to view its own web UI.
  3. One way to signal the completion of a Spark job is to stop the SparkContext explicitly (sc.stop()), or in Python using the with SparkContext() as sc: construct to handle the SparkContext setup and tear down.

Chapter 2: Other Ways to Monitor

2.1 REST API

  1. In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. This JSON is available for both running applications and in the history server. The endpoints are mounted at /api/v1. For the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
  • This makes it possible to build a custom Spark monitoring tool: the JSON is served both by a running application (port 4040) and by the history server (port 18080).
  2. In the API, an application is referenced by its application ID, [app-id]. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not applications in client mode. Applications in YARN cluster mode can be identified by their [attempt-id]. In the API listed below, when running in YARN cluster mode, [app-id] will actually be [base-app-id]/[attempt-id], where [base-app-id] is the YARN application ID.
  • An application is referenced by its application ID. With Spark on YARN an application can have several attempts, but attempt IDs exist only in cluster mode, not in client mode.

Check directly in the browser:

2.2 Using the REST API

http://hadoop002:18080/api/v1/applications returns JSON: an array with one JSON object per application.

  1. hadoop002:18080/api/v1 shows nothing on its own.
  2. hadoop002:18080/api/v1/applications returns a JSON array of all applications.
  3. hadoop002:18080/api/v1/applications?status=[completed|running] filters by status, for example to check whether any applications are currently running.
  4. hadoop002:18080/api/v1/applications/[app-id]/jobs lists the jobs of the given application.
  5. /applications/[app-id]/jobs/[job-id] Details for the given job.

Typical use: once the service is up, design the UI together with the front-end team and hand them these endpoints.
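
The endpoints above can also be called from code. The following is a minimal sketch using only the Scala standard library. The object SparkRestClient and its helpers are hypothetical names, and the host and port are the ones assumed in this article.

```scala
import scala.io.Source

// Hypothetical helper around the history server's REST endpoints.
object SparkRestClient {

  // Base URL of the API; host and port are assumptions taken from the text.
  val defaultBase = "http://hadoop002:18080/api/v1"

  // /applications, optionally filtered with ?status=completed or ?status=running
  def applicationsUrl(base: String, status: Option[String] = None): String =
    status.fold(s"$base/applications")(s => s"$base/applications?status=$s")

  // /applications/[app-id]/jobs
  def jobsUrl(base: String, appId: String): String =
    s"$base/applications/$appId/jobs"

  // Fetch the raw JSON text; parsing is left to whatever JSON library is in use.
  def fetch(url: String): String = {
    val src = Source.fromURL(url, "UTF-8")
    try src.mkString finally src.close()
  }

  def main(args: Array[String]): Unit = {
    // Requires a reachable history server.
    println(fetch(applicationsUrl(defaultBase, Some("completed"))))
  }
}
```

Keeping the URL builders separate from fetch means the paths can be checked without a running server.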

Executor Task Metrics

Metrics: rarely needed in practice.

Summary: the main things to remember are the HistoryServer and the REST API.

  1. jps shows that HistoryServer is simply a Java process; ps -ef | grep HistoryServer finds it as well.
  2. When the HistoryServer is no longer needed, stop it with ./stop-history-server.sh
  3. Where the logs are stored: hdfs dfs -ls hdfs://hadoop002:9000/g6_directory
  • hdfs dfs -text hdfs://hadoop002:9000/g6_directory/app-id
  • The content is JSON; the JSON returned by the REST API is derived from these records.
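Chapter 3: Shared Variables

Definition:

  • Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
  • In other words, the function given to an operator such as map or reduce runs on the executors and works on its own copy of every variable it uses; those copies are shipped to each machine. General read-write variables shared across tasks are not offered because they would be inefficient.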
val values = new HashMap()
val rdd = ....
rdd.foreach( x => {
    values  //.....    a variable defined outside the operator is used inside it
})
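  • If an operator uses an ordinary outer variable directly, Spark copies that variable to every task. This wastes a lot of memory, and it also raises the question of how independent updates made by different tasks could all take effect without conflicting. Spark addresses this with two features: broadcast variables and accumulators.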

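3.1 Broadcast Variables

Definition:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

  • A broadcast variable has one copy per machine, not one copy per task.
  • Suppose the value is 10 MB and there are 1000 tasks. Done the ordinary way, with the operator using the outer variable directly, the variable
    has to be copied to every task: 10 MB × 1000 is about 10 GB in total, which wastes far too much memory.

This motivates broadcast variables: one copy per machine instead of one copy per task.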
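
The arithmetic behind this comparison can be written out as a small sketch. The 10 MB value and the 1000 tasks come from the text; the machine count of 10 is a made-up figure for illustration, as are the object and method names.

```scala
// Back-of-the-envelope comparison; the machine count is a made-up example value.
object BroadcastCost {

  // Total MB shipped when every task gets its own copy of the value.
  def perTaskTotalMb(valueMb: Long, tasks: Long): Long = valueMb * tasks

  // Total MB when the value is cached once per machine.
  def perMachineTotalMb(valueMb: Long, machines: Long): Long = valueMb * machines

  def main(args: Array[String]): Unit = {
    val valueMb = 10L   // size of the variable, from the text
    val tasks = 1000L   // number of tasks, from the text
    val machines = 10L  // hypothetical cluster size
    println(s"per task:    ${perTaskTotalMb(valueMb, tasks)} MB")
    println(s"per machine: ${perMachineTotalMb(valueMb, machines)} MB")
  }
}
```

Under these assumptions the per-task approach ships 10000 MB in total, while one copy per machine needs 100 MB.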

Testing a broadcast variable in spark-shell:

  1. val broadcastVar = sc.broadcast(Array(1,2,3,4))
  2. broadcastVar.value
  • This toy form is not how broadcast variables are used in production.

3.1.1 Ordinary join

Scenario 1:

info1          | info2
G601 阿呆      | G601 南京大学
G602 君永夜    | G602 苏州大学
G622 血狼      | G638 三江学院

Operation: info1.join(info2)

package Sparkcore04

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("AccumulatorApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    commonJoin(sc)
    sc.stop()
  }

  def commonJoin(sc: SparkContext): Unit = {
    val info1 = sc.parallelize(Array(("G601", "阿呆"), ("G602", "君永夜"), ("G622", "血狼")))

    val info2 = sc.parallelize(Array(("G601", "南京大学"), ("G602", "苏州大学"), ("G638", "三江学院")))

    // Both RDDs are (key, value) pairs, so they can be joined directly
    info1.join(info2).foreach(println)
  }
}

Output:
(G601,(阿呆,南京大学))
(G602,(君永夜,苏州大学))

Scenario 2:

info1          | info2
G601 阿呆      | G601 南京大学 24
G602 君永夜    | G602 苏州大学 25
G622 血狼      | G638 三江学院 27

def commonJoin(sc: SparkContext): Unit = {
  val info1 = sc.parallelize(Array(("G601", "阿呆"), ("G602", "君永夜"), ("G622", "血狼")))

  val info2 = sc.parallelize(Array(("G601", "南京大学", "24"), ("G602", "苏州大学", "25"), ("G638", "三江学院", "27")))
    .map(x => (x._1, x))   // key each 3-tuple by its first field so it becomes a (key, value) pair

  // Both RDDs are now (key, value) pairs, so they can be joined directly
  info1.join(info2).foreach(println)
}

Output (compare it with the output of the first version):
(G602,(君永夜,(G602,苏州大学,25)))
(G601,(阿呆,(G601,南京大学,24)))

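The result we actually want is: G601,阿呆,南京大学
So the code is modified further to pick out the needed fields, which gives the desired result:

// Join, then format each record as "key,name,school"
info1.join(info2)
  .map(x => x._1 + "," + x._2._1 + "," + x._2._2._2)
  .foreach(println)

Output: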
G601,阿呆,南京大学
G602,君永夜,苏州大学

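Keep the application alive for a while so that the UI can be inspected:

  • commonJoin(sc)
    Thread.sleep(2000000)
    sc.stop()
    Then open the UI in a browser at localhost:4041

    Reading the stages:
  1. stage0 is the map
  2. stage1 is the parallelize
  3. stage2 is the join

How broadcast is used in production: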

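3.1.2 Broadcast Join

  1. Once the small table has been broadcast, the join operator is no longer used.
  2. Each record read from the large table is matched against the broadcast copy of the small table.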
package Sparkcore04

import org.apache.spark.{SparkConf, SparkContext}

object BroadCastJoin {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("BroadCastJoin").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // This approach involves no shuffle, provided the small table really is small.
    broadcastJoin(sc)
    Thread.sleep(200000)
    sc.stop()
  }

  def broadcastJoin(sc: SparkContext): Unit = {

    // Small table ==> collect it to the driver as a Map so it can be broadcast
    val info1 = sc.parallelize(Array(("G601", "阿呆"), ("G602", "君永夜"), ("G622", "血狼")))
      .collectAsMap()

    // Broadcast from the driver
    val info1BroadCast = sc.broadcast(info1)

    // Large table, keyed by its first field
    val info2 = sc.parallelize(Array(("G601", "南京大学", "24"), ("G602", "苏州大学", "25"), ("G622", "三江学院", "27"), ("G652", "中国矿业大学", "27")))
      .map(x => (x._1, x))

    info2.mapPartitions(x => {
      // Read the broadcast copy of info1
      val broadcastMap = info1BroadCast.value

      // Iterate over the info2 records of this partition and keep those whose key is in info1.
      // value._2 is the second field of the info2 record.
      for ((key, value) <- x if broadcastMap.contains(key))
        yield (key, broadcastMap.getOrElse(key, ""), value._2)
    }).foreach(println)
  }
}
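
The lookup logic inside mapPartitions can be tried without a cluster. The following is a minimal sketch in plain Scala collections using the same sample data. The object and method names are made up for illustration, and an ordinary Map stands in for the broadcast value.

```scala
// Plain-Scala stand-in for the broadcast join above: no Spark involved.
object LookupJoinSketch {

  // small: the table that would be broadcast; large: the records streamed past it.
  def lookupJoin(small: Map[String, String],
                 large: Seq[(String, String, String)]): Seq[(String, String, String)] =
    for {
      (id, school, _) <- large       // one record of the large table at a time
      name <- small.get(id).toList   // keep it only if the key exists in the small table
    } yield (id, name, school)

  def main(args: Array[String]): Unit = {
    val info1 = Map("G601" -> "阿呆", "G602" -> "君永夜", "G622" -> "血狼")
    val info2 = Seq(
      ("G601", "南京大学", "24"),
      ("G602", "苏州大学", "25"),
      ("G622", "三江学院", "27"),
      ("G652", "中国矿业大学", "27"))
    lookupJoin(info1, info2).foreach(println)
  }
}
```

G652 has no entry in the small table, so it is dropped, just as the contains check drops it in the Spark version.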


3.2 Accumulator

  • Accumulators are variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.
  • In short, an accumulator only supports "add", so it can serve as a counter or a sum. Spark ships with numeric accumulators, and custom types can be added.

Testing in spark-shell:

1. scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

2. scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

3. scala> accum.value
res2: Long = 10

The accumulator shows up on the stage page of the UI. It is shared by all tasks, and in this run the page listed four accumulator entries underneath it.
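
In an application, the same API can count records while a job runs. The following is a minimal sketch, assuming Spark is on the classpath. The object name, the accumulator name and the sample data are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: count blank lines with a long accumulator.
object BlankLineCounter {

  // Runs one action over the lines and returns how many were blank.
  def countBlank(sc: SparkContext, lines: Seq[String]): Long = {
    val blanks = sc.longAccumulator("blank lines")
    // foreach is an action, so the updates are applied when it runs
    sc.parallelize(lines).foreach { line =>
      if (line.trim.isEmpty) blanks.add(1L)
    }
    blanks.value // read the result on the driver
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BlankLineCounter").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(countBlank(sc, Seq("a", "", "b", "  ", "c")))
    } finally {
      sc.stop()
    }
  }
}
```

The tasks only ever call add; the value is read on the driver after the action has finished.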