1. Spark Overview
Spark is a fast, general-purpose, scalable big data analytics engine based on in-memory computing. In most data processing scenarios Spark does outperform MapReduce, but because Spark relies heavily on memory, a job in a real production environment may fail when memory is insufficient. In such cases MapReduce can be the better choice, so Spark does not completely replace MR.
Spark Core:
Spark Core provides Spark's most fundamental and essential functionality. The other Spark modules, such as Spark SQL, Spark Streaming, GraphX, and MLlib, are all built on top of Spark Core.
Spark SQL:
Spark SQL is Spark's component for working with structured data. With Spark SQL, users can query data using SQL or the Apache Hive dialect of SQL (HQL).
Spark Streaming:
Spark Streaming is Spark's component for stream processing of real-time data, offering a rich API for handling data streams.
Spark MLlib:
MLlib is Spark's machine learning library. Besides extras such as model evaluation and data import, it also provides some lower-level machine learning primitives.
Spark GraphX:
GraphX is Spark's framework and algorithm library for graph computation.
2. Spark Quick Start
2.1. Local Mode
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[2] \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
1) --class is the main class of the program to run; replace it with the main class of your own application.
2) --master local[2] sets the deploy mode. The default is local mode; the number is how many virtual CPU cores (i.e. threads) to allocate, and local[*] means use all available virtual cores.
3) spark-examples_2.12-3.0.0.jar is the jar containing the application class to run; in practice, point this at the jar you built yourself.
4) The number 10 is a program argument passed to the application, here used to set the number of tasks for this run.
Note: ① the jar must contain the class file; ② the program's input files and the jar path are resolved relative to the directory from which spark-submit is run.
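As a hedged illustration (the package, object and variable names here are hypothetical, not from the original notes), an application whose main class could be passed to --class in the command above might look like this:
package com.example

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical minimal application: estimates Pi, mirroring the SparkPi example.
object MyPi {
  def main(args: Array[String]): Unit = {
    val slices = if (args.length > 0) args(0).toInt else 2   // plays the role of the trailing "10"
    val conf = new SparkConf().setAppName("MyPi")            // --master is supplied by spark-submit
    val sc = new SparkContext(conf)
    val n = 100000 * slices
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / n}")
    sc.stop()
  }
}
After packaging this into a jar, the same spark-submit command works with --class com.example.MyPi and the path to your own jar.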
2.2. Running Spark on YARN
Configure Spark on YARN and the Spark history server:
[atguigu@hadoop102 conf]$ cat spark-env.sh
#!/usr/bin/env bash
# export JAVA_HOME=/opt/module/jdk1.8.0_212
# (Apache license header omitted)
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# When running Spark on YARN, point to the YARN configuration directory
YARN_CONF_DIR=/opt/module/hadoop-3.1.3/etc/hadoop
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080
-Dspark.history.fs.logDirectory=hdfs://hadoop102:9820/directory
-Dspark.history.retainedApplications=30"
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
# Options for launcher
# - SPARK_LAUNCHER_OPTS, to set config properties and Java options for the launcher (e.g. "-Dx=y")
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1 Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1 Disable multi-threading of OpenBLAS
[atguigu@hadoop102 conf]$ cat spark-defaults.conf
# (Apache license header omitted)
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop102:9820/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# The history server host is the master node's hostname, hadoop102
spark.yarn.historyServer.address=hadoop102:18080
spark.history.ui.port=18080
[atguigu@hadoop102 conf]$ sbin/start-dfs.sh
[atguigu@hadoop102 conf]$ hadoop fs -mkdir /directory
Examples of submitting the application in cluster mode and in client mode:
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
3. Spark Runtime Architecture
When a Spark application is submitted to a YARN environment there are generally two deploy modes, client and cluster, and the main difference between them is where the Driver runs. In client mode the Driver, which handles monitoring and scheduling, runs on the client machine rather than inside YARN, so this mode is mostly used for testing. In cluster mode the Driver is started inside the YARN cluster's resources, so this mode is normally used in production.
RDD (Resilient Distributed Dataset) is the most basic data-processing model in Spark. In code it is an abstract class; it represents an elastic, immutable, partitionable collection whose elements can be computed in parallel.
From the computational point of view, data processing needs compute resources (memory and CPU) and a computation model (the logic); at execution time the two have to be coordinated and combined.
When the Spark framework executes, it first requests resources, then breaks the application's data-processing logic down into individual computation tasks, sends those tasks to the compute nodes that have been allocated resources, performs the computation there according to the specified computation model, and finally obtains the result. RDD is the core data-processing model of the Spark framework; next, let us look at how RDDs work in a YARN environment:
As the flow above shows, an RDD is mainly used to encapsulate the processing logic and to generate the Tasks that are sent to the Executor nodes for execution. The number of RDD partitions determines the total number of Tasks; how the partition count is determined is explained below.
Setting map/reduce parallelism (MapReduce-side settings):
# takes effect only when mapred.map.tasks is larger than the number of map tasks actually needed
--jobconf mapred.map.tasks=20
# takes effect whenever it is set
--jobconf mapred.reduce.tasks=5
When Spark reads files as input (the map side), it parses them with the InputFormat that matches the data format; typically one or more blocks are combined into one input split, called an InputSplit, and note that an InputSplit cannot span files.
Task concurrency = number of Executors × cores per Executor (= total number of virtual cores)
In Spark tuning, increasing the number of RDD partitions (map side: determined by the InputSplits; reduce side: determined by the shuffle) increases task parallelism and keeps resources from sitting idle.
Number of execution waves for a single RDD = number of RDD partitions / task concurrency
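For example (illustrative numbers): with 2 executors × 4 cores there are 8 task slots, so an RDD with 100 partitions is processed in ceil(100 / 8) = 13 waves.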
Similarities and differences between Spark (fast, but demanding on resources: CPU and memory) and MapReduce (slow, but light on resources: CPU and memory):
① When a computation involves no data exchange with other nodes, Spark can complete the whole chain of operations in memory in one go; when the computation does involve a data exchange, Spark still writes the shuffle data to disk.
② Spark's DAGScheduler can express pipelines such as map -> reduce -> reduce.
③ Spark requests its resources once up front, whereas MapReduce requests resources repeatedly as the job progresses.
④ Spark's programming models (RDD / DataFrame / DataSet) are more flexible.
⑤ A MapReduce task has its maximum JVM memory fixed at launch and cannot exceed it; Spark, once it exceeds its specified maximum memory, can additionally use operating-system memory, which guarantees basic memory availability while avoiding the waste of allocating too much up front.
⑥ In MapReduce each task runs in its own process and tasks execute in sequence; in Spark each task runs in a thread, which increases parallelism.
Besides Map and Reduce, Spark is not just two fixed algorithms: it provides a map stage and a reduce stage, and each stage offers many operators (map, flatMap, filter, keyBy, etc. for the map stage; reduceByKey, sortByKey, mean, groupBy, sort, etc. for the reduce stage). It also supports several data models such as RDD (which encapsulates the computation logic and does not itself store data), DataFrame and DataSet, so the programming model is more flexible. Because Spark may use operating-system memory once it exceeds its specified maximum, a container can exceed what YARN granted it and be killed by YARN's memory checks; to avoid that, modify the Hadoop configuration file /opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml:
<!-- Whether to start a thread that checks the amount of physical memory each task is using and kills any task that exceeds its allocation; the default is true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to start a thread that checks the amount of virtual memory each task is using and kills any task that exceeds its allocation; the default is true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Related reading: resolving YARN memory exceeding the configured yarn.nodemanager.resource.memory-mb; Spark memory management explained in depth (on-heap vs. off-heap memory).
4. Spark Core Programming
For the log4j configuration file, see the post "031 Log4j日志框架".
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.jieky.studySpark</groupId>
<artifactId>studySpark</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.0.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- slf4j-log4j12 is the binding for log4j 1.x; log4j-slf4j-impl is the binding for log4j 2.x -->
<!-- This dependency must be declared before the bridge dependencies below, otherwise errors occur -->
<!--The Apache Log4j SLF4J API binding to Log4j 2 Core-->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>2.9.1</version>
</dependency>
<!-- To deal with multiple logging frameworks coexisting, Ceki's SLF4J offers the solution below,
known as "bridging legacy": intercept all third-party log output and redirect it to SLF4J, unifying
the upper-level logging API (what you code against) with the lower-level implementation (where and in what format the logs are written). -->
<!--JCL 1.2 implemented over SLF4J-->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>1.7.36</version>
</dependency>
<!--JUL to SLF4J bridge-->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jul-to-slf4j</artifactId>
<version>1.7.36</version>
</dependency>
</dependencies>
</project>
package com.jieky.studySpark
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sparkContext = new SparkContext(sparkConf)
// Set the number of partitions. One partition is processed by one thread at a time, and a thread can be reused by several partitions
// Extreme case: only one thread but several partitions -- the partitions' data is then processed serially
val dataRDD: RDD[Int] = sparkContext.makeRDD(List(1,2,3,4), 5)
val fileRDD: RDD[String] = sparkContext.textFile("data",6)
dataRDD.collect().foreach(println)
fileRDD.collect().foreach(println)
sparkContext.stop()
}
}
4.1. Data is partitioned according to the configured parallelism
val rdd1 : RDD[Int] = sc.makeRDD(Seq(1,2,3,4,5))
def parallelize[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
assertNotStopped()
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
override def getPartitions: Array[Partition] = {
val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
if (numSlices < 1) {
throw new IllegalArgumentException("Positive number of partitions required")
}
// Sequences need to be sliced at the same set of index positions for operations
// like RDD.zip() to behave as expected
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
(0 until numSlices).iterator.map { i =>
val start = ((i * length) / numSlices).toInt
val end = (((i + 1) * length) / numSlices).toInt
(start, end)
}
}
seq match {
case r: Range =>
positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
// If the range is inclusive, use inclusive range for the last slice
if (r.isInclusive && index == numSlices - 1) {
new Range.Inclusive(r.start + start * r.step, r.end, r.step)
}
else {
new Range(r.start + start * r.step, r.start + end * r.step, r.step)
}
}.toSeq.asInstanceOf[Seq[Seq[T]]]
case nr: NumericRange[_] =>
// For ranges of Long, Double, BigInteger, etc
val slices = new ArrayBuffer[Seq[T]](numSlices)
var r = nr
for ((start, end) <- positions(nr.length, numSlices)) {
val sliceSize = end - start
slices += r.take(sliceSize).asInstanceOf[Seq[T]]
r = r.drop(sliceSize)
}
slices
case _ =>
val array = seq.toArray // To prevent O(n^2) operations for List etc
positions(array.length, numSlices).map { case (start, end) =>
array.slice(start, end).toSeq
}.toSeq
}
}
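To see the effect of this slicing logic, here is a small sketch (assuming the sc from the examples above; glom collects each partition's contents so they can be printed):
// With 5 elements and numSlices = 5, positions() yields [0,1) [1,2) [2,3) [3,4) [4,5),
// so every partition holds exactly one element.
val rdd1 = sc.makeRDD(Seq(1, 2, 3, 4, 5), 5)
rdd1.glom().collect().foreach(a => println(a.mkString("[", ",", "]")))   // [1] [2] [3] [4] [5]

// With 4 elements and numSlices = 3, positions() yields [0,1) [1,2) [2,4),
// so the partitions are [1], [2], [3,4].
val rdd2 = sc.makeRDD(Seq(1, 2, 3, 4), 3)
rdd2.glom().collect().foreach(a => println(a.mkString("[", ",", "]")))

// The same arithmetic explains why makeRDD(List(1,2,3,4), 5) in the earlier example
// produces one empty partition: positions(4, 5) starts with the empty range [0, 0).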
4.2. Spark's file reading is Hadoop's file reading under the hood, so the final number of partitions is the number of splits Hadoop creates for the file
// Set the expected minimum number of splits (partitions)
val rdd: RDD[String] = sc.textFile("data/word*.txt", 2)
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
}
public class TextInputFormat extends FileInputFormat<LongWritable, Text>
public InputSplit[] getSplits(JobConf job, int numSplits)
throws IOException {
StopWatch sw = new StopWatch().start();
FileStatus[] files = listStatus(job);
// Save the number of input files for metrics/loadgen
job.setLong(NUM_INPUT_FILES, files.length);
long totalSize = 0; // compute total size
for (FileStatus file: files) { // check we have valid files
if (file.isDirectory()) {
throw new IOException("Not a file: "+ file.getPath());
}
totalSize += file.getLen();
}
// goalSize: the expected number of bytes each split should handle
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);
// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
Path path = file.getPath();
long length = file.getLen();
if (length != 0) {
FileSystem fs = path.getFileSystem(job);
BlockLocation[] blkLocations;
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus) file).getBlockLocations();
} else {
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (isSplitable(fs, path)) {
long blockSize = file.getBlockSize();
// compute the final split size; minSize defaults to 1
long splitSize = computeSplitSize(goalSize, minSize, blockSize);
long bytesRemaining = length;
// SPLIT_SLOP = 1.1: no new split is created for a remainder within 10% of splitSize
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
length-bytesRemaining, splitSize, clusterMap);
splits.add(makeSplit(path, length-bytesRemaining, splitSize,
splitHosts[0], splitHosts[1]));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
- bytesRemaining, bytesRemaining, clusterMap);
splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
splitHosts[0], splitHosts[1]));
}
} else {
String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,0,length,clusterMap);
splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
}
} else {
//Create empty hosts array for zero length files
splits.add(makeSplit(path, 0, length, new String[0]));
}
}
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Total # of splits generated by getSplits: " + splits.size()
+ ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
}
return splits.toArray(new FileSplit[splits.size()]);
}
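As a worked example of the logic above (the numbers are purely illustrative): computeSplitSize returns max(minSize, min(goalSize, blockSize)), so for a single 300 MB file read with minPartitions = 2 on 128 MB blocks:
// goalSize = 300 MB / 2 = 150 MB, minSize = 1, blockSize = 128 MB
val goalSize  = 150L * 1024 * 1024
val minSize   = 1L
val blockSize = 128L * 1024 * 1024
val splitSize = math.max(minSize, math.min(goalSize, blockSize))   // = 128 MB
// Split loop: 300/128 = 2.34 > 1.1 -> one 128 MB split; 172/128 = 1.34 > 1.1 -> another 128 MB split;
// 44/128 = 0.34 <= 1.1 -> the remaining 44 MB becomes the last split.
// So sc.textFile(path, 2) actually produces 3 partitions in this case.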
4.3. How data is divided among partitions is decided by Hadoop (when reading files)
val conf = new SparkConf().setMaster("local[*]").setAppName("RDD")
val sc = new SparkContext(conf)
// TODO 1. How data is assigned to partitions is also decided by Hadoop.
// TODO 2. Hadoop uses different logic for computing the splits and for reading the data.
// TODO 3. Spark reads file data through Hadoop underneath, so Hadoop's reading rules apply:
// 3.1 Hadoop reads data line by line, not byte by byte
// 3.2 Hadoop reads data based on byte offsets
// 3.3 Hadoop never reads the same offset twice
val rdd = sc.textFile("data/word.txt", 3)
rdd.saveAsTextFile("output")
sc.stop()
/*
Data in the file: 1\r\n, 2\r\n, 3\r\n   (@@ below stands for \r\n)
1@@ => byte offsets 0 1 2
2@@ => byte offsets 3 4 5
3   => byte offset  6
computed read offsets of each split => data actually read
[0, 3] => [1, 2]
[3, 6] => [3]
[6, 7] => []
*/
4.4. Operators (distributed computation is not the same as single-machine computation)
The number of partitions generally does not change.
The partition a given piece of data lives in generally does not change.
Data is ordered within a partition and unordered across partitions.
For a single element, the processing logic of successive RDDs is applied in order.
Across different elements within a partition, there is no ordering guarantee for the processing.
package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sparkContext = new SparkContext(sparkConf)
// map operates on every element of the RDD; if each element is inserted into a database, every single insert opens its own database connection
println("1.map--------------------------------")
val aa = sparkContext.parallelize(1 to 9, 4)
val aa_res = aa.map(temp => (temp, temp*2))
println(aa.getNumPartitions)
println(aa_res.collect().mkString)
// mapPartitions operates on the iterator of each partition of the RDD; if the data is inserted into a database, only one connection is opened per partition
println("2.mapPartitions--------------------------------")
val bb = sparkContext.parallelize(1 to 9, 4)
val bb_res =bb.mapPartitions(temp =>{
var result = List[(Int,Int)]()
while (temp.hasNext){
val cur = temp.next()
result = (cur,cur*2)::result
}
result.iterator
})
println(bb.getNumPartitions)
println(bb_res.collect().mkString)
// mapPartitionsWithIndex differs from mapPartitions in that the function also receives the partition index as input
println("3.mapPartitionsWithIndex--------------------------------")
val cc = sparkContext.parallelize(1 to 9, 4)
val cc_res = cc.mapPartitionsWithIndex((index,temp) =>{
var result = List[(Int,Int,Int)]()
while (temp.hasNext){
val cur = temp.next()
result = (index,cur,cur*2)::result
}
result.iterator
})
println(cc.getNumPartitions)
println(cc_res.collect().mkString)
sparkContext.stop()
}
}
1.map--------------------------------
4
(1,2)(2,4)(3,6)(4,8)(5,10)(6,12)(7,14)(8,16)(9,18)
2.mapPartitions--------------------------------
4
(2,4)(1,2)(4,8)(3,6)(6,12)(5,10)(9,18)(8,16)(7,14)
3.mapPartitionsWithIndex--------------------------------
4
(0,2,4)(0,1,2)(1,4,8)(1,3,6)(2,6,12)(2,5,10)(3,9,18)(3,8,16)(3,7,14)
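A runnable sketch of the practical difference described in the comments above (assuming the same sc; the SimpleDateFormat here merely stands in for an expensive per-use resource such as a database connection):
import java.text.SimpleDateFormat
import java.util.Date

val data = sc.parallelize(1577836800000L to 1577836800003L, 2)   // illustrative epoch milliseconds

// map: the "resource" is created once per record
val perRecord = data.map { ts =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.format(new Date(ts))
}

// mapPartitions: the "resource" is created once per partition and reused for all of its records
val perPartition = data.mapPartitions { it =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  it.map(ts => fmt.format(new Date(ts)))
}

println(perRecord.collect().mkString(", "))
println(perPartition.collect().mkString(", "))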
Further reading: an in-depth look at Spark wide and narrow dependencies (ShuffleDependency & NarrowDependency).
In short, with a NarrowDependency the data of each parent RDD partition flows in its entirety into a single child RDD partition (one or several parent partitions may feed the same child partition), whereas with a ShuffleDependency each parent partition is split up and its parts flow into different child partitions.
Spark distinguishes NarrowDependency from ShuffleDependency so that the dependency types can be classified and it is clear how data flows in and out, which makes it easier to generate the corresponding physical execution plan. A NarrowDependency needs no shuffle and can be pipelined; a ShuffleDependency requires a shuffle, and wherever there is a shuffle a new stage has to be created.
Transformation operators: lazily evaluated, they only run when triggered by an action.
① Narrow-dependency transformations: filter, map, flatMap, sample, union, mapPartitions, mapPartitionsWithIndex, zip
② Wide-dependency transformations: sortBy, sortByKey, reduceByKey, join, leftOuterJoin, rightOuterJoin, fullOuterJoin, distinct, cogroup, intersection, repartition
③ coalesce can decrease the number of partitions, or increase them when its shuffle flag is enabled; by default there is no shuffle (the repartition operator is just coalesce with shuffle = true). With a shuffle it creates a wide dependency, without one it is a narrow dependency.
Action operators: they trigger the execution of the transformation operators; each action operator in an application produces one job.
① List: foreach, count, collect, first, take, foreachPartition, reduce, countByKey, countByValue
Persistence operators:
① List: cache, persist
map turns an RDD of N elements into another RDD of N elements (each element being the object produced by the function, e.g. an array); flatMap applies the same function and then flattens the N resulting collections, so the result is a single RDD whose elements are the individual items of those collections.
package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(List("coffee panda","happy panda","happiest panda party"),20)
val temp1 = rdd.map(x=>x.split("\\s+")).collect()
// print the number of objects and their runtime type
println(temp1.size,temp1.getClass.getSimpleName())
temp1.foreach(_.foreach(println(_)))
println("-"*20)
val temp2 = rdd.flatMap(x=>x.split("\\s+")).collect()
// print the number of objects and their runtime type
println(temp2.size,temp2.getClass.getSimpleName())
temp2.foreach(println(_))
sc.stop()
}
}
(3,String[][])
coffee
panda
happy
panda
happiest
panda
party
--------------------
(7,String[])
coffee
panda
happy
panda
happiest
panda
party
Further reading: understanding Spark partitions / the difference between coalesce and repartition.
repartition is simply coalesce with its shuffle flag set to true.
① With multiple executors, if you want the result to have more files than the source RDD has partitions, coalesce with shuffle = false cannot do it: for example, given 4 small files (4 partitions), you cannot produce 5 files with coalesce, i.e. without a shuffle the number of partitions cannot increase.
② If you have only 1 executor (1 core), the source RDD has 5 partitions and you use coalesce to produce 2 files, the source partitions are pre-assigned to the output partitions: for example partitions 0-2 are processed on the executor first, then partitions 3-4 on the same executor, i.e. the same executor reads the different groups of data serially, one after the other. This is quite different from repartition(2), which reads partitions 0-4 serially one by one and redistributes every record by hashing it modulo 2.
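A short runnable sketch of these two points (assuming the sc from the examples above), starting from 5 partitions:
val src = sc.parallelize(1 to 10, 5)
println(src.coalesce(2).getNumPartitions)                  // 2  -- narrow: partitions are merged, no shuffle
println(src.coalesce(8).getNumPartitions)                  // 5  -- without shuffle the count cannot grow
println(src.coalesce(8, shuffle = true).getNumPartitions)  // 8  -- with shuffle it can grow
println(src.repartition(2).getNumPartitions)               // 2  -- repartition = coalesce(..., shuffle = true)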
package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(List("coffee panda","happy panda","happiest panda party"),20)
val temp2 = rdd.flatMap(x=>x.split("\\s+")).distinct().collect()
// print the number of objects and their runtime type
println(temp2.size,temp2.getClass.getSimpleName())
temp2.foreach(println(_))
sc.stop()
}
}
(5,String[])
coffee
panda
happiest
party
happy
Further reading: Spark source-code analysis of the sort operators -- cases where sortBy and sortByKey appear not to sort.
package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sc = new SparkContext(sparkConf)
val array_left = 1 until 4 // the Range 1 until 4, i.e. the values 1, 2, 3
val array_right = Array("工单", "电力", "展示")
val result = array_left.zip(array_right)
// foreach here is the Spark action operator: output is ordered within a partition but unordered across partitions
println(sc.parallelize(result,2).sortBy(_._1,true).foreach(println(_)))
println(sc.parallelize(result,2).sortByKey(true).foreach(println(_)))
println("*"*20)
// foreach action operator again: ordered within a partition, unordered across partitions
println(sc.parallelize(result,1).sortBy(_._1,true).foreach(println(_)))
println(sc.parallelize(result,1).sortByKey(true).foreach(println(_)))
println("*"*20)
// collect action operator: the collected result is globally ordered; the foreach here is Scala's foreach, not a Spark operator
println(sc.parallelize(result,2).sortBy(_._1,true).collect().foreach(println(_)))
println(sc.parallelize(result,2).sortByKey(true).collect().foreach(println(_)))
sc.stop()
}
}
(1,工单)
(3,展示)
(2,电力)
()
(3,展示)
(1,工单)
(2,电力)
()
********************
(1,工单)
(2,电力)
(3,展示)
()
(1,工单)
(2,电力)
(3,展示)
()
********************
(1,工单)
(2,电力)
(3,展示)
()
(1,工单)
(2,电力)
(3,展示)
()
def makeIncreaser(more:Int) = (x:Int) => x + more
// inc1 and inc9999 are closures: each captures its own binding of more and can read its current value (a closure can even modify a captured var from inside the function)
val inc1=makeIncreaser(1)
val inc9999=makeIncreaser(9999)
println(inc1(10))
println(inc9999(10))
11
10009
package com.jieky.studySpark
import org.apache.spark.{SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sc = new SparkContext(sparkConf)
val array_left = 1 until 4 // the Range 1 until 4, i.e. the values 1, 2, 3
val array_right = Array("工单", "电力", "展示")
val result = array_left.zip(array_right)
val temp = sc.parallelize(result)
.sortBy(_._1)
.map(_._1)
.filter(_%2==0)
.map(_*2)
// RDD lineage
println(temp.toDebugString)
/*
Wide dependency: ShuffleDependency
Narrow dependencies: OneToOneDependency, RangeDependency, NarrowDependency
PS: OneToOneDependency and RangeDependency are subclasses of NarrowDependency
*/
// direct RDD dependencies
println(temp.dependencies)
sc.stop()
}
}
(3) MapPartitionsRDD[8] at map at App.scala:20 []
| MapPartitionsRDD[7] at filter at App.scala:19 []
| MapPartitionsRDD[6] at map at App.scala:18 []
| MapPartitionsRDD[5] at sortBy at App.scala:17 []
| ShuffledRDD[4] at sortBy at App.scala:17 []
+-(4) MapPartitionsRDD[1] at sortBy at App.scala:17 []
| ParallelCollectionRDD[0] at parallelize at App.scala:16 []
List(org.apache.spark.OneToOneDependency@e5cbff2)
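For contrast, a small sketch (same sc as above): when the last transformation is a wide one such as reduceByKey, dependencies reports a ShuffleDependency instead of OneToOneDependency:
val wide = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
println(wide.dependencies)    // e.g. List(org.apache.spark.ShuffleDependency@...)
println(wide.toDebugString)   // the lineage now contains a ShuffledRDD, i.e. a stage boundary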
RDD work is divided into Application, Job, Stage and Task:
① Application: initializing a SparkContext creates one Application; the whole program is one Application, and setAppName in the code names it.
② Job: every action operator produces one Job.
③ Stage: the number of stages equals the number of wide (shuffle) dependencies plus 1 (the +1 is the ResultStage, the final stage of the whole flow).
④ Task: within a stage, the number of partitions of the last RDD is the number of Tasks.
PS: Application -> Job -> Stage -> Task: each level is a 1-to-n relationship with the next.
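A small sketch of these counting rules (illustrative, assuming the usual sc): each action produces a job, and each shuffle dependency adds one stage on top of the final ResultStage:
val counts = sc.parallelize(List("a b", "b c"), 2)
  .flatMap(_.split(" "))        // narrow
  .map((_, 1))                  // narrow
  .reduceByKey(_ + _, 2)        // wide: one ShuffleDependency

counts.collect()                // action #1 -> Job 0 with 1 + 1 = 2 stages
counts.count()                  // action #2 -> Job 1 (a second job in the same application)
// The final RDD has 2 partitions, so the ResultStage of each job runs 2 tasks.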
Spark currently supports hash partitioning and range partitioning, and custom partitioners can be defined as well; hash partitioning is the default. The partitioner directly determines the number of partitions of the RDD, which partition each record goes to after a shuffle, and the number of reduce tasks.
Note:
(1) Only key-value RDDs have a partitioner; for non-key-value RDDs the partitioner is None.
(2) Each record is assigned a partition ID in the range 0 to numPartitions - 1, which determines which partition it belongs to.
Hash partitioner: fast, but may produce data skew.
Range partitioner: slower, but avoids data skew to some extent.
package com.jieky.studySpark
import org.apache.spark.{HashPartitioner, Partitioner, RangePartitioner, SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sc = new SparkContext(sparkConf)
val array_left = 1 until 4 // the Range 1 until 4, i.e. the values 1, 2, 3
val array_right = Array("工单", "电力", "展示")
val result = array_left.zip(array_right)
val temp = sc.parallelize(result,5)
println("分区器:",temp.partitioner)
println("分区:")
temp.partitions.foreach(println(_))
println("-"*20)
// The HashPartitioner constructor argument 3 is the number of partitions, which is also the number of reduce tasks started,
// and also the length of the array returned by the partitions method of the child RDD that reduceByKey returns.
val temp1 = sc.parallelize(result,2).partitionBy(new HashPartitioner(3))
println("分区器:",temp1.partitioner)
println("分区:")
temp1.partitions.foreach(println(_))
println("-"*20)
val nopar = sc.parallelize(List((1,3),(1,2),(2,4),(2,3),(3,6),(3,8)),8)
//val temp2 = nopar.mapPartitionsWithIndex((index,iter)=>{ Iterator(index.toString+" : "+iter.mkString("|")) }).collect()
//temp2.foreach(println(_))
/*
If no partitioner is specified explicitly, one is chosen according to these rules:
1. If a parent RDD has a partitioner, use the parent's partitioner.
2. Otherwise, if spark.default.parallelism is defined in the SparkConf, return new HashPartitioner(sc.defaultParallelism).
3. Otherwise, return new HashPartitioner(rdd_parent.partitions.length) as the default partitioner.
*/
val hashpar = nopar.partitionBy(new org.apache.spark.HashPartitioner(7))
println(hashpar.count)
println(hashpar.partitioner)
println("-"*20)
val pairs = sc.parallelize(List((1,1),(2,2),(3,3)))
val rangePartitioned = pairs.partitionBy(new RangePartitioner(2,pairs))
println("分区器:"+rangePartitioned.partitioner)
println("分区:")
rangePartitioned.partitions.foreach(println)
println("-"*20)
// Custom partitioner: the functions numPartitions and getPartition must be overridden
val listRDD = sc.makeRDD(List(("a",1),("b",2),("c",3))).partitionBy(new Partitioner{
override def numPartitions: Int = {
3
}
override def getPartition(key: Any): Int = {
1
}
})
println("分区器:"+listRDD.partitioner)
println("分区:")
listRDD.partitions.foreach(println)
}
}
Further reading: Spark shared variables -- accumulators (with a review of transformations and actions); Spark persistence (the difference between cache and persist).
If you want to count certain events while tasks are computing, filter/reduce would also work, but an accumulator is a more convenient way; a classic use case is counting events in a Spark Streaming application. Note that only the Driver can read an accumulator's value, while the tasks can only add to it (each task accumulates into its own copy and the partial results are merged, so there is no race that corrupts the count).
Spark's Accumulator is mainly used so that multiple nodes can operate on a shared variable. An Accumulator only supports accumulation: values can be added but not subtracted. An accumulator can only be constructed on the Driver and its result can only be read on the Driver; tasks can only add to it. The accumulator variable is only actually updated when an action operator runs.
Note: accumulator updates made on the executors are sent back to the Driver. To reduce this network traffic you can accumulate into a local variable first and add the batched value to the accumulator once it reaches a threshold (or once per partition), cutting down the number of updates.
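A hedged sketch of the batching idea just mentioned (assuming the sc used in the other examples): accumulate into a local variable inside each partition and update the accumulator once per partition instead of once per record:
val badRecords = sc.longAccumulator("badRecords")
sc.parallelize(1 to 100000, 8).foreachPartition { it =>
  var local = 0L
  it.foreach { n =>
    if (n % 97 == 0) local += 1    // pretend "n % 97 == 0" marks a bad record
  }
  badRecords.add(local)            // one accumulator update per partition
}
println(badRecords.value)          // read the total on the Driver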
There is a usage rule for cache() and persist(): call them right after the RDD is created by a transformation or by textFile etc., i.e. before the first action that computes it. They only mark the RDD for persistence and return the same RDD, so calling them only after the data has already been computed by an action does nothing for that computation. As the source code shows, cache() is the simplified form of persist(): it just calls the no-argument version of persist().
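A minimal sketch of this relationship (assuming the usual sc; the file path is the one used in the earlier examples): cache() is persist() with the default MEMORY_ONLY storage level, while persist() lets you choose another StorageLevel:
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("data/word.txt").filter(_.nonEmpty)
words.cache()                                   // same as words.persist(StorageLevel.MEMORY_ONLY)
// words.persist(StorageLevel.MEMORY_AND_DISK)  // alternative level: spill to disk when memory is short
words.count()                                   // first action materializes and caches the RDD
words.count()                                   // second action reads from the cache
words.unpersist()                               // release the cached blocks when no longer needed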
package com.jieky.studySpark
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{HashPartitioner, Partitioner, RangePartitioner, SparkConf, SparkContext}
object App {
def main(args: Array[String]): Unit = {
// Set the parallelism: local[*] means use as many threads as this machine has virtual cores
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("spark")
// Manually set the default parallelism, i.e. the number of tasks (threads) that can run at the same time
sparkConf.set("spark.default.parallelism", "4")
val sc = new SparkContext(sparkConf)
// sc.collectionAccumulator[String]("")
// sc.doubleAccumulator("")
val accum = sc.longAccumulator("Error2 Accumulator")
val numberRDD = sc.parallelize(1 to 10).map(n => {
accum.add(1)
n + 1
})
// Use cache() (or persist()); otherwise every action recomputes the RDD from scratch and the accumulator would be incremented again each time
numberRDD.cache().count()
println("accum1: " + accum.value)
numberRDD.reduce(_+_)
println("accum2: " + accum.value)
println("-"*20)
// Custom accumulator
val myAccum = new MyAccumulatorV2
sc.register(myAccum,"DIY累加器")
val sum: Int = sc.parallelize(
Array("1", "2a", "3", "4f", "a5", "6", "2a"), 2)
.filter(line => {
val pattern = """^-?(\d+)"""
val flag = line.matches(pattern)
if (flag) {
myAccum.add(line)
}
flag
}
).map(_.toInt).reduce(_+_)
println("计算:"+sum+" = "+ myAccum.value.toArray().mkString("+"))
sc.stop()
}
}
class MyAccumulatorV2 extends AccumulatorV2[String, java.util.Set[String]]{
private val set:java.util.Set[String] = new java.util.HashSet[String]()
// Returns whether this accumulator is zero (empty)
override def isZero: Boolean = {
set.isEmpty
}
// Resets this accumulator to its initial state
override def reset(): Unit = {
set.clear()
}
// Adds a value to this accumulator
override def add(v: String): Unit = {
set.add(v)
}
// Merges another accumulator of the same type into this one
override def merge(other: AccumulatorV2[String, java.util.Set[String]]): Unit = {
other match {
case o:MyAccumulatorV2 => set.addAll(o.value)
}
}
// Returns the current value of this accumulator
override def value: java.util.Set[String] = {
// Returns an unmodifiable view of the specified set
java.util.Collections.unmodifiableSet(set)
}
// Creates a new copy of this accumulator
override def copy(): MyAccumulatorV2 = {
val newAcc = new MyAccumulatorV2()
// synchronize on the set object
set.synchronized{
newAcc.set.addAll(set)
}
newAcc
}
}