Big Data Series: Operations (on a Self-Built Big Data Platform)
(9) Spark Operations
- Open a Linux shell and start the spark-shell terminal, then submit the process information of the launched program as text in the answer box.
[root@master ~]# spark-shell
20/03/31 21:31:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = local[*], app id = local-1585661525987).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
- After starting spark-shell, load the data 1,2,3,4,5,6,7,8,9,10 in Scala, multiply each value by 2, keep the doubled values that are divisible by 3, and inspect the RDD lineage with the toDebugString method.
scala> val num = sc.parallelize(1 to 10)
num: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val doublenum = num.map(_*2)
doublenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25
scala> val threenum = doublenum.filter(_%3==0)
threenum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:25
scala> threenum.collect
res0: Array[Int] = Array(6, 12, 18)
scala> threenum.toDebugString
res1: String = (2) MapPartitionsRDD[2] at filter at <console>:25 []
| MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
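The same pipeline can also be written as one chained expression; a sketch, assuming the same spark-shell session where sc is predefined:
val threes = sc.parallelize(1 to 10).map(_ * 2).filter(_ % 3 == 0)
threes.toDebugString  // the lineage reads bottom-up: parallelize -> map -> filter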
- After starting spark-shell, load the key-value data ("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5) in Scala, sort the records in ascending order by key, and group them by key.
scala> val kv = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5),("C",4),("A",3),("A",9),("B",4),("D",5)))
kv: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> val kv1 = kv.sortByKey().collect
kv1: Array[(String, Int)] = Array((A,1), (A,4), (A,3), (A,9), (B,2), (B,5), (B,4), (C,3), (C,4), (D,5))
scala> kv.groupByKey().collect
res5: Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2, 5, 4)), (D,CompactBuffer(5)), (A,CompactBuffer(1, 4, 3, 9)), (C,CompactBuffer(3, 4)))
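groupByKey alone gives no ordering guarantee on the collected result; if the grouped output should also come back in key order, sortByKey can be appended. A minimal sketch, assuming the same kv RDD:
kv.groupByKey().sortByKey().collect()  // groups ordered A, B, C, D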
- After starting spark-shell, load the key-value data ("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5) in Scala, sort the records in ascending order by key, and sum the values of records that share the same key.
scala> val kv2 = sc.parallelize(List(("A",1),("B",3),("C",5),("D",4),("B",7),("C",4),("E",5),("A",8),("B",4),("D",5)))
kv2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:24
scala> val kv3 = kv2.sortByKey().collect
kv3: Array[(String, Int)] = Array((A,1), (A,8), (B,3), (B,7), (B,4), (C,5), (C,4), (D,4), (D,5), (E,5))
scala> kv2.reduceByKey(_+_).collect
res8: Array[(String, Int)] = Array((B,14), (D,9), (A,9), (C,9), (E,5))
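The two steps can also be combined so that the per-key sums themselves come back sorted; a sketch, assuming the same kv2 RDD:
kv2.reduceByKey(_ + _).sortByKey().collect()  // expected: Array((A,9), (B,14), (C,9), (D,9), (E,5))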
- After starting spark-shell, load the key-value data ("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4) in Scala, remove the duplicate key-value records, and inspect the RDD lineage with the toDebugString method.
scala> val kv4 = sc.parallelize(List(("A",4),("A",2),("C",3),("A",4),("B",5),("C",3),("A",4)))
kv4: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[15] at parallelize at <console>:24
scala> kv4.distinct.collect
res14: Array[(String, Int)] = Array((A,4), (B,5), (C,3), (A,2))
scala> kv4.toDebugString
res15: String = (2) ParallelCollectionRDD[15] at parallelize at <console>:24 []
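Note that calling toDebugString on kv4 shows only the original ParallelCollectionRDD, because the deduplicated RDD above was never assigned to a variable. To see the lineage that distinct adds, call toDebugString on the deduplicated RDD itself; a sketch:
val deduped = kv4.distinct()
deduped.toDebugString  // shows the shuffle stage introduced by distinct above the original ParallelCollectionRDD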
- After starting spark-shell, load two sets of key-value data, ("A",1),("B",2),("C",3),("A",4),("B",5) and ("A",1),("B",2),("C",3),("A",4),("B",5), in Scala, and JOIN the two sets on their keys.
scala> val kv5 = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv5: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[19] at parallelize at <console>:24
scala> val kv6 = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))
kv6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[20] at parallelize at <console>:24
scala> kv5.join(kv6).collect
res16: Array[(String, (Int, Int))] = Array((B,(2,2)), (B,(2,5)), (B,(5,2)), (B,(5,5)), (A,(1,1)), (A,(1,4)), (A,(4,1)), (A,(4,4)), (C,(3,3)))
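Keys A and B appear twice in both RDDs, so the join emits every left/right value combination (2 x 2 = 4 tuples per key), which explains the nine tuples above. A quick check, assuming the same kv5 and kv6:
kv5.join(kv6).countByKey()  // expected: Map(A -> 4, B -> 4, C -> 1)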
- Log in to spark-shell, define i with an initial value of 1 and sum with an initial value of 0, use a while loop to add up the numbers from 1 to 100, and finally print sum with Scala's standard output function.
scala> var i = 1
i: Int = 1
scala> var sum = 0
sum: Int = 0
scala> while(i<=100){
| sum += i
| i=i+1
| }
scala> println(sum)
5050
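For reference, the same total can be obtained without a mutable loop; a one-line sketch:
println((1 to 100).sum)  // 5050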
- Log in to spark-shell, define i with an initial value of 1 and sum with an initial value of 0, use a for loop to add up the numbers from 1 to 100, and finally print sum with Scala's standard output function.
scala> var i = 1
i: Int = 1
scala> var sum = 0
sum: Int = 0
scala> for(i<-1 to 100){
| sum += i
| }
scala> println(sum)
5050
- Log in to spark-shell, define the variables i and sum, initialize i to 1 and sum to 0 with a step of 3, use a while loop to add up the numbers from 1 to 2018, and finally print sum with Scala's standard output function.
scala> var i = 1
i: Int = 1
scala> var sum = 0
sum: Int = 0
scala> while(i<=2018){
| sum+=i
| i=i+3
| }
scala> println(sum)
679057
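Equivalently, the stepped sum can be expressed with a Range and the by method; a one-line sketch:
println((1 to 2018 by 3).sum)  // 679057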
- Every functional language provides the two functions map and flatMap. map, as its name suggests, takes a function and applies it to every element of a collection, returning the processed results. The only difference with flatMap is that the function passed to flatMap must return a collection (such as a List) for each element, so that the per-element results can then be flattened.
(1) Log in to spark-shell, define a list, and use the map function to multiply every element of the list by 2. The list's values are (1,2,3,4,5,6,7,8,9).
scala> import scala.math._
import scala.math._
scala> val nums=List(1,2,3,4,5,6,7,8,9)
nums: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> nums.map(x=>x*2)
res23: List[Int] = List(2, 4, 6, 8, 10, 12, 14, 16, 18)
(2) Log in to spark-shell, define a list, then use the flatMap function to turn the list into a List of Char and convert the characters to uppercase. The list's values are ("Hadoop","Spark","Java").
scala> import scala.math._
import scala.math._
scala> val data = List("Hadoop","Spark","Java")
data: List[String] = List(Hadoop, Spark, Java)
scala> data.flatMap(_.toUpperCase)
res24: List[Char] = List(H, A, D, O, O, P, S, P, A, R, K, J, A, V, A)
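For comparison, map with the same function keeps one result per element, while flatMap flattens each uppercased String into its characters; a sketch, assuming the same data list:
data.map(_.toUpperCase)     // List[String] = List(HADOOP, SPARK, JAVA)
data.flatMap(_.toUpperCase) // List[Char]   = the 15 characters shown above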
- Log in to the master node of the big data cloud host and create a new file abc.txt in the /root directory with the following contents:
hadoop hive
solr redis
kafka hadoop
storm flume
sqoop docker
spark spark
hadoop spark
elasticsearch hbase
hadoop hive
spark hive
hadoop spark
Then log in to spark-shell. First count the number of lines in abc.txt, then count the occurrences of each word in the document, sort them in ascending order by the word's first letter, and finally count the number of rows in the result.
scala> val words = sc.textFile("file:///root/abc.txt").count
words: Long = 11
scala> val words = sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).sortByKey().collect
words: Array[(String, Int)] = Array((docker,1), (elasticsearch,1), (flume,1), (hadoop,5), (hbase,1), (hive,3), (kafka,1), (redis,1), (solr,1), (spark,5), (sqoop,1), (storm,1))
scala> val words = sc.textFile("file:///root/abc.txt").flatMap(_.split("\\W+")).map(x=>(x,1)).reduceByKey(_+_).sortByKey().count
words: Long = 12
Note: abc.txt must not contain blank lines; blank lines would otherwise be counted as well (a defensive variant is sketched below).
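A defensive variant is to drop empty tokens before counting, so that blank lines cannot skew the word count; a sketch, assuming the same file path:
sc.textFile("file:///root/abc.txt")
  .flatMap(_.split("\\W+"))
  .filter(_.nonEmpty)          // discard the empty strings produced by blank lines
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .sortByKey()
  .count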
- Log in to spark-shell, define a List, and use Spark's built-in function to deduplicate that List.
scala> val data = sc.parallelize(List(1,2,3,4,5,1,2,3,4,5,3,2))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[68] at parallelize at <console>:27
scala> data.distinct.collect
res25: Array[Int] = Array(4, 2, 1, 3, 5)
- Log in to the spark-shell interactive interface and import the following data:
2017-01-01 a
2017-01-01 b
2017-01-01 c
2017-01-02 a
2017-01-02 b
2017-01-02 d
2017-01-03 b
2017-01-03 e
2017-01-03 f
As the imported data shows, three new users (a, b, c) appear on 2017-01-01, one new user (d) on 2017-01-02, and two new users (e, f) on 2017-01-03. Use Spark to count the number of new users added on each date and display the result.
scala> val rdd1 = spark.sparkContext.parallelize(Array(("2017-01-01","a"),("2017-01-01","b"),("2017-01-01","c"),("2017-01-02","a"),("2017-01-02","b"),("2017-01-02","d"),("2017-01-03","b"),("2017-01-03","e"),("2017-01-03","f")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[72] at parallelize at <console>:26
scala> val rdd2 = rdd1.map(kv=>(kv._2,kv._1))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[73] at map at <console>:30
scala> val rdd3 = rdd2.groupByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[74] at groupByKey at <console>:28
scala> val rdd4 = rdd3.map(kv=>(kv._2.min,1))
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[75] at map at <console>:30
scala> rdd4.countByKey().foreach(println)
(2017-01-03,2)
(2017-01-01,3)
(2017-01-02,1)
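An alternative that avoids groupByKey is to keep each user's earliest date with reduceByKey and then count per date; a sketch, assuming the same rdd2 of (user, date) pairs:
rdd2.reduceByKey((a, b) => if (a <= b) a else b)   // earliest date per user
    .map { case (_, firstDate) => (firstDate, 1) } // one record per new user
    .countByKey()
    .foreach(println)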
- Log in to the spark-shell interactive interface. Define a function named MaxNum that takes two Int parameters and returns a value: it compares the two arguments passed in and returns the larger one. After defining it, verify the function with 23876567 and 23786576.
scala> def MaxNum(x:Int,y:Int):Int = if(x>y) x else y
MaxNum: (x: Int, y: Int)Int
scala> MaxNum(23876567,23786576)
res27: Int = 23876567
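The same comparison is available in the Scala standard library; an equivalent sketch (maxNum2 is just an illustrative name):
def maxNum2(x: Int, y: Int): Int = math.max(x, y)
maxNum2(23876567, 23786576)  // 23876567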
- Log in to the spark-shell interactive interface. Define the one-dimensional integer array AA=[3,4,3,2,44,3,22,231,4,5,2,345,2,2,11,124,35,349,34] and a custom function sum() that, given the array name and the array length, returns the sum of the elements AA[i] at the odd positions i (counting positions from 1, i.e. the 1st, 3rd, 5th, ... elements, which correspond to the even 0-based indices).
scala> def sum(aa:Array[Int],n:Int):Int={
| var he=0
| var i=0
| for(i<-0 until(n,2))
| he=he+aa(i)
| he
| }
sum: (aa: Array[Int], n: Int)Int
scala> var aa = Array(3,4,3,2,44,3,22,231,4,5,2,345,2,2,11,124,35,349,34)
aa: Array[Int] = Array(3, 4, 3, 2, 44, 3, 22, 231, 4, 5, 2, 345, 2, 2, 11, 124, 35, 349, 34)
scala> aa.length
res37: Int = 19
scala> sum(aa,19)
res28: Int = 160
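The same positional sum can also be computed directly from the array without passing its length; a sketch (the even 0-based indices 0, 2, 4, ... are the odd positions when counting from 1):
aa.indices.filter(_ % 2 == 0).map(i => aa(i)).sum  // 160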