I. Cluster Setup

Installing and deploying a Spark 3.0.0 distributed cluster on Hadoop 3.3.0
 
Official tutorial
 

II. spark-shell in Practice

1. Create a file on the master node (any node in the Spark cluster will do)

[root@master ~]# touch /tmp/spark-test.txt

2. Write some content into the file

[root@master ~]# echo "spark hive hadoop hdfs mr pig flume hbase flink kudu zookeeper yarn java mysql file desk home dev root test phone map list for if else try hdfs mr pig flume hbase flink kudu zookeeper hive hadoop hdfs mr pig flume hbase flink kudu zookeeper yarn java mysql file desk home dev root test phone map list for if zookeeper yarn java mysql file desk mr pig flume hbase flink" >> /tmp/spark-test.txt

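3. Start HDFS (steps omitted)
4. Start ZooKeeper (steps omitted)
5. Start YARN and related services (steps omitted)
6. Create the Spark input directory in HDFS (the directory that holds the source file)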

[root@master ~]# hdfs dfs -mkdir -p /user/spark/in

7. Upload the spark-test.txt file created and edited above to HDFS

[root@master ~]# hdfs dfs -put /tmp/spark-test.txt /user/spark/in

8. Run spark-shell from the console

[root@master ~]# /usr/bigdata/spark-3.0.0-bin-hadoop3.2/bin/spark-shell

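9. Once spark-shell has created sc (the SparkContext), a local file can be loaded to create an RDD. This test loads the spark-test.txt file created above, and textFile returns a MapPartitionsRDD.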

// Local file
scala> val textFile = sc.textFile("file:///tmp/spark-test.txt");
// HDFS
scala> val textFile = sc.textFile("hdfs:///user/spark/in/spark-test.txt");

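textFile: org.apache.spark.rdd.RDD[String] = file:///tmp/spark-test.txt MapPartitionsRDD[1] at textFile at <console>:24

10. HDFS files and local files are both loaded with textFile; the only difference is the prefix (hdfs:// or file://) that identifies the file system. In both cases textFile first creates a HadoopRDD and then maps it to a MapPartitionsRDD of lines, which is why the shell reports a MapPartitionsRDD.
On an RDD you can run a transformation, which returns a new RDD, or an action, which returns a result. The first command returns the first line of the file, and the count command returns the total number of lines.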

scala> textFile.count();
res12: Long = 1
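
To make the difference between a transformation and an action concrete, here is a minimal sketch that prints the lineage of an RDD built by textFile and then triggers a job. It assumes Spark is on the classpath and runs in local mode; the temp file, the app name lineage-sketch, and the variable names are illustrative and not part of the original session. For a file:// path the lineage should still show a HadoopRDD underneath the MapPartitionsRDD.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration. Inside spark-shell, getOrCreate()
// returns the session that the shell has already created.
val spark = SparkSession.builder().master("local[*]").appName("lineage-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical two-line input file written to a temp location.
val tmp = Files.createTempFile("spark-lineage", ".txt")
Files.write(tmp, "spark hive hadoop\nhdfs mr".getBytes(StandardCharsets.UTF_8))

val lines = sc.textFile(tmp.toUri.toString)  // file:// URI
val lineLengths = lines.map(_.length)        // transformation: lazy, no job runs yet

val lineage = lineLengths.toDebugString      // textual description of the RDD chain
println(lineage)

val totalChars = lineLengths.reduce(_ + _)   // action: this is what runs the job
println(totalChars)
```

Nothing is read from disk until reduce is called; map only records the step in the lineage.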

11. View the content of the first line

scala> textFile.first();
res3: String = spark hive hadoop hdfs mr pig flume hbase flink kudu zookeeper yarn java mysql file desk home dev root test phone map list for if else try hdfs mr pig flume hbase flink kudu zookeeper hive hadoop hdfs mr pig flume hbase flink kudu zookeeper yarn java mysql file desk home dev root test phone map list for if zookeeper yarn java mysql file desk mr pig flume hbase flink

12. Compute the number of words in the line that has the most words

scala> textFile.map(line =>line.split(" ").size).reduce((a,b) => if (a > b) a else b);
res8: Int = 70
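
Note that split(" ") also counts empty strings when a line is empty or contains consecutive spaces. The sketch below shows one way to make the count tolerant of that, using plain Scala collections so it runs without Spark. It assumes words are separated by whitespace; sample and wordsIn are hypothetical names, not part of the original session.

```scala
// Hypothetical sample lines: the second has a double space, the third is empty.
val sample = Seq("spark hive hadoop", "hdfs  mr pig flume", "")

// Count words by splitting on runs of whitespace and ignoring empty tokens.
def wordsIn(line: String): Int = line.split("\\s+").count(_.nonEmpty)

val naive = sample.map(_.split(" ").length).max  // what split(" ") reports
val longest = sample.map(wordsIn).max            // whitespace-tolerant count

println(s"naive=$naive, longest=$longest")       // prints naive=5, longest=4
```

In the shell session, the same helper could be applied to the RDD as textFile.map(wordsIn).reduce(_ max _). For the test file used here, which has single spaces only, both approaches give 70.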

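13. Build the word counts. This pipeline returns a ShuffledRDD of (String, Int) key-value pairs: reduceByKey needs a shuffle to bring identical keys together, so the RDD it returns is a ShuffledRDD. collect is then used to gather the results.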

scala> val wordCount = textFile.flatMap(line =>line.split(" ")).map(x => (x,1)).reduceByKey((a,b) => a+b);
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25

14. Count the occurrences of each word

scala> wordCount.collect
res9: Array[(String, Int)] = Array((hive,2), (phone,2), (flink,4), (zookeeper,4), (map,2), (mysql,3), (list,2), (pig,4), (java,3), (root,2), (yarn,3), (file,3), (test,2), (desk,3), (spark,1), (hadoop,2), (if,2), (home,2), (mr,4), (flume,4), (else,1), (kudu,3), (try,1), (for,2), (dev,2), (hdfs,3), (hbase,4))

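15. The placeholder _ makes the expression more concise. It is a characteristic feature of the Scala language; each _ stands for one parameter.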

scala> val wordCount2 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_);
wordCount2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[12] at reduceByKey at <console>:25
scala> wordCount2.collect
res10: Array[(String, Int)] = Array((hive,2), (phone,2), (flink,4), (zookeeper,4), (map,2), (mysql,3), (list,2), (pig,4), (java,3), (root,2), (yarn,3), (file,3), (test,2), (desk,3), (spark,1), (hadoop,2), (if,2), (home,2), (mr,4), (flume,4), (else,1), (kudu,3), (try,1), (for,2), (dev,2), (hdfs,3), (hbase,4))
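
To show what the placeholders expand to, here is a small sketch on plain Scala collections (no Spark required). The sample data and variable names are illustrative, and groupBy is used only as a local stand-in for reduceByKey, which exists on pair RDDs rather than on Seq.

```scala
// Hypothetical sample input.
val lines = Seq("spark hive", "hive hdfs")

// Explicit parameter names ...
val explicit = lines.flatMap(line => line.split(" ")).map(word => (word, 1))
// ... and the equivalent placeholder form: each _ is one parameter, used once.
val short = lines.flatMap(_.split(" ")).map((_, 1))

// Local stand-in for reduceByKey(_ + _): group the pairs by word, then sum the
// ones. Here _ + _ is shorthand for (a, b) => a + b.
val counts = short.groupBy(_._1).map { case (word, pairs) =>
  (word, pairs.map(_._2).reduce(_ + _))
}

println(explicit == short)  // prints true
println(counts)
```

Because each _ refers to a different parameter, an expression that uses the same parameter twice still needs the explicit form, for example x => (x._2, x._1).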

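16. Spark does not sort results by default. If sorted output is needed, swap each key and value, call sortByKey with true for ascending or false for descending order, and then swap them back.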

scala> val wordCount3 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1));
wordCount3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[20] at map at <console>:25
scala> wordCount3.collect
res11: Array[(String, Int)] = Array((flink,4), (zookeeper,4), (pig,4), (mr,4), (flume,4), (hbase,4), (mysql,3), (java,3), (yarn,3), (file,3), (desk,3), (kudu,3), (hdfs,3), (hive,2), (phone,2), (map,2), (list,2), (root,2), (test,2), (hadoop,2), (if,2), (home,2), (for,2), (dev,2), (spark,1), (else,1), (try,1))
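
As an alternative to swapping keys and values, RDDs also offer sortBy, which takes the sort key as a function. The following is a minimal sketch rather than the method used in the session above; it assumes Spark is on the classpath and runs in local mode, and the app name sortby-sketch and the sample counts are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration. Inside spark-shell, getOrCreate()
// returns the session that the shell has already created.
val spark = SparkSession.builder().master("local[*]").appName("sortby-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical counts standing in for the wordCount RDD.
val counts = sc.parallelize(Seq(("spark", 1), ("flink", 4), ("hdfs", 3)))

// Sort by the count (second tuple element), largest first.
val sorted = counts.sortBy(_._2, ascending = false).collect()

sorted.foreach(println)
```

In the session above the equivalent call would be wordCount.sortBy(_._2, ascending = false). With either approach, the relative order of words that share the same count is not guaranteed.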

This concludes the walkthrough of running WordCount in spark-shell on a Spark 3.0.0 distributed cluster deployed on Hadoop 3.3.0. I hope you find it helpful!

Other references