Spark SQL 编程指南

Spark SQL是用于结构化数据处理的一个模块。同Spark RDD 不同地方在于Spark SQL的API可以给Spark计算引擎提供更多地 信息,例如:数据结构、计算算子等。在内部Spark可以通过这些信息有针对对任务做优化和调整。这里有几种方式和Spark SQL进行交互,例如Dataset API和SQL等,这两种API可以混合使用。Spark SQL的一个用途是执行SQL查询。 Spark SQL还可用于从现有Hive安装中读取数据。从其他编程语言中运行SQL时,结果将作为Dataset/DataFrame返回,使用命令 行或JDBC / ODBC与SQL接口进行交互。

Dataset是一个分布式数据集合在Spark 1.6提供一个新的接口,Dataset提供RDD的优势(强类型,使用强大的lambda函 数)以及具备了Spark SQL执行引擎的优点。Dataset可以通过JVM对象构建,然后可以使用转换函数等(例如:map、flatMap、filter等),目前Dataset API支持Scala和Java 目前Python对Dataset支持还不算完备。

DataFrame是命名列的数据集,他在概念是等价于关系型数据库。DataFrames可以从很多地方构建,比如说结构化数据文 件、hive中的表或者外部数据库,使用Dataset[row]的数据集,可以理解DataFrame就是一个Dataset[Row].

SparkSession

Spark中所有功能的入口点是SparkSession类。要创建基本的SparkSession,只需使用SparkSession.buildeåçr():

  • 依赖
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
  • Drvier程序
//1.创建SparkSession
val spark = SparkSession.builder()
          .appName("hellosql")
          .master("local[10]")
.getOrCreate()

//2.引入改隐试转换 主要是 将 RDD 转换为 DataFrame/Dataset 
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL") 

//关闭spark
spark.stop()

Dataset

Dataset与RDD类似,但是它们不使用Java序列化或Kryo,而是使用专用的Encoder来序列化对象以便通过网络进行处理或传输。虽然Encoder和标准序列化都负责将对象转换为字节,但Encoder是动态生成的代码,并使用一种格式,允许Spark执行许多操作,如过滤,排序和散列,而无需将字节反序列化为对象。

  • case-class
case class Person(id:Int,name:String,age:Int,sex:Boolean)

val dataset: Dataset[Person] =List(Person(1,"zhangsan",18,true),Person(2,"wangwu",28,true)).toDS() 
dataset.select($"id",$"name").show()
  • Tuple元组
val dataset: Dataset[(Int,String,Int,Boolean)] = List((1,"zhangsan",18,true),(2,"wangwu",28,true)).toDS()
dataset.select($"_1",$"_2").show()

//或者
dataset.selectExpr("_1 as id","_2 as name","(_3 * 10) as age").show()
  • json数据
{"name":"张三","age":18} 
{"name":"lisi","age":28} 
{"name":"wangwu","age":38}
case class Person(id:Int,name:String,age:Int,sex:Boolean)
val dataset = spark.read.json("D:///Persion.json").as[Person]
dataset.show()
  • rdd

元组

val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
userRDD.toDS().show()
+---+----+----+---+-------+
| _1|  _2|  _3| _4|     _5|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+

case-class

val userRDD = spark.sparkContext.makeRDD(List(User(1,"张三",true,18,15000.0)))
userRDD.toDS().show()
+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+

DataFrame

Data Frame是命名列的数据集,他在概念是等价于关系型数据库。DataFrames可以从很多地方构建,比如说结构化数据文 件、hive中的表或者外部数据库,使用Dataset[row]的数据集,可以理解DataFrame就是一个Dataset[Row].

  • json文件
val frame = spark.read.json("file:///f:/person.json")
frame.show()
  • case-class
List(Person("zhangsan",18),Person("王五",20)).toDF("uname","uage").show()
  • Tuple元组
List(("zhangsan",18),("王五",20)).toDF("name","age").show()
  • RDD转换

Row

val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
                                    .map(t=>Row(t._1,t._2,t._3,t._4,t._5))
var schema=new StructType()
            .add("id","int")
            .add("name","string")
            .add("sex","boolean")
            .add("age","int")
            .add("salary","double")

spark.createDataFrame(userRDD,schema).show()
+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+

Javabean

val userRDD = spark.sparkContext.makeRDD(List(new User(1,"张三",true,18,15000.0)))

spark.createDataFrame(userRDD,classOf[User]).show()

提示:这里的User必须是JavaBean对象。

+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+

case-class

val userRDD = spark.sparkContext.makeRDD(List(User(1,"张三",true,18,15000.0)))

spark.createDataFrame(userRDD).show()
+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+

tuple元组

val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))

spark.createDataFrame(userRDD).show()
+---+----+----+---+-------+
| _1|  _2|  _3| _4|     _5|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+

Dataset/DataFrame API

有如下测试数据:

val users=List("1,Michael,false,29,2000",
      "2,Andy,true,30,5000",
      "3,Justin,true,19,1000",
      "4,Kaine,false,20,5000",
      "5,Lisa,false,19,1000")

val userRDD = spark.sparkContext.parallelize(users,3)
.map(line=>{
  val tokens = line.split(",")
  User(tokens(0).toInt,tokens(1),tokens(2).toBoolean,tokens(3).toInt,tokens(4).toDouble)
})

select

userRDD.toDF().select($"id",$"name",$"sex",$"salary"*12 as "年薪").show()
+---+-------+-----+-------+
| id|   name|  sex|   年薪|
+---+-------+-----+-------+
|  1|Michael|false|24000.0|
|  2|   Andy| true|60000.0|
|  3| Justin| true|12000.0|
|  4|  Kaine|false|60000.0|
|  5|   Lisa|false|12000.0|
+---+-------+-----+-------+

filter

userRDD.toDF()
  .select($"id",$"name",$"sex",$"salary"*12 as "年薪")
  .filter($"年薪">50000)
  .show()
+---+-----+-----+-------+
| id| name|  sex|   年薪|
+---+-----+-----+-------+
|  2| Andy| true|60000.0|
|  4|Kaine|false|60000.0|
+---+-----+-----+-------+

where

userRDD.toDF()
.select($"id",$"name",$"sex",$"salary"*12 as "annual_salary")
.where($"name" like "%i%" and $"年薪" > 12000) //等价 "name like '%i%' and annual_salary > 12000.0D"
.show()
+---+-------+-----+-------------+
| id|   name|  sex|annual_salary|
+---+-------+-----+-------------+
|  1|Michael|false|      24000.0|
|  4|  Kaine|false|      60000.0|
+---+-------+-----+-------------+

特别注意,目前Spark对别名中含有中文支持不是多么友好,在使用stringexpress的时候存在bug.

withColumn

userRDD.toDF()
  .select($"id",$"name",$"sex",$"salary"*12 as "annual_salary")
  .where("name like '%i%' and annual_salary > 12000.0D")
  .withColumn("annual_reward",$"annual_salary" * 0.8)
  .show()
+---+-------+-----+-------------+-------------+
| id|   name|  sex|annual_salary|annual_reward|
+---+-------+-----+-------------+-------------+
|  1|Michael|false|      24000.0|      19200.0|
|  4|  Kaine|false|      60000.0|      48000.0|
+---+-------+-----+-------------+-------------+

withColumnRenamed

var userDF=spark.sparkContext.parallelize(List(
 User(1,"张晓三",true,18,15000),
 User(2,"李晓四",true,18,18000),
 User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )

userDF.withColumnRenamed("name","username").show()
+---+--------+-----+---+-------+
| id|username|  sex|age| salary|
+---+--------+-----+---+-------+
|  1|  张晓三| true| 18|15000.0|
|  2|  李晓四| true| 18|18000.0|
|  3|  王晓五|false| 18|10000.0|
+---+--------+-----+---+-------+

groupBy

userRDD.toDF()
      .select($"id",$"name",$"sex",$"salary"*12 as "annual_salary")
      .groupBy($"sex")
      .avg("annual_salary")//等价 mean
      .show()
+-----+------------------+
|  sex|avg(annual_salary)|
+-----+------------------+
| true|           36000.0|
|false|           32000.0|
+-----+------------------+

聚合算子:sum、max、min、mean、avg

agg

userRDD.toDF()
      .select($"id",$"name",$"sex",$"salary")
      .groupBy($"sex")
      .agg(Map("salary"->"max","salary"->"mean"))
      .show()
+-----+------------------+
|  sex|       avg(salary)|
+-----+------------------+
| true|            3000.0|
|false|2666.6666666666665|
+-----+------------------+

等价用法

import org.apache.spark.sql.functions._
userRDD.toDF()
  .select($"id",$"name",$"sex",$"salary")
  .groupBy($"sex")
  .agg(avg($"salary") as "avg_salary",sum($"salary") as "total_salary" )
  .show()
+-----+------------------+------------+
|  sex|        avg_salary|total_salary|
+-----+------------------+------------+
| true|            3000.0|      6000.0|
|false|2666.6666666666665|      8000.0|
+-----+------------------+------------+

开窗函数

import org.apache.spark.sql.functions._

var w=Window.partitionBy("sex")
  .orderBy("salary")
  .rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)

userRDD.toDS().select("id","sex","name","salary")
  .withColumn("avg_salary",avg($"salary").over(w))
  .show()
+---+-----+-------+------+------------------+
| id|  sex|   name|salary|        avg_salary|
+---+-----+-------+------+------------------+
|  3| true| Justin|1000.0|            3000.0|
|  2| true|   Andy|5000.0|            3000.0|
|  5|false|   Lisa|1000.0|2666.6666666666665|
|  1|false|Michael|2000.0|2666.6666666666665|
|  4|false|  Kaine|5000.0|2666.6666666666665|
+---+-----+-------+------+------------------+

pivot

var studentRDD=spark.sparkContext.parallelize(List(
  UserCost(1,"电子类",100),
  UserCost(1,"母婴类",100),
  UserCost(1,"生活用品",100),
  UserCost(2,"美食",79),
  UserCost(2,"电子类",80),
  UserCost(2,"生活用品",100)
))
var category=studentRDD.map(_.category).collect().distinct

studentRDD.toDF()
  .groupBy("id")
  .pivot($"category",category)
  .sum("cost")
  .show()
+---+------+------+--------+----+
| id|电子类|母婴类|生活用品|美食|
+---+------+------+--------+----+
|  1| 100.0| 100.0|   100.0|null|
|  2|  80.0|  null|   100.0|79.0|
+---+------+------+--------+----+

na

studentRDD.toDF()
  .groupBy("id")
  .pivot($"category",category)
  .sum("cost")
  .na.fill(Map("母婴类"-> 0,"美食"-> -1))
  .show()
+---+------+------+--------+----+
| id|电子类|母婴类|生活用品|美食|
+---+------+------+--------+----+
|  1| 100.0| 100.0|   100.0|-1.0|
|  2|  80.0|   0.0|   100.0|79.0|
+---+------+------+--------+----+

na针对一些不存在的值得处理,其中处理方案 fill指定null的默认值,drop删除含有null的行

cube

spark.sparkContext.makeRDD(List((110,50,80),(120,60,95),(120,50,96))).toDF("height","weight","score")
  .cube($"height",$"weight")
  .avg("score")
  .show()
+------+------+-----------------+
|height|weight|       avg(score)|
+------+------+-----------------+
|   110|    50|             80.0|
|   120|  null|             95.5|
|   120|    60|             95.0|
|  null|    60|             95.0|
|  null|  null|90.33333333333333|
|   120|    50|             96.0|
|   110|  null|             80.0|
|  null|    50|             88.0|
+------+------+-----------------+

cube立体计算,计算多个维度的数据,以上分别计算了身高和weight的各种组合情况下学生的得分。

join

var userCostDF=spark.sparkContext.parallelize(List(
  UserCost(1,"电脑配件",100),
  UserCost(1,"母婴用品",100),
  UserCost(1,"生活用品",100),
  UserCost(2,"居家美食",79),
  UserCost(2,"消费电子",80),
  UserCost(2,"生活用品",100)
)).toDF().as("t_user_cost")

var userDF=spark.sparkContext.parallelize(List(
  User(1,"张晓三",true,18,15000),
  User(2,"李晓四",true,18,18000),
  User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )

var categorys=userCostDF.select("category").dropDuplicates().rdd.map(_.getAs[String](0)).collect()

userDF.join(userCostDF,"id")
  .groupBy($"id",$"name")
  .pivot($"category",categorys)
  .sum("cost")
  .withColumnRenamed("name","用户名")
  .na.fill(0)
  .show()
+---+------+--------+--------+--------+--------+--------+
| id|用户名|居家美食|母婴用品|生活用品|消费电子|电脑配件|
+---+------+--------+--------+--------+--------+--------+
|  1|张晓三|     0.0|   100.0|   100.0|     0.0|   100.0|
|  2|李晓四|    79.0|     0.0|   100.0|    80.0|     0.0|
+---+------+--------+--------+--------+--------+--------+

dropDuplicates

var userCostDF=spark.sparkContext.parallelize(List(
  UserCost(1,"电脑配件",100),
  UserCost(1,"母婴用品",100),
  UserCost(1,"生活用品",100),
  UserCost(2,"居家美食",79),
  UserCost(2,"消费电子",80),
  UserCost(2,"生活用品",100)
)).toDF().as("t_user_cost")

userCostDF.select($"category").dropDuplicates("category").show()
+--------+
|category|
+--------+
|居家美食|
|母婴用品|
|生活用品|
|消费电子|
|电脑配件|
+--------+

drop

var userDF=spark.sparkContext.parallelize(List(
  User(1,"张晓三",true,18,15000),
  User(2,"李晓四",true,18,18000),
  User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )

userDF.drop($"sex").show()
+---+------+---+-------+
| id|  name|age| salary|
+---+------+---+-------+
|  1|张晓三| 18|15000.0|
|  2|李晓四| 18|18000.0|
|  3|王晓五| 18|10000.0|
+---+------+---+-------+

map

var userDF=spark.sparkContext.parallelize(List(
  User(1,"张晓三",true,18,15000),
  User(2,"李晓四",true,18,18000),
  User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )

userDF.withColumnRenamed("name","username")
.map(row=>(row.getAs[String]("username"),row.getAs[Int]("id")))
.show()
+------+---+
|    _1| _2|
+------+---+
|张晓三|  1|
|李晓四|  2|
|王晓五|  3|
+------+---+

flatMap

var df=spark.sparkContext.parallelize(List((1,"TV,GAME"),(2,"SLEEP,FOOT"))).toDF("id","hobbies")

df.flatMap(row=> row.getAs[String]("hobbies").split(",").map(t=>(row.getAs[Int]("id"),t)))
.toDF("id","hobby")
.show()
+---+-----+
| id|hobby|
+---+-----+
|  1|   TV|
|  1| GAME|
|  2|SLEEP|
|  2| FOOT|
+---+-----+

orderBy

val df=spark.sparkContext.parallelize(List((1,"TV,GAME"),(2,"SLEEP,FOOT"))).toDF("id","hobbies")
df.flatMap(row=> row.getAs[String]("hobbies").split(",").map(t=>(row.getAs[Int]("id"),t)))
  .toDF("id","hobby")
  .orderBy($"id" desc ,$"hobby" asc ) //等价sort
  .show()
+---+-----+
| id|hobby|
+---+-----+
|  2| FOOT|
|  2|SLEEP|
|  1| GAME|
|  1|   TV|
+---+-----+

limit

val df=spark.sparkContext.parallelize(List((1,"TV,GAME"),(2,"SLEEP,FOOT"))).toDF("id","hobbies")
df.flatMap(row=> row.getAs[String]("hobbies").split(",").map(t=>(row.getAs[Int]("id"),t))).toDF("id","hobby")
.orderBy($"id" desc ,$"hobby" asc ) //等价sort
.limit(3)
.show()
+---+-----+
| id|hobby|
+---+-----+
|  2| FOOT|
|  2|SLEEP|
|  1| GAME|
+---+-----+

Dataset/DataFrame SQL

Spark支持SQL查询,需要用户在使用之前创建视图。

Employee

Michael,29,20000,true,MANAGER,1
Andy,30,15000,true,SALESMAN,1
Justin,19,8000,true,CLERK,1
Kaine,20,20000,true,MANAGER,2
Lisa,19,18000,false,SALESMAN,2

Dept

1,研发
2,设计
3,产品
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
  .map(line => line.split(","))
  .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
  .toDS()

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
      .map(line => line.split(","))
      .map(t => Dept(t(0).toInt, t(1)))
      .toDS()

//注册视图
deptDS.createOrReplaceTempView("t_dept")
userDS.createOrReplaceTempView("t_user")

SQL查询

spark.sql("select *, salary * 12 as annual_salary from t_user").show()
+-------+---+-------+-----+--------+------+-------------+
|   name|age| salary|  sex|     job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER|     1|     240000.0|
|   Andy| 30|15000.0| true|SALESMAN|     1|     180000.0|
| Justin| 19| 8000.0| true|   CLERK|     1|      96000.0|
|  Kaine| 20|20000.0| true| MANAGER|     2|     240000.0|
|   Lisa| 19|18000.0|false|SALESMAN|     2|     216000.0|
+-------+---+-------+-----+--------+------+-------------+

like模糊

spark.sql("select * from t_user where deptNo=1 and (name like '%cha%' or name like '%us%')").show()
|   name|age|slalary| sex|    job|deptNo|
+-------+---+-------+----+-------+------+
|Michael| 29|20000.0|true|MANAGER|     1|
| Justin| 19| 8000.0|true|  CLERK|     1|
+-------+---+-------+----+-------+------+

排序查询

spark.sql("select * from t_user  order by salary desc,name desc ").show()
+-------+---+-------+-----+--------+------+
|   name|age| salary|  sex|     job|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER|     1|
|  Kaine| 20|20000.0| true| MANAGER|     2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|
|   Andy| 30|15000.0| true|SALESMAN|     1|
| Justin| 19| 8000.0| true|   CLERK|     1|
+-------+---+-------+-----+--------+------+

limit查询

spark.sql("select * from t_user  order by salary desc,name desc limit 3").show()
|   name|age| salary|  sex|     job|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER|     1|
|  Kaine| 20|20000.0| true| MANAGER|     2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|
+-------+---+-------+-----+--------+------+

分组查询

spark.sql("select deptNo,count(*) as  total , avg(salary) avg_salary from t_user group by deptNo").show()
|deptNo|total|        avg_salary|
+------+-----+------------------+
|     1|    3|14333.333333333334|
|     2|    2|           19000.0|
+------+-----+------------------+

having子句

var sql="select deptNo,count(*) as total,avg(salary) avg_salary from t_user group by deptNo having avg_salary > 15000"
spark.sql(sql).show()
+------+-----+----------+
|deptNo|total|avg_salary|
+------+-----+----------+
|     2|    2|   19000.0|
+------+-----+----------+

case-when

var sql="select name,salary, case sex when true then '男' when false then '女' else '未知' end as sex from t_user"
spark.sql(sql).show()
+-------+-------+---+
|   name| salary|sex|
+-------+-------+---+
|Michael|20000.0| 男|
|   Andy|15000.0| 男|
| Justin| 8000.0| 男|
|  Kaine|20000.0| 男|
|   Lisa|18000.0| 女|
+-------+-------+---+

等价写法"select name,salary, case when sex=true then '男' when sex=false then '女' else '未知' end as sex from t_user"别名不能是中文

行专列

case-when

val coursedf = spark.sparkContext.parallelize(List(
  StudentCourse(1, "语文", 100),
  StudentCourse(1, "数学", 100),
  StudentCourse(1, "英语", 100),
  StudentCourse(2, "数学", 79),
  StudentCourse(2, "语文", 80),
  StudentCourse(2, "英语", 100)
)).toDF()

coursedf.createOrReplaceTempView("t_course")
spark.sql("select id,max(case course when '语文' then score else 0 end) as chinese,max(case course when '数学' then score else 0 end ) as math,max(case course when '英语' then score else 0 end ) as english from t_course group by id ").show()
+---+-------+-----+-------+
| id|chinese| math|english|
+---+-------+-----+-------+
|  1|  100.0|100.0|  100.0|
|  2|   80.0| 79.0|  100.0|
+---+-------+-----+-------+

pivot

val userCostDF = spark.sparkContext.parallelize(List(
  StudentCourse(1, "语文", 100),
  StudentCourse(1, "数学", 100),
  StudentCourse(1, "英语", 100),
  StudentCourse(2, "数学", 79),
  StudentCourse(2, "语文", 80),
  StudentCourse(2, "英语", 100)
)).toDF()

userCostDF.createOrReplaceTempView("t_course")

spark.sql("select * from t_course pivot(max(score) for course in ('数学','语文','英语')) ")
.na.fill(0).show()
+---+-----+-----+-----+
| id| 数学| 语文| 英语|
+---+-----+-----+-----+
|  1|100.0|100.0|100.0|
|  2| 79.0| 80.0|100.0|
+---+-----+-----+-----+

这里需要注意,在书写SQL的时候除去聚合字段和输出列明字段,其他字段作为groupby后的隐藏字段。

表连接

spark.sql("select * from t_user left join t_dept on deptNo=id").show()
+-------+---+-------+-----+--------+------+---+----+
|   name|age| salary|  sex|     job|deptNo| id|name|
+-------+---+-------+-----+--------+------+---+----+
|Michael| 29|20000.0| true| MANAGER|     1|  1|研发|
|   Andy| 30|15000.0| true|SALESMAN|     1|  1|研发|
| Justin| 19| 8000.0| true|   CLERK|     1|  1|研发|
|  Kaine| 20|20000.0| true| MANAGER|     2|  2|设计|
|   Lisa| 19|18000.0|false|SALESMAN|     2|  2|设计|
+-------+---+-------+-----+--------+------+---+----+

spark支持inner join、left outer、right outer、full outer join连接

子查询

spark.sql("select * ,(select count(t1.salary) from t_user t1 where (t1.deptNo = t2.deptNo)  group by t1.deptNo) as total from t_user t2 left join t_dept on t2.deptNo=id order by t2.deptNo asc,t2.salary desc").show()
+-------+---+-------+-----+--------+------+---+----+-----+
|   name|age| salary|  sex|     job|deptNo| id|name|total|
+-------+---+-------+-----+--------+------+---+----+-----+
|Michael| 29|20000.0| true| MANAGER|     1|  1|研发|    3|
|   Andy| 30|15000.0| true|SALESMAN|     1|  1|研发|    3|
| Justin| 19| 8000.0| true|   CLERK|     1|  1|研发|    3|
|  Kaine| 20|20000.0| true| MANAGER|     2|  2|设计|    2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|  2|设计|    2|
+-------+---+-------+-----+--------+------+---+----+-----+

注意在spark中仅仅支持内嵌子查询=查询

开窗函数

在正常的统计分析中 ,通常使用聚合函数作为分析,聚合分析函数的特点是将n行记录合并成一行,在数据库的统计当中 还有一种统计称为开窗统计,开窗函数可以实现将一行变成多行。可以将数据库查询的每一条记录比作是一幢高楼的一 层, 开窗函数就是在每一层开一扇窗, 让每一层能看到整装楼的全貌或一部分。

  • 输出员工信息以及所在部门的平均薪资
spark.sql("select u.name,u.deptNo,u.salary,d.name,avg(salary) over(partition by deptNo ) avg from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+------------------+
|   name|deptNo| salary|name|               avg|
+-------+------+-------+----+------------------+
|Michael|     1|20000.0|研发|14333.333333333334|
|   Andy|     1|15000.0|研发|14333.333333333334|
| Justin|     1| 8000.0|研发|14333.333333333334|
|  Kaine|     2|20000.0|设计|           19000.0|
|   Lisa|     2|18000.0|设计|           19000.0|
+-------+------+-------+----+------------------+

等价写法

select u2.name,u2.deptNo,u2.salary,d.name,(select avg(salary) from t_user u1 where u1.deptNo=u2.deptNo group by u1.deptNo) avgSalary from t_user u2 left join t_dept d on deptNo=id
  • 计算员工在自己部门的薪资排序
spark.sql("select u.name,u.deptNo,u.salary,d.name,sum(1) over(partition by deptNo order by salary desc) rank from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
|Michael|     1|20000.0|研发|   1|
|   Andy|     1|15000.0|研发|   2|
| Justin|     1| 8000.0|研发|   3|
|  Kaine|     2|20000.0|设计|   1|
|   Lisa|     2|18000.0|设计|   2|
+-------+------+-------+----+----+

其中sum(1)可以替换成ROW_NUM()

  • 计算员工在公司的薪资排名
spark.sql("select t_user.name,deptNo,salary,t_dept.name,ROW_NUMBER() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin|     1| 8000.0|研发|   5|
|   Andy|     1|15000.0|研发|   4|
|Michael|     1|20000.0|研发|   1|
|  Kaine|     2|20000.0|设计|   2|
|   Lisa|     2|18000.0|设计|   3|
+-------+------+-------+----+----+

可以看出,ROW_NUMBER()只表示数据出库的顺序,无法比较真正的顺序。因此一般在做排名的时候一般使用RANK()或者DENSE_RANK()函数。

  • 使用RANK函数实现薪资排名(序号不连续)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin|     1| 8000.0|研发|   5|
|   Andy|     1|15000.0|研发|   4|
|Michael|     1|20000.0|研发|   1|
|  Kaine|     2|20000.0|设计|   1|
|   Lisa|     2|18000.0|设计|   3|
+-------+------+-------+----+----+
  • 使用DENSE_RANK函数实现薪资排名(序号连续)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,DENSE_RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin|     1| 8000.0|研发|   4|
|   Andy|     1|15000.0|研发|   3|
|Michael|     1|20000.0|研发|   1|
|  Kaine|     2|20000.0|设计|   1|
|   Lisa|     2|18000.0|设计|   2|
+-------+------+-------+----+----+
  • 计算每个部门和本部门最高薪资的差值
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- max(salary) over(partition by deptNo order by salary desc rows between  unbounded preceding and current row)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+--------+
|   name|deptNo| salary|name|    diff|
+-------+------+-------+----+--------+
|Michael|     1|20000.0|研发|     0.0|
|   Andy|     1|15000.0|研发| -5000.0|
| Justin|     1| 8000.0|研发|-12000.0|
|  Kaine|     2|20000.0|设计|     0.0|
|   Lisa|     2|18000.0|设计| -2000.0|
+-------+------+-------+----+--------+
  • 计算本部门的员工工资和最小工资的差值
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- min(salary) over(partition by deptNo order by salary desc rows between  current row and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
|   name|deptNo| salary|name|   diff|
+-------+------+-------+----+-------+
| Justin|     1| 8000.0|研发|    0.0|
|   Andy|     1|15000.0|研发| 7000.0|
|Michael|     1|20000.0|研发|12000.0|
|  Kaine|     2|20000.0|设计| 2000.0|
|   Lisa|     2|18000.0|设计|    0.0|
+-------+------+-------+----+-------+
  • 计算本部门员工和公司平均薪资的差值
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- avg(salary) over(order by salary  rows between  unbounded preceding and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
|   name|deptNo| salary|name|   diff|
+-------+------+-------+----+-------+
|Michael|     1|20000.0|研发| 3800.0|
|   Andy|     1|15000.0|研发|-1200.0|
| Justin|     1| 8000.0|研发|-8200.0|
|   Lisa|     2|18000.0|设计| 1800.0|
|  Kaine|     2|20000.0|设计| 3800.0|
+-------+------+-------+----+-------+

总结

聚合函数(字段) over ([partition by 字段] order by 字段 asc [rows between 起始行偏移量 and 终止偏移量] )

其中:偏移量的取值

preceding:用于累加前N行(分区之内)。若是从分区第一行头开始,则为 unbounded。 N为:相对当前行向前的偏移量。
following:与preceding相反,累加后N行(分区之内)。若是累加到该分区结束则为unbounded。N为:相对当前行向后的偏移量
current row:顾名思义,当前行,偏移量为0

说明:上边的前N,后M,以及current row均会累加该偏移量所在行

自定义函数

单行函数

更具员工的职位,计算年薪

spark.udf.register("annual_salary",(job:String,salary:Double)=>{
  job match {
    case "MANAGER" => salary*12 + 5000000
    case "SALESMAN" => salary*12 + 100000
    case "CLERK" => salary*12 + 20000
    case _ => salary*12
  }
})
spark.sql("select * ,annual_salary(job,salary) annual_salary from t_user").show()
+-------+---+-------+-----+--------+------+-------------+
|   name|age| salary|  sex|     job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER|     1|    5240000.0|
|   Andy| 30|15000.0| true|SALESMAN|     1|     280000.0|
| Justin| 19| 8000.0| true|   CLERK|     1|     116000.0|
|  Kaine| 20|20000.0| true| MANAGER|     2|    5240000.0|
|   Lisa| 19|18000.0|false|SALESMAN|     2|     316000.0|
+-------+---+-------+-----+--------+------+-------------+

如果使用API形式调用可以使用:

userDS.selectExpr("name","age","job","salary","annual_salary(job,salary) as annual_salary").show()

聚合函数(了解)

有类型聚合(SQL)

自定义UserDefinedAggregateFunction实现类

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, StructType}

class DeptSalarySum  extends UserDefinedAggregateFunction{
  //说明输入的Schema
  override def inputSchema: StructType = {
    new StructType().add("salary","double")
  }
  //缓冲临时变量
  override def bufferSchema: StructType = {
    new StructType().add("total","double")
  }
  //返回值类型
  override def dataType: DataType = {
    DoubleType
  }
  //表示系统给定的一组输入,总有固定的输出类型
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0,0.0D)
  }

  //局部计算
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val history = buffer.getDouble(0)
    buffer.update(0,history + input.getAs[Double](0))
  }
  //中间结果聚合
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0,buffer1.getDouble(0)+buffer2.getDouble(0))
  }
  //返回最终结果
  override def evaluate(buffer: Row): Any = {
    buffer.getDouble(0)
  }
}

注册聚合类

spark.udf.register("deptSalarySum",new DeptSalarySum)

调用聚合函数

spark.sql("select deptNo,deptSalarySum(salary) as salary from t_user group by deptNo").show()
+------+---------------------+
|deptNo|deptsalarysum(salary)|
+------+---------------------+
|     1|              43000.0|
|     2|              38000.0|
+------+---------------------+
强类型聚合(API)

强类型聚合用户需要实现Aggregate接口同时需要定制State变量用于存储聚合过程中的中间变量。用于实现针对Dataset数据的集合的聚合。

DeptSalarySate

class DeptSalarySate(var initValue:Double=0 ) {
  def add(v:Double): Unit ={
    initValue += v
  }
  def get():Double={
    initValue
  }
}

DeptSalaryAggregator

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

class DeptSalaryAggregator extends Aggregator[Employee,DeptSalarySate,Double]{
  override def zero: DeptSalarySate = new DeptSalarySate()

  override def reduce(b: DeptSalarySate, a: Employee): DeptSalarySate ={
    b.add(a.salary)
    b
  }

  override def merge(b1: DeptSalarySate, b2: DeptSalarySate): DeptSalarySate = {
    b1.add(b2.get())
    b1
  }

  override def finish(reduction: DeptSalarySate): Double = {
    reduction.get()
  }

  override def bufferEncoder: Encoder[DeptSalarySate] = {
    //Encoders.product[DeptSalarySate]
    Encoders.kryo(classOf[DeptSalarySate])
  }

  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

测试程序

var myagg=new DeptSalaryAggregator().toColumn.name("avg")
userDS.select(myagg).show()
+-------+
|    avg|
+-------+
|81000.0|
+-------+

Load/Save

paquet文件

生成parquet文件

Parquet仅仅是一种存储格式,它是语言、平台无关的,并且不需要和任何一种数据处理框架绑定.

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
      .map(line => line.split(","))
      .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
      .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
      .map(line => line.split(","))
      .map(t => Dept(t(0).toInt, t(1)))
      .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
      .withColumn("annual_salary",$"salary"*12)
      .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
      .drop("id")
      .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
      .write
      .save("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")

spark.close()

读取Parquet文件

val spark = SparkSession.builder()
                .master("local[6]")
                .appName("sql")
                .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
//spark.read.parquet 等价
spark.read.load("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")
      .show()

spark.close()
+-------+-----+-------+-------+-------------+---------+
|   name|  sex| salary|dept_no|annual_salary|dept_name|
+-------+-----+-------+-------+-------------+---------+
|Michael| true|20000.0|      1|     240000.0|     研发|
|   Andy| true|15000.0|      1|     180000.0|     研发|
| Justin| true| 8000.0|      1|      96000.0|     研发|
|  Kaine| true|20000.0|      2|     240000.0|     设计|
|   Lisa|false|18000.0|      2|     216000.0|     设计|
+-------+-----+-------+-------+-------------+---------+

Json格式

产生json格式

val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode(SaveMode.Ignore)
    .format("json")
    .save("file:///Users/jiangzz/Desktop/results/json/")

spark.close()

读取Json格式

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._

spark.read.json("file:///Users/jiangzz/Desktop/results/json/")
    .show()

spark.close()

csv格式

生成

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode(SaveMode.Ignore)
    .format("csv")
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .save("file:///Users/jiangzz/Desktop/results/csv/")

spark.close()

读取

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")

spark.read
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("file:///Users/jiangzz/Desktop/results/csv/")
    .show()

spark.close()

ORC格式

ORC的全称是(Optimized Row Columnar),ORC文件格式是一种Hadoop生态圈中的列式存储格式,它的产生早在2013年初,最初产生自Apache Hive,用于降低Hadoop数据存储空间和加速Hive查询速度。

生成

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode(SaveMode.Ignore)
    .format("orc")
    .option("orc.bloom.filter.columns", "favorite_color")
    .option("orc.dictionary.key.threshold", "1.0")
    .save("file:///Users/jiangzz/Desktop/results/orc/")

spark.close()

读取

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

spark.read
    .option("orc.bloom.filter.columns", "favorite_color")
    .option("orc.dictionary.key.threshold", "1.0")
    .orc("file:///Users/jiangzz/Desktop/results/orc/")
    .show()

spark.close()

SQL读取文件

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")

val parqeutDF = spark.sql("SELECT * FROM parquet.`file:///Users/jiangzz/Desktop/results/parquet/results.parquet`")
val jsonDF = spark.sql("SELECT * FROM json.`file:///Users/jiangzz/Desktop/results/json/`")
val orcDF = spark.sql("SELECT * FROM orc.`file:///Users/jiangzz/Desktop/results/orc/`")
//parqeutDF.show()
//jsonDF.show()
//csvDF.show()
orcDF.show()
spark.close()

JDBC数据库读取

读取MysQL

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://CentOS:3306/test")
    .option("dbtable", "t_user")
    .option("user", "root")
    .option("password", "root")
    .load().createTempView("t_user")


spark.sql("select * from t_user").show()

spark.close()

写入MySQL

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

val props = new Properties()
    props.put("user", "root")
    props.put("password", "root")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode("append")
		.jdbc("jdbc:mysql://CentOS:3306/test","t_user",props)

spark.close()