Spark SQL 编程指南
Spark SQL是用于结构化数据处理的一个模块。同Spark RDD 不同地方在于Spark SQL的API可以给Spark计算引擎提供更多地 信息,例如:数据结构、计算算子等。在内部Spark可以通过这些信息有针对对任务做优化和调整。这里有几种方式和Spark SQL进行交互,例如Dataset API和SQL等,这两种API可以混合使用。Spark SQL的一个用途是执行SQL查询。 Spark SQL还可用于从现有Hive安装中读取数据。从其他编程语言中运行SQL时,结果将作为Dataset/DataFrame返回,使用命令 行或JDBC / ODBC与SQL接口进行交互。
Dataset是一个分布式数据集合在Spark 1.6提供一个新的接口,Dataset提供RDD的优势(强类型,使用强大的lambda函 数)以及具备了Spark SQL执行引擎的优点。Dataset可以通过JVM对象构建,然后可以使用转换函数等(例如:map、flatMap、filter等),目前Dataset API支持Scala和Java 目前Python对Dataset支持还不算完备。
DataFrame是命名列的数据集,他在概念是等价于关系型数据库。DataFrames可以从很多地方构建,比如说结构化数据文 件、hive中的表或者外部数据库,使用Dataset[row]的数据集,可以理解DataFrame就是一个Dataset[Row].
SparkSession
Spark中所有功能的入口点是SparkSession类。要创建基本的SparkSession,只需使用SparkSession.buildeåçr():
- 依赖
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
- Drvier程序
//1.创建SparkSession
val spark = SparkSession.builder()
.appName("hellosql")
.master("local[10]")
.getOrCreate()
//2.引入改隐试转换 主要是 将 RDD 转换为 DataFrame/Dataset
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
//关闭spark
spark.stop()
Dataset
Dataset与RDD类似,但是它们不使用Java序列化或Kryo,而是使用专用的Encoder来序列化对象以便通过网络进行处理或传输。虽然Encoder和标准序列化都负责将对象转换为字节,但Encoder是动态生成的代码,并使用一种格式,允许Spark执行许多操作,如过滤,排序和散列,而无需将字节反序列化为对象。
- case-class
case class Person(id:Int,name:String,age:Int,sex:Boolean)
val dataset: Dataset[Person] =List(Person(1,"zhangsan",18,true),Person(2,"wangwu",28,true)).toDS()
dataset.select($"id",$"name").show()
- Tuple元组
val dataset: Dataset[(Int,String,Int,Boolean)] = List((1,"zhangsan",18,true),(2,"wangwu",28,true)).toDS()
dataset.select($"_1",$"_2").show()
//或者
dataset.selectExpr("_1 as id","_2 as name","(_3 * 10) as age").show()
- json数据
{"name":"张三","age":18}
{"name":"lisi","age":28}
{"name":"wangwu","age":38}
case class Person(id:Int,name:String,age:Int,sex:Boolean)
val dataset = spark.read.json("D:///Persion.json").as[Person]
dataset.show()
- rdd
元组
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
userRDD.toDS().show()
+---+----+----+---+-------+
| _1| _2| _3| _4| _5|
+---+----+----+---+-------+
| 1|张三|true| 18|15000.0|
+---+----+----+---+-------+
case-class
val userRDD = spark.sparkContext.makeRDD(List(User(1,"张三",true,18,15000.0)))
userRDD.toDS().show()
+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
| 1|张三|true| 18|15000.0|
+---+----+----+---+-------+
DataFrame
Data Frame是命名列的数据集,他在概念是等价于关系型数据库。DataFrames可以从很多地方构建,比如说结构化数据文 件、hive中的表或者外部数据库,使用Dataset[row]的数据集,可以理解DataFrame就是一个Dataset[Row].
- json文件
val frame = spark.read.json("file:///f:/person.json")
frame.show()
- case-class
List(Person("zhangsan",18),Person("王五",20)).toDF("uname","uage").show()
- Tuple元组
List(("zhangsan",18),("王五",20)).toDF("name","age").show()
- RDD转换
Row
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
.map(t=>Row(t._1,t._2,t._3,t._4,t._5))
var schema=new StructType()
.add("id","int")
.add("name","string")
.add("sex","boolean")
.add("age","int")
.add("salary","double")
spark.createDataFrame(userRDD,schema).show()
+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
| 1|张三|true| 18|15000.0|
+---+----+----+---+-------+
Javabean
val userRDD = spark.sparkContext.makeRDD(List(new User(1,"张三",true,18,15000.0)))
spark.createDataFrame(userRDD,classOf[User]).show()
提示:这里的User
必须是JavaBean对象。
+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
| 1|张三|true| 18|15000.0|
+---+----+----+---+-------+
case-class
val userRDD = spark.sparkContext.makeRDD(List(User(1,"张三",true,18,15000.0)))
spark.createDataFrame(userRDD).show()
+---+----+----+---+-------+
| id|name| sex|age| salary|
+---+----+----+---+-------+
| 1|张三|true| 18|15000.0|
+---+----+----+---+-------+
tuple元组
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
spark.createDataFrame(userRDD).show()
+---+----+----+---+-------+
| _1| _2| _3| _4| _5|
+---+----+----+---+-------+
| 1|张三|true| 18|15000.0|
+---+----+----+---+-------+
Dataset/DataFrame API
有如下测试数据:
val users=List("1,Michael,false,29,2000",
"2,Andy,true,30,5000",
"3,Justin,true,19,1000",
"4,Kaine,false,20,5000",
"5,Lisa,false,19,1000")
val userRDD = spark.sparkContext.parallelize(users,3)
.map(line=>{
val tokens = line.split(",")
User(tokens(0).toInt,tokens(1),tokens(2).toBoolean,tokens(3).toInt,tokens(4).toDouble)
})
select
userRDD.toDF().select($"id",$"name",$"sex",$"salary"*12 as "年薪").show()
+---+-------+-----+-------+
| id| name| sex| 年薪|
+---+-------+-----+-------+
| 1|Michael|false|24000.0|
| 2| Andy| true|60000.0|
| 3| Justin| true|12000.0|
| 4| Kaine|false|60000.0|
| 5| Lisa|false|12000.0|
+---+-------+-----+-------+
filter
userRDD.toDF()
.select($"id",$"name",$"sex",$"salary"*12 as "年薪")
.filter($"年薪">50000)
.show()
+---+-----+-----+-------+
| id| name| sex| 年薪|
+---+-----+-----+-------+
| 2| Andy| true|60000.0|
| 4|Kaine|false|60000.0|
+---+-----+-----+-------+
where
userRDD.toDF()
.select($"id",$"name",$"sex",$"salary"*12 as "annual_salary")
.where($"name" like "%i%" and $"年薪" > 12000) //等价 "name like '%i%' and annual_salary > 12000.0D"
.show()
+---+-------+-----+-------------+
| id| name| sex|annual_salary|
+---+-------+-----+-------------+
| 1|Michael|false| 24000.0|
| 4| Kaine|false| 60000.0|
+---+-------+-----+-------------+
特别注意,目前Spark对别名中含有中文支持不是多么友好,在使用stringexpress的时候存在bug.
withColumn
userRDD.toDF()
.select($"id",$"name",$"sex",$"salary"*12 as "annual_salary")
.where("name like '%i%' and annual_salary > 12000.0D")
.withColumn("annual_reward",$"annual_salary" * 0.8)
.show()
+---+-------+-----+-------------+-------------+
| id| name| sex|annual_salary|annual_reward|
+---+-------+-----+-------------+-------------+
| 1|Michael|false| 24000.0| 19200.0|
| 4| Kaine|false| 60000.0| 48000.0|
+---+-------+-----+-------------+-------------+
withColumnRenamed
var userDF=spark.sparkContext.parallelize(List(
User(1,"张晓三",true,18,15000),
User(2,"李晓四",true,18,18000),
User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )
userDF.withColumnRenamed("name","username").show()
+---+--------+-----+---+-------+
| id|username| sex|age| salary|
+---+--------+-----+---+-------+
| 1| 张晓三| true| 18|15000.0|
| 2| 李晓四| true| 18|18000.0|
| 3| 王晓五|false| 18|10000.0|
+---+--------+-----+---+-------+
groupBy
userRDD.toDF()
.select($"id",$"name",$"sex",$"salary"*12 as "annual_salary")
.groupBy($"sex")
.avg("annual_salary")//等价 mean
.show()
+-----+------------------+
| sex|avg(annual_salary)|
+-----+------------------+
| true| 36000.0|
|false| 32000.0|
+-----+------------------+
聚合算子:sum、max、min、mean、avg
agg
userRDD.toDF()
.select($"id",$"name",$"sex",$"salary")
.groupBy($"sex")
.agg(Map("salary"->"max","salary"->"mean"))
.show()
+-----+------------------+
| sex| avg(salary)|
+-----+------------------+
| true| 3000.0|
|false|2666.6666666666665|
+-----+------------------+
等价用法
import org.apache.spark.sql.functions._
userRDD.toDF()
.select($"id",$"name",$"sex",$"salary")
.groupBy($"sex")
.agg(avg($"salary") as "avg_salary",sum($"salary") as "total_salary" )
.show()
+-----+------------------+------------+
| sex| avg_salary|total_salary|
+-----+------------------+------------+
| true| 3000.0| 6000.0|
|false|2666.6666666666665| 8000.0|
+-----+------------------+------------+
开窗函数
import org.apache.spark.sql.functions._
var w=Window.partitionBy("sex")
.orderBy("salary")
.rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)
userRDD.toDS().select("id","sex","name","salary")
.withColumn("avg_salary",avg($"salary").over(w))
.show()
+---+-----+-------+------+------------------+
| id| sex| name|salary| avg_salary|
+---+-----+-------+------+------------------+
| 3| true| Justin|1000.0| 3000.0|
| 2| true| Andy|5000.0| 3000.0|
| 5|false| Lisa|1000.0|2666.6666666666665|
| 1|false|Michael|2000.0|2666.6666666666665|
| 4|false| Kaine|5000.0|2666.6666666666665|
+---+-----+-------+------+------------------+
pivot
var studentRDD=spark.sparkContext.parallelize(List(
UserCost(1,"电子类",100),
UserCost(1,"母婴类",100),
UserCost(1,"生活用品",100),
UserCost(2,"美食",79),
UserCost(2,"电子类",80),
UserCost(2,"生活用品",100)
))
var category=studentRDD.map(_.category).collect().distinct
studentRDD.toDF()
.groupBy("id")
.pivot($"category",category)
.sum("cost")
.show()
+---+------+------+--------+----+
| id|电子类|母婴类|生活用品|美食|
+---+------+------+--------+----+
| 1| 100.0| 100.0| 100.0|null|
| 2| 80.0| null| 100.0|79.0|
+---+------+------+--------+----+
na
studentRDD.toDF()
.groupBy("id")
.pivot($"category",category)
.sum("cost")
.na.fill(Map("母婴类"-> 0,"美食"-> -1))
.show()
+---+------+------+--------+----+
| id|电子类|母婴类|生活用品|美食|
+---+------+------+--------+----+
| 1| 100.0| 100.0| 100.0|-1.0|
| 2| 80.0| 0.0| 100.0|79.0|
+---+------+------+--------+----+
na针对一些不存在的值得处理,其中处理方案 fill指定null的默认值,drop删除含有null的行
cube
spark.sparkContext.makeRDD(List((110,50,80),(120,60,95),(120,50,96))).toDF("height","weight","score")
.cube($"height",$"weight")
.avg("score")
.show()
+------+------+-----------------+
|height|weight| avg(score)|
+------+------+-----------------+
| 110| 50| 80.0|
| 120| null| 95.5|
| 120| 60| 95.0|
| null| 60| 95.0|
| null| null|90.33333333333333|
| 120| 50| 96.0|
| 110| null| 80.0|
| null| 50| 88.0|
+------+------+-----------------+
cube立体计算,计算多个维度的数据,以上分别计算了身高和weight的各种组合情况下学生的得分。
join
var userCostDF=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"母婴用品",100),
UserCost(1,"生活用品",100),
UserCost(2,"居家美食",79),
UserCost(2,"消费电子",80),
UserCost(2,"生活用品",100)
)).toDF().as("t_user_cost")
var userDF=spark.sparkContext.parallelize(List(
User(1,"张晓三",true,18,15000),
User(2,"李晓四",true,18,18000),
User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )
var categorys=userCostDF.select("category").dropDuplicates().rdd.map(_.getAs[String](0)).collect()
userDF.join(userCostDF,"id")
.groupBy($"id",$"name")
.pivot($"category",categorys)
.sum("cost")
.withColumnRenamed("name","用户名")
.na.fill(0)
.show()
+---+------+--------+--------+--------+--------+--------+
| id|用户名|居家美食|母婴用品|生活用品|消费电子|电脑配件|
+---+------+--------+--------+--------+--------+--------+
| 1|张晓三| 0.0| 100.0| 100.0| 0.0| 100.0|
| 2|李晓四| 79.0| 0.0| 100.0| 80.0| 0.0|
+---+------+--------+--------+--------+--------+--------+
dropDuplicates
var userCostDF=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"母婴用品",100),
UserCost(1,"生活用品",100),
UserCost(2,"居家美食",79),
UserCost(2,"消费电子",80),
UserCost(2,"生活用品",100)
)).toDF().as("t_user_cost")
userCostDF.select($"category").dropDuplicates("category").show()
+--------+
|category|
+--------+
|居家美食|
|母婴用品|
|生活用品|
|消费电子|
|电脑配件|
+--------+
drop
var userDF=spark.sparkContext.parallelize(List(
User(1,"张晓三",true,18,15000),
User(2,"李晓四",true,18,18000),
User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )
userDF.drop($"sex").show()
+---+------+---+-------+
| id| name|age| salary|
+---+------+---+-------+
| 1|张晓三| 18|15000.0|
| 2|李晓四| 18|18000.0|
| 3|王晓五| 18|10000.0|
+---+------+---+-------+
map
var userDF=spark.sparkContext.parallelize(List(
User(1,"张晓三",true,18,15000),
User(2,"李晓四",true,18,18000),
User(3,"王晓五",false,18,10000)
)).toDF().as("t_user" )
userDF.withColumnRenamed("name","username")
.map(row=>(row.getAs[String]("username"),row.getAs[Int]("id")))
.show()
+------+---+
| _1| _2|
+------+---+
|张晓三| 1|
|李晓四| 2|
|王晓五| 3|
+------+---+
flatMap
var df=spark.sparkContext.parallelize(List((1,"TV,GAME"),(2,"SLEEP,FOOT"))).toDF("id","hobbies")
df.flatMap(row=> row.getAs[String]("hobbies").split(",").map(t=>(row.getAs[Int]("id"),t)))
.toDF("id","hobby")
.show()
+---+-----+
| id|hobby|
+---+-----+
| 1| TV|
| 1| GAME|
| 2|SLEEP|
| 2| FOOT|
+---+-----+
orderBy
val df=spark.sparkContext.parallelize(List((1,"TV,GAME"),(2,"SLEEP,FOOT"))).toDF("id","hobbies")
df.flatMap(row=> row.getAs[String]("hobbies").split(",").map(t=>(row.getAs[Int]("id"),t)))
.toDF("id","hobby")
.orderBy($"id" desc ,$"hobby" asc ) //等价sort
.show()
+---+-----+
| id|hobby|
+---+-----+
| 2| FOOT|
| 2|SLEEP|
| 1| GAME|
| 1| TV|
+---+-----+
limit
val df=spark.sparkContext.parallelize(List((1,"TV,GAME"),(2,"SLEEP,FOOT"))).toDF("id","hobbies")
df.flatMap(row=> row.getAs[String]("hobbies").split(",").map(t=>(row.getAs[Int]("id"),t))).toDF("id","hobby")
.orderBy($"id" desc ,$"hobby" asc ) //等价sort
.limit(3)
.show()
+---+-----+
| id|hobby|
+---+-----+
| 2| FOOT|
| 2|SLEEP|
| 1| GAME|
+---+-----+
Dataset/DataFrame SQL
Spark支持SQL查询,需要用户在使用之前创建视图。
Employee
Michael,29,20000,true,MANAGER,1
Andy,30,15000,true,SALESMAN,1
Justin,19,8000,true,CLERK,1
Kaine,20,20000,true,MANAGER,2
Lisa,19,18000,false,SALESMAN,2
Dept
1,研发
2,设计
3,产品
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS()
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS()
//注册视图
deptDS.createOrReplaceTempView("t_dept")
userDS.createOrReplaceTempView("t_user")
SQL查询
spark.sql("select *, salary * 12 as annual_salary from t_user").show()
+-------+---+-------+-----+--------+------+-------------+
| name|age| salary| sex| job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER| 1| 240000.0|
| Andy| 30|15000.0| true|SALESMAN| 1| 180000.0|
| Justin| 19| 8000.0| true| CLERK| 1| 96000.0|
| Kaine| 20|20000.0| true| MANAGER| 2| 240000.0|
| Lisa| 19|18000.0|false|SALESMAN| 2| 216000.0|
+-------+---+-------+-----+--------+------+-------------+
like模糊
spark.sql("select * from t_user where deptNo=1 and (name like '%cha%' or name like '%us%')").show()
| name|age|slalary| sex| job|deptNo|
+-------+---+-------+----+-------+------+
|Michael| 29|20000.0|true|MANAGER| 1|
| Justin| 19| 8000.0|true| CLERK| 1|
+-------+---+-------+----+-------+------+
排序查询
spark.sql("select * from t_user order by salary desc,name desc ").show()
+-------+---+-------+-----+--------+------+
| name|age| salary| sex| job|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
| Andy| 30|15000.0| true|SALESMAN| 1|
| Justin| 19| 8000.0| true| CLERK| 1|
+-------+---+-------+-----+--------+------+
limit查询
spark.sql("select * from t_user order by salary desc,name desc limit 3").show()
| name|age| salary| sex| job|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
+-------+---+-------+-----+--------+------+
分组查询
spark.sql("select deptNo,count(*) as total , avg(salary) avg_salary from t_user group by deptNo").show()
|deptNo|total| avg_salary|
+------+-----+------------------+
| 1| 3|14333.333333333334|
| 2| 2| 19000.0|
+------+-----+------------------+
having子句
var sql="select deptNo,count(*) as total,avg(salary) avg_salary from t_user group by deptNo having avg_salary > 15000"
spark.sql(sql).show()
+------+-----+----------+
|deptNo|total|avg_salary|
+------+-----+----------+
| 2| 2| 19000.0|
+------+-----+----------+
case-when
var sql="select name,salary, case sex when true then '男' when false then '女' else '未知' end as sex from t_user"
spark.sql(sql).show()
+-------+-------+---+
| name| salary|sex|
+-------+-------+---+
|Michael|20000.0| 男|
| Andy|15000.0| 男|
| Justin| 8000.0| 男|
| Kaine|20000.0| 男|
| Lisa|18000.0| 女|
+-------+-------+---+
等价写法
"select name,salary, case when sex=true then '男' when sex=false then '女' else '未知' end as sex from t_user"
别名不能是中文
行专列
case-when
val coursedf = spark.sparkContext.parallelize(List(
StudentCourse(1, "语文", 100),
StudentCourse(1, "数学", 100),
StudentCourse(1, "英语", 100),
StudentCourse(2, "数学", 79),
StudentCourse(2, "语文", 80),
StudentCourse(2, "英语", 100)
)).toDF()
coursedf.createOrReplaceTempView("t_course")
spark.sql("select id,max(case course when '语文' then score else 0 end) as chinese,max(case course when '数学' then score else 0 end ) as math,max(case course when '英语' then score else 0 end ) as english from t_course group by id ").show()
+---+-------+-----+-------+
| id|chinese| math|english|
+---+-------+-----+-------+
| 1| 100.0|100.0| 100.0|
| 2| 80.0| 79.0| 100.0|
+---+-------+-----+-------+
pivot
val userCostDF = spark.sparkContext.parallelize(List(
StudentCourse(1, "语文", 100),
StudentCourse(1, "数学", 100),
StudentCourse(1, "英语", 100),
StudentCourse(2, "数学", 79),
StudentCourse(2, "语文", 80),
StudentCourse(2, "英语", 100)
)).toDF()
userCostDF.createOrReplaceTempView("t_course")
spark.sql("select * from t_course pivot(max(score) for course in ('数学','语文','英语')) ")
.na.fill(0).show()
+---+-----+-----+-----+
| id| 数学| 语文| 英语|
+---+-----+-----+-----+
| 1|100.0|100.0|100.0|
| 2| 79.0| 80.0|100.0|
+---+-----+-----+-----+
这里需要注意,在书写SQL的时候除去聚合字段和输出列明字段,其他字段作为groupby后的隐藏字段。
表连接
spark.sql("select * from t_user left join t_dept on deptNo=id").show()
+-------+---+-------+-----+--------+------+---+----+
| name|age| salary| sex| job|deptNo| id|name|
+-------+---+-------+-----+--------+------+---+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|研发|
| Andy| 30|15000.0| true|SALESMAN| 1| 1|研发|
| Justin| 19| 8000.0| true| CLERK| 1| 1|研发|
| Kaine| 20|20000.0| true| MANAGER| 2| 2|设计|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2|设计|
+-------+---+-------+-----+--------+------+---+----+
spark支持inner join、left outer、right outer、full outer join连接
子查询
spark.sql("select * ,(select count(t1.salary) from t_user t1 where (t1.deptNo = t2.deptNo) group by t1.deptNo) as total from t_user t2 left join t_dept on t2.deptNo=id order by t2.deptNo asc,t2.salary desc").show()
+-------+---+-------+-----+--------+------+---+----+-----+
| name|age| salary| sex| job|deptNo| id|name|total|
+-------+---+-------+-----+--------+------+---+----+-----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|研发| 3|
| Andy| 30|15000.0| true|SALESMAN| 1| 1|研发| 3|
| Justin| 19| 8000.0| true| CLERK| 1| 1|研发| 3|
| Kaine| 20|20000.0| true| MANAGER| 2| 2|设计| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2|设计| 2|
+-------+---+-------+-----+--------+------+---+----+-----+
注意在spark中仅仅支持内嵌子查询
=
查询
开窗函数
在正常的统计分析中 ,通常使用聚合函数作为分析,聚合分析函数的特点是将n行记录合并成一行,在数据库的统计当中 还有一种统计称为开窗统计,开窗函数可以实现将一行变成多行。可以将数据库查询的每一条记录比作是一幢高楼的一 层, 开窗函数就是在每一层开一扇窗, 让每一层能看到整装楼的全貌或一部分。
- 输出员工信息以及所在部门的平均薪资
spark.sql("select u.name,u.deptNo,u.salary,d.name,avg(salary) over(partition by deptNo ) avg from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+------------------+
| name|deptNo| salary|name| avg|
+-------+------+-------+----+------------------+
|Michael| 1|20000.0|研发|14333.333333333334|
| Andy| 1|15000.0|研发|14333.333333333334|
| Justin| 1| 8000.0|研发|14333.333333333334|
| Kaine| 2|20000.0|设计| 19000.0|
| Lisa| 2|18000.0|设计| 19000.0|
+-------+------+-------+----+------------------+
等价写法
select u2.name,u2.deptNo,u2.salary,d.name,(select avg(salary) from t_user u1 where u1.deptNo=u2.deptNo group by u1.deptNo) avgSalary from t_user u2 left join t_dept d on deptNo=id
- 计算员工在自己部门的薪资排序
spark.sql("select u.name,u.deptNo,u.salary,d.name,sum(1) over(partition by deptNo order by salary desc) rank from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
|Michael| 1|20000.0|研发| 1|
| Andy| 1|15000.0|研发| 2|
| Justin| 1| 8000.0|研发| 3|
| Kaine| 2|20000.0|设计| 1|
| Lisa| 2|18000.0|设计| 2|
+-------+------+-------+----+----+
其中sum(1)可以替换成ROW_NUM()
- 计算员工在公司的薪资排名
spark.sql("select t_user.name,deptNo,salary,t_dept.name,ROW_NUMBER() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin| 1| 8000.0|研发| 5|
| Andy| 1|15000.0|研发| 4|
|Michael| 1|20000.0|研发| 1|
| Kaine| 2|20000.0|设计| 2|
| Lisa| 2|18000.0|设计| 3|
+-------+------+-------+----+----+
可以看出,ROW_NUMBER()只表示数据出库的顺序,无法比较真正的顺序。因此一般在做排名的时候一般使用RANK()或者DENSE_RANK()函数。
- 使用RANK函数实现薪资排名(序号不连续)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin| 1| 8000.0|研发| 5|
| Andy| 1|15000.0|研发| 4|
|Michael| 1|20000.0|研发| 1|
| Kaine| 2|20000.0|设计| 1|
| Lisa| 2|18000.0|设计| 3|
+-------+------+-------+----+----+
- 使用DENSE_RANK函数实现薪资排名(序号连续)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,DENSE_RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin| 1| 8000.0|研发| 4|
| Andy| 1|15000.0|研发| 3|
|Michael| 1|20000.0|研发| 1|
| Kaine| 2|20000.0|设计| 1|
| Lisa| 2|18000.0|设计| 2|
+-------+------+-------+----+----+
- 计算每个部门和本部门最高薪资的差值
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- max(salary) over(partition by deptNo order by salary desc rows between unbounded preceding and current row)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+--------+
| name|deptNo| salary|name| diff|
+-------+------+-------+----+--------+
|Michael| 1|20000.0|研发| 0.0|
| Andy| 1|15000.0|研发| -5000.0|
| Justin| 1| 8000.0|研发|-12000.0|
| Kaine| 2|20000.0|设计| 0.0|
| Lisa| 2|18000.0|设计| -2000.0|
+-------+------+-------+----+--------+
- 计算本部门的员工工资和最小工资的差值
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- min(salary) over(partition by deptNo order by salary desc rows between current row and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
| name|deptNo| salary|name| diff|
+-------+------+-------+----+-------+
| Justin| 1| 8000.0|研发| 0.0|
| Andy| 1|15000.0|研发| 7000.0|
|Michael| 1|20000.0|研发|12000.0|
| Kaine| 2|20000.0|设计| 2000.0|
| Lisa| 2|18000.0|设计| 0.0|
+-------+------+-------+----+-------+
- 计算本部门员工和公司平均薪资的差值
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- avg(salary) over(order by salary rows between unbounded preceding and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
| name|deptNo| salary|name| diff|
+-------+------+-------+----+-------+
|Michael| 1|20000.0|研发| 3800.0|
| Andy| 1|15000.0|研发|-1200.0|
| Justin| 1| 8000.0|研发|-8200.0|
| Lisa| 2|18000.0|设计| 1800.0|
| Kaine| 2|20000.0|设计| 3800.0|
+-------+------+-------+----+-------+
总结
聚合函数(字段) over ([partition by 字段] order by 字段 asc [rows between 起始行偏移量 and 终止偏移量] )
其中:偏移量的取值
preceding:用于累加前N行(分区之内)。若是从分区第一行头开始,则为 unbounded。 N为:相对当前行向前的偏移量。
following:与preceding相反,累加后N行(分区之内)。若是累加到该分区结束则为unbounded。N为:相对当前行向后的偏移量
current row:顾名思义,当前行,偏移量为0
说明:上边的前N,后M,以及current row均会累加该偏移量所在行
自定义函数
单行函数
更具员工的职位,计算年薪
spark.udf.register("annual_salary",(job:String,salary:Double)=>{
job match {
case "MANAGER" => salary*12 + 5000000
case "SALESMAN" => salary*12 + 100000
case "CLERK" => salary*12 + 20000
case _ => salary*12
}
})
spark.sql("select * ,annual_salary(job,salary) annual_salary from t_user").show()
+-------+---+-------+-----+--------+------+-------------+
| name|age| salary| sex| job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER| 1| 5240000.0|
| Andy| 30|15000.0| true|SALESMAN| 1| 280000.0|
| Justin| 19| 8000.0| true| CLERK| 1| 116000.0|
| Kaine| 20|20000.0| true| MANAGER| 2| 5240000.0|
| Lisa| 19|18000.0|false|SALESMAN| 2| 316000.0|
+-------+---+-------+-----+--------+------+-------------+
如果使用API形式调用可以使用:
userDS.selectExpr("name","age","job","salary","annual_salary(job,salary) as annual_salary").show()
聚合函数(了解)
有类型聚合(SQL)
自定义UserDefinedAggregateFunction实现类
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
class DeptSalarySum extends UserDefinedAggregateFunction{
//说明输入的Schema
override def inputSchema: StructType = {
new StructType().add("salary","double")
}
//缓冲临时变量
override def bufferSchema: StructType = {
new StructType().add("total","double")
}
//返回值类型
override def dataType: DataType = {
DoubleType
}
//表示系统给定的一组输入,总有固定的输出类型
override def deterministic: Boolean = true
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0,0.0D)
}
//局部计算
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val history = buffer.getDouble(0)
buffer.update(0,history + input.getAs[Double](0))
}
//中间结果聚合
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1.update(0,buffer1.getDouble(0)+buffer2.getDouble(0))
}
//返回最终结果
override def evaluate(buffer: Row): Any = {
buffer.getDouble(0)
}
}
注册聚合类
spark.udf.register("deptSalarySum",new DeptSalarySum)
调用聚合函数
spark.sql("select deptNo,deptSalarySum(salary) as salary from t_user group by deptNo").show()
+------+---------------------+
|deptNo|deptsalarysum(salary)|
+------+---------------------+
| 1| 43000.0|
| 2| 38000.0|
+------+---------------------+
强类型聚合(API)
强类型聚合用户需要实现Aggregate接口同时需要定制State变量用于存储聚合过程中的中间变量。用于实现针对Dataset数据的集合的聚合。
DeptSalarySate
class DeptSalarySate(var initValue:Double=0 ) {
def add(v:Double): Unit ={
initValue += v
}
def get():Double={
initValue
}
}
DeptSalaryAggregator
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
class DeptSalaryAggregator extends Aggregator[Employee,DeptSalarySate,Double]{
override def zero: DeptSalarySate = new DeptSalarySate()
override def reduce(b: DeptSalarySate, a: Employee): DeptSalarySate ={
b.add(a.salary)
b
}
override def merge(b1: DeptSalarySate, b2: DeptSalarySate): DeptSalarySate = {
b1.add(b2.get())
b1
}
override def finish(reduction: DeptSalarySate): Double = {
reduction.get()
}
override def bufferEncoder: Encoder[DeptSalarySate] = {
//Encoders.product[DeptSalarySate]
Encoders.kryo(classOf[DeptSalarySate])
}
override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
测试程序
var myagg=new DeptSalaryAggregator().toColumn.name("avg")
userDS.select(myagg).show()
+-------+
| avg|
+-------+
|81000.0|
+-------+
Load/Save
paquet文件
生成parquet文件
Parquet仅仅是一种存储格式,它是语言、平台无关的,并且不需要和任何一种数据处理框架绑定.
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.save("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")
spark.close()
读取Parquet文件
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
//spark.read.parquet 等价
spark.read.load("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")
.show()
spark.close()
+-------+-----+-------+-------+-------------+---------+
| name| sex| salary|dept_no|annual_salary|dept_name|
+-------+-----+-------+-------+-------------+---------+
|Michael| true|20000.0| 1| 240000.0| 研发|
| Andy| true|15000.0| 1| 180000.0| 研发|
| Justin| true| 8000.0| 1| 96000.0| 研发|
| Kaine| true|20000.0| 2| 240000.0| 设计|
| Lisa|false|18000.0| 2| 216000.0| 设计|
+-------+-----+-------+-------+-------------+---------+
Json格式
产生json格式
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode(SaveMode.Ignore)
.format("json")
.save("file:///Users/jiangzz/Desktop/results/json/")
spark.close()
读取Json格式
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
spark.read.json("file:///Users/jiangzz/Desktop/results/json/")
.show()
spark.close()
csv格式
生成
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode(SaveMode.Ignore)
.format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.save("file:///Users/jiangzz/Desktop/results/csv/")
spark.close()
读取
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
spark.read
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.csv("file:///Users/jiangzz/Desktop/results/csv/")
.show()
spark.close()
ORC格式
ORC的全称是(Optimized Row Columnar),ORC文件格式是一种Hadoop生态圈中的列式存储格式,它的产生早在2013年初,最初产生自Apache Hive,用于降低Hadoop数据存储空间和加速Hive查询速度。
生成
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode(SaveMode.Ignore)
.format("orc")
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.save("file:///Users/jiangzz/Desktop/results/orc/")
spark.close()
读取
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
spark.read
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.orc("file:///Users/jiangzz/Desktop/results/orc/")
.show()
spark.close()
SQL读取文件
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val parqeutDF = spark.sql("SELECT * FROM parquet.`file:///Users/jiangzz/Desktop/results/parquet/results.parquet`")
val jsonDF = spark.sql("SELECT * FROM json.`file:///Users/jiangzz/Desktop/results/json/`")
val orcDF = spark.sql("SELECT * FROM orc.`file:///Users/jiangzz/Desktop/results/orc/`")
//parqeutDF.show()
//jsonDF.show()
//csvDF.show()
orcDF.show()
spark.close()
JDBC数据库读取
读取MysQL
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
spark.read
.format("jdbc")
.option("url", "jdbc:mysql://CentOS:3306/test")
.option("dbtable", "t_user")
.option("user", "root")
.option("password", "root")
.load().createTempView("t_user")
spark.sql("select * from t_user").show()
spark.close()
写入MySQL
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
val props = new Properties()
props.put("user", "root")
props.put("password", "root")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode("append")
.jdbc("jdbc:mysql://CentOS:3306/test","t_user",props)
spark.close()