Selecting columns: select

Let's walk through an example:

scala> val df = spark.createDataset(Seq(
  ("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
  ).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)

scala> df.collect
res34: Array[org.apache.spark.sql.Row] = Array([aaa,1,2], [bbb,3,4], [ccc,3,5], [bbb,4,6])

The code above creates a DataFrame with three columns named key1, key2, and key3, of types string, integer, and integer. Four records have been put in, as shown above.
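For reference, the same DataFrame can also be built a bit more directly by calling toDF on the Seq itself. A minimal sketch, assuming a spark-shell session where the SparkSession is available as spark (dfAlt is just an illustrative name):

import spark.implicits._   // pre-imported automatically in spark-shell

// build the identical DataFrame straight from a Seq of tuples
val dfAlt = Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
  .toDF("key1", "key2", "key3")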

Next, let's look at several ways to select columns:

scala> df.select("key1").collect
res49: Array[org.apache.spark.sql.Row] = Array([aaa], [bbb], [ccc], [bbb])

scala> df.select($"key1").collect
res50: Array[org.apache.spark.sql.Row] = Array([aaa], [bbb], [ccc], [bbb])

scala> df.select(df.col("key1")).collect
res51: Array[org.apache.spark.sql.Row] = Array([aaa], [bbb], [ccc], [bbb])

When select is given plain strings, the arguments can only be column names that already exist in the DataFrame; you can't apply SQL-style select expressions to them. The $"key1" form, on the other hand, resolves to the Column-typed overload of select, which does support SQL-like select expressions:

scala> df.select(upper($"key1")).collect
res58: Array[org.apache.spark.sql.Row] = Array([AAA], [BBB], [CCC], [BBB])

scala> df.select(upper("key1")).collect
<console>:27: error: type mismatch;
 found   : String("key1")
 required: org.apache.spark.sql.Column
       df.select(upper("key1")).collect

Above, the upper function is applied to key1 inside select to convert it to upper case. Note the $ in front of "key1": this is one of Scala's beloved bits of syntactic sugar (brought in by spark.implicits._), and knowing it makes writing this code much more convenient. In the second call, without the $ in front of key1, we get an error: a Column is required, but a String was given.
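For completeness, here is a minimal sketch of the imports these snippets rely on. spark-shell brings them in for you; a standalone program needs them explicitly (spark here is assumed to be your SparkSession):

import spark.implicits._                 // enables the $"colName" string interpolator
import org.apache.spark.sql.functions._  // upper, lit, col, expr, count, max, avg, ...

// with both in scope, these are equivalent ways of handing select a Column:
df.select($"key1", col("key2"), df("key3"))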

This kind of select can also be replaced with the selectExpr method. For example:

scala> df.selectExpr("upper(key1)", "key2 as haha2").show
+-----------+-----+
|upper(key1)|haha2|
+-----------+-----+
|        AAA|    1|
|        BBB|    3|
|        CCC|    3|
|        BBB|    4|
+-----------+-----+

The key1 column is passed through the upper-case function, and key2 is given the alias haha2. OK, everything works!
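Because selectExpr takes full SQL expressions, arithmetic and built-in SQL functions can be mixed freely. A small sketch against the same df (the aliases total and diff are just illustrative):

df.selectExpr("key1", "key2 + key3 as total", "abs(key3 - key2) as diff").show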

For the WHERE part you can use either the filter function or the where function. The two are used in exactly the same way; the official docs describe where simply as an alias of filter.

The data is still the dataset built earlier:

scala> val df = spark.createDataset(Seq(
  ("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4, 6))   ).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

The filter function

From the Spark documentation, filter comes in the following forms:

def filter(func: (T) ⇒ Boolean): Dataset[T]
def filter(conditionExpr: String): Dataset[T]
def filter(condition: Column): Dataset[T]

So all of the following are valid:

scala> df.filter($"key1">"aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+


scala> df.filter($"key1"==="aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter("key1='aaa'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter("key2=1").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter($"key2"===3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
+----+----+----+

scala> df.filter($"key2"===$"key3"-1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
+----+----+----+

Here, === is a function defined on the Column class; the corresponding not-equal operator is =!=. $"colName" is syntactic sugar that returns a Column object.
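Two more variants worth knowing, sketched against the same df: Column predicates can be combined with && and ||, and the first overload listed above takes an ordinary Scala function evaluated per Row:

// Column overload: combine predicates with && / || (! negates)
df.filter($"key1" === "bbb" && $"key2" > 3).show

// func overload: a plain Scala predicate on each Row
df.filter(row => row.getAs[Int]("key2") > 2).show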

The where function

scala> df.where("key1 = 'bbb'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| bbb|   4|   6|
+----+----+----+


scala> df.where($"key2"=!= 3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   4|   6|
+----+----+----+


scala> df.where($"key3">col("key2")).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+


scala> df.where($"key3">col("key2")+1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
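Column also provides helpers such as isin and between that read close to SQL; a minimal sketch:

df.where($"key2".isin(1, 4)).show      // key2 IN (1, 4)
df.where($"key3".between(4, 5)).show   // key3 BETWEEN 4 AND 5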

Grouping, aggregation, and sorting

scala> val df = spark.createDataset(Seq(
  ("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4, 6))   ).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)

scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

Let's start with the simplest grouped count:

/*
Equivalent SQL: select key1, count(*) from table group by key1
*/
scala> df.groupBy("key1").count.show
+----+-----+
|key1|count|
+----+-----+
| ccc|    1|
| aaa|    1|
| bbb|    2|
+----+-----+

Note that count above is not the total number of records; it counts the rows within each group produced by groupBy. If what you want is how many groups there are after grouping, use the code below; for this sample dataset the answer is 3:

/*
Equivalent SQL: select distinct key1 from table
*/
scala> df.select("key1").distinct.show
+----+
|key1|
+----+
| ccc|
| aaa|
| bbb|
+----+

/*
Equivalent SQL:
select count(distinct key1) from table
*/
scala> df.select("key1").distinct.count
res3: Long = 3

In the result above, if you are as obsessive about order as I am, you will have noticed that key1 is not sorted. Unacceptable. Fix it like this:

/*
Equivalent SQL:
select key1 , count(*) from table 
group by key1 
order by key1
*/

scala> df.groupBy("key1").count.sort("key1").show
+----+-----+
|key1|count|
+----+-----+
| aaa|    1|
| bbb|    2|
| ccc|    1|
+----+-----+

/*
Equivalent SQL:
select key1 , count(*) from table 
group by key1 
order by key1 desc
*/

scala> df.groupBy("key1").count.sort($"key1".desc).show
+----+-----+
|key1|count|
+----+-----+
| ccc|    1|
| bbb|    2|
| aaa|    1|
+----+-----+

Note that one of these is ascending and the other descending, matching "select key1, count(*) from table group by key1 order by key1 desc". When desc is specified, key1 carries a $ prefix; as mentioned earlier, $"column-name" is just syntactic sugar for col(column-name). From here on the $ prefix won't be explained again.
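As a side note, sort has the alias orderBy, and the asc()/desc() functions from org.apache.spark.sql.functions can be used instead of the $"col".desc style; a minimal sketch:

df.groupBy("key1").count.orderBy(asc("key1")).show    // same as sort("key1")
df.groupBy("key1").count.orderBy(desc("key1")).show   // same as sort($"key1".desc)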

Let's refine this a bit further: so far the sorting has been on key1. What if we want to sort by the group counts in descending order? From the show output above, the count column is simply named count, so the following naturally comes to mind:

/*
Equivalent SQL:
select key1 , count(*) from table  
group by key1 
order by count(*) desc
*/
scala> df.groupBy("key1").count.sort($"count".desc).show
+----+-----+
|key1|count|
+----+-----+
| bbb|    2|
| ccc|    1|
| aaa|    1|
+----+-----+

Alternatively, rename the column with the withColumnRenamed function:

/*
Equivalent SQL:
select key1 , count(*) as cnt from table  
group by key1 
order by cnt desc
*/
scala> df.groupBy("key1").count.withColumnRenamed("count", "cnt").sort($"cnt".desc).show
+----+---+
|key1|cnt|
+----+---+
| bbb|  2|
| aaa|  1|
| ccc|  1|
+----+---+

A more common approach is to give count(*) an alias directly, like this:

/*
Equivalent SQL:
select key1, count(*) as cnt from table
group by key1
*/
scala> df.groupBy("key1").agg(count("key1").as("cnt")).show
+----+---+
|key1|cnt|
+----+---+
| ccc|  1|
| aaa|  1|
| bbb|  2|
+----+---+

So here we've introduced the aggregation method agg. It is usually used together with groupBy and is quite flexible. The examples below show it in action; note the difference between the Column-typed and String-typed parameters (its signatures are listed first):

def agg(expr: Column, exprs: Column*): DataFrame
def agg(exprs: Map[String, String]): DataFrame
def agg(aggExpr: (String, String), 
        aggExprs: (String, String)*): DataFrame

/*
Equivalent SQL:
select key1, count(key1), max(key2), avg(key3)
from table
group by key1
*/
scala> df.groupBy("key1").agg(count("key1"), max("key2"), avg("key3")).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

scala> df.groupBy("key1")
        .agg("key1"->"count", "key2"->"max", "key3"->"avg").show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

scala> df.groupBy("key1").agg(Map(("key1","count"), ("key2","max"), ("key3","avg"))).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

scala> df.groupBy("key1")
        .agg(("key1","count"), ("key2","max"), ("key3","avg")).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc|          1|        3|      5.0|
| aaa|          1|        1|      2.0|
| bbb|          2|        4|      5.0|
+----+-----------+---------+---------+

/*
Equivalent SQL:
select key1, count(key1) cnt, 
       max(key2) max_key2, avg(key3) avg_key3 
from table 
group by key1 
order by cnt, max_key2 desc
*/

scala> df.groupBy("key1").agg(count("key1").as("cnt"), max("key2").as("max_key2"), avg("key3").as("avg_key3")).sort($"cnt",$"max_key2".desc).show
+----+---+--------+--------+
|key1|cnt|max_key2|avg_key3|
+----+---+--------+--------+
| ccc|  1|       3|     5.0|
| aaa|  1|       1|     2.0|
| bbb|  2|       4|     5.0|
+----+---+--------+--------+
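Since groupBy(...).agg(...) just returns another DataFrame, chaining a where/filter after it plays the role of SQL's HAVING; a minimal sketch:

/*
Equivalent SQL:
select key1, count(key1) as cnt from table
group by key1
having count(key1) > 1
*/
df.groupBy("key1").agg(count("key1").as("cnt")).where($"cnt" > 1).show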

Other single-table operations

There are still a few odds and ends that haven't been covered, such as adding columns, dropping columns, and handling NA values, so let's run through them here.

The dataset is still the same one as before:

scala> val df = spark.createDataset(Seq(
  ("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4, 6))   ).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)

scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

Now let's add a column. It can be a string or an integer; it can be a constant or a transformation of an existing column. All of these work:

/* 
Add a string column key4, initialized to "new_str_col" for every row; note the lit() function here.
Since people have asked about lit(), a quick note: lit() is a built-in Spark function and requires
import org.apache.spark.sql.functions._
def lit(literal: Any): Column — Creates a Column of literal value. The passed-in object is returned
directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column
as well. Otherwise, a new Column is created to represent the literal value.
Since 1.3.0
*/
scala> val df_1 = df.withColumn("key4", lit("new_str_col"))
df_1: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 2 more fields]

scala> df_1.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)

scala> df_1.show
+----+----+----+-----------+
|key1|key2|key3|       key4|
+----+----+----+-----------+
| aaa|   1|   2|new_str_col|
| bbb|   3|   4|new_str_col|
| ccc|   3|   5|new_str_col|
| bbb|   4|   6|new_str_col|
+----+----+----+-----------+

/* 
Likewise, add an Int column key5, initialized to 1024 for every row.
*/
scala> val df_2 = df_1.withColumn("key5", lit(1024))
df_2: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 3 more fields]

scala> df_2.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)
 |-- key5: integer (nullable = false)

scala> df_2.show
+----+----+----+-----------+----+
|key1|key2|key3|       key4|key5|
+----+----+----+-----------+----+
| aaa|   1|   2|new_str_col|1024|
| bbb|   3|   4|new_str_col|1024|
| ccc|   3|   5|new_str_col|1024|
| bbb|   4|   6|new_str_col|1024|
+----+----+----+-----------+----+

/*
Now a new column that is not a constant: key6 = key5 * 2.
*/
scala> val df_3 = df_2.withColumn("key6", $"key5"*2)
df_3: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df_3.show
+----+----+----+-----------+----+----+
|key1|key2|key3|       key4|key5|key6|
+----+----+----+-----------+----+----+
| aaa|   1|   2|new_str_col|1024|2048|
| bbb|   3|   4|new_str_col|1024|2048|
| ccc|   3|   5|new_str_col|1024|2048|
| bbb|   4|   6|new_str_col|1024|2048|
+----+----+----+-----------+----+----+

/*
This time using the expr() function.
*/
scala> val df_4 = df_2.withColumn("key6", expr("key5 * 4"))
df_4: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df_4.show
+----+----+----+-----------+----+----+
|key1|key2|key3|       key4|key5|key6|
+----+----+----+-----------+----+----+
| aaa|   1|   2|new_str_col|1024|4096|
| bbb|   3|   4|new_str_col|1024|4096|
| ccc|   3|   5|new_str_col|1024|4096|
| bbb|   4|   6|new_str_col|1024|4096|
+----+----+----+-----------+----+----+
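A new column also doesn't have to be a constant or simple arithmetic: when/otherwise from org.apache.spark.sql.functions builds a conditional column. A hedged sketch (the key7 name and the labels are just for illustration):

// label each row "big" or "small" depending on key2
df_2.withColumn("key7", when($"key2" >= 3, "big").otherwise("small")).show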

Dropping columns is simpler: just give the column name.

/*
Drop column key5.
*/
scala> val df_5 = df_4.drop("key5")
df_5: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 3 more fields]

scala> df_4.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)
 |-- key5: integer (nullable = false)
 |-- key6: integer (nullable = false)

scala> df_5.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)
 |-- key6: integer (nullable = false)

scala> df_5.show
+----+----+----+-----------+----+
|key1|key2|key3|       key4|key6|
+----+----+----+-----------+----+
| aaa|   1|   2|new_str_col|4096|
| bbb|   3|   4|new_str_col|4096|
| ccc|   3|   5|new_str_col|4096|
| bbb|   4|   6|new_str_col|4096|
+----+----+----+-----------+----+

/*
Several columns can be dropped at once, here key4 and key6.
*/
scala> val df_6 = df_5.drop("key4", "key6")
df_6: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

/* The columns method returns all column names as an array */
scala> df_6.columns
res23: Array[String] = Array(key1, key2, key3)

scala> df_6.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

Now a few examples of handling nulls and other invalid values. This needs a different dataset: the table containing nulls is loaded from a csv file, as follows:

/*
The csv file contents are:
key1,key2,key3,key4,key5
aaa,1,2,t1,4
bbb,5,3,t2,8
ccc,2,2,,7
,7,3,t1,
bbb,1,5,t3,0
,4,,t1,8 
*/
scala> val df = spark.read.option("header","true").csv("natest.csv")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
|null|   4|null|  t1|   8|
+----+----+----+----+----+

/*
Replace all null values in column key1 with 'xxx'.
*/
scala> val df_2 = df.na.fill("xxx",Seq("key1"))
df_2: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_2.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| xxx|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
| xxx|   4|null|  t1|   8|
+----+----+----+----+----+

/*
An example of modifying several columns of the same type at once.
Here, all null values in key3 and key5 are replaced with 1024.
Columns read from csv default to string; for integer columns the call looks the same, since there are overloads for each type.
*/
scala> val df_3 = df.na.fill("1024",Seq("key3","key5"))
df_3: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_3.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|1024|
| bbb|   1|   5|  t3|   0|
|null|   4|1024|  t1|   8|
+----+----+----+----+----+

/*
An example of modifying several columns of different types at once.
Columns read from csv default to string; for integer columns the call looks the same, since there are overloads for each type.
*/
scala> val df_3 = df.na.fill(Map(("key1"->"yyy"),("key3","1024"),("key4","t88"),("key5","4096")))
df_3: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_3.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2| t88|   7|
| yyy|   7|   3|  t1|4096|
| bbb|   1|   5|  t3|   0|
| yyy|   4|1024|  t1|   8|
+----+----+----+----+----+

/*
Don't modify anything, just filter out rows containing null values.
Here, rows where key3 or key5 is null are dropped.
*/
scala> val df_4 = df.na.drop(Seq("key3","key5"))
df_4: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_4.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+

/*
Drop rows that have fewer than n non-null values among the specified columns.
Here, rows with fewer than 2 non-null values among key1, key2 and key3 are dropped. In the last row, 2 of those 3 columns are null, so it is filtered out.
*/
scala> val df_5 = df.na.drop(2,Seq("key1","key2","key3"))
df_5: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
|null|   4|null|  t1|   8|
+----+----+----+----+----+

scala> df_5.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+

/*
Same as above; if no column list is given, all columns are used by default.
*/
scala> val df_6 = df.na.drop(4)
df_6: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_6.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+
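Besides fill and drop, the na handle also offers replace, which swaps specific (non-null) values; a minimal sketch on the same csv-loaded df, with purely illustrative replacement values:

// replace the value "t1" with "T1" in column key4
df.na.replace("key4", Map("t1" -> "T1")).show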

Multi-table operations: join

First, the definitions of the two source tables:

scala> val df1 = spark.createDataset(Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6)) ).toDF("key1","key2","key3")
df1: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> val df2 = spark.createDataset(Seq(("aaa", 2, 2),    ("bbb", 3, 5),    ("ddd", 3, 5),    ("bbb", 4, 6), ("eee", 1, 2), ("aaa", 1, 5), ("fff",5,6))).toDF("key1","key2","key4")
df2: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df1.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)


scala> df2.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key4: integer (nullable = false)

scala> df1.show()
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

scala> df2.show()
+----+----+----+
|key1|key2|key4|
+----+----+----+
| aaa|   2|   2|
| bbb|   3|   5|
| ddd|   3|   5|
| bbb|   4|   6|
| eee|   1|   2|
| aaa|   1|   5|
| fff|   5|   6|
+----+----+----+

Spark's join support is rich: equi-joins, conditional joins, and natural joins are all available. The join types include inner, outer, left outer, right outer, left semi, and Cartesian (cross) joins.

Let's go through them one by one, starting with the inner join:

/*
Inner join: select * from df1 join df2 on df1.key1 = df2.key1
*/
scala> val df3 = df1.join(df2,"key1")
df3: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 3 more fields]

scala> df3.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key2: integer (nullable = false)
 |-- key4: integer (nullable = false)

scala> df3.show
+----+----+----+----+----+
|key1|key2|key3|key2|key4|
+----+----+----+----+----+
| aaa|   1|   2|   1|   5|
| aaa|   1|   2|   2|   2|
| bbb|   3|   4|   4|   6|
| bbb|   3|   4|   3|   5|
| bbb|   4|   6|   4|   6|
| bbb|   4|   6|   3|   5|
+----+----+----+----+----+

/*
Still an inner join, this time using joinWith. The difference from join is that the schema of the resulting Dataset is different; compare it with the one above.
*/
scala> val df4=df1.joinWith(df2,df1("key1")===df2("key1"))
df4: org.apache.spark.sql.Dataset[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = [_1: struct<key1: string, key2: int ... 1 more field>, _2: struct<key1: string, key2: int ... 1 more field>]

scala> df4.printSchema
root
 |-- _1: struct (nullable = false)
 |    |-- key1: string (nullable = true)
 |    |-- key2: integer (nullable = false)
 |    |-- key3: integer (nullable = false)
 |-- _2: struct (nullable = false)
 |    |-- key1: string (nullable = true)
 |    |-- key2: integer (nullable = false)
 |    |-- key4: integer (nullable = false)

scala> df4.show
+---------+---------+
|       _1|       _2|
+---------+---------+
|[aaa,1,2]|[aaa,1,5]|
|[aaa,1,2]|[aaa,2,2]|
|[bbb,3,4]|[bbb,4,6]|
|[bbb,3,4]|[bbb,3,5]|
|[bbb,4,6]|[bbb,4,6]|
|[bbb,4,6]|[bbb,3,5]|
+---------+---------+

Next, the full outer join:

/*
select * from df1 full outer join df2 on df1.key1 = df2.key1
*/
scala> val df5 = df1.join(df2,df1("key1")===df2("key1"), "outer")
df5: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df5.show
+----+----+----+----+----+----+
|key1|key2|key3|key1|key2|key4|
+----+----+----+----+----+----+
|null|null|null| ddd|   3|   5|
| ccc|   3|   5|null|null|null|
| aaa|   1|   2| aaa|   2|   2|
| aaa|   1|   2| aaa|   1|   5|
| bbb|   3|   4| bbb|   3|   5|
| bbb|   3|   4| bbb|   4|   6|
| bbb|   4|   6| bbb|   3|   5|
| bbb|   4|   6| bbb|   4|   6|
|null|null|null| fff|   5|   6|
|null|null|null| eee|   1|   2|
+----+----+----+----+----+----+

Then the left outer join, right outer join, and left semi join:

/*
Left outer join
*/
scala> val df6 = df1.join(df2,df1("key1")===df2("key1"), "left_outer")
df6: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df6.show
+----+----+----+----+----+----+
|key1|key2|key3|key1|key2|key4|
+----+----+----+----+----+----+
| aaa|   1|   2| aaa|   1|   5|
| aaa|   1|   2| aaa|   2|   2|
| bbb|   3|   4| bbb|   4|   6|
| bbb|   3|   4| bbb|   3|   5|
| ccc|   3|   5|null|null|null|
| bbb|   4|   6| bbb|   4|   6|
| bbb|   4|   6| bbb|   3|   5|
+----+----+----+----+----+----+

/*
Right outer join
*/
scala> val df7 = df1.join(df2,df1("key1")===df2("key1"), "right_outer")
df7: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df7.show
+----+----+----+----+----+----+
|key1|key2|key3|key1|key2|key4|
+----+----+----+----+----+----+
| aaa|   1|   2| aaa|   2|   2|
| bbb|   4|   6| bbb|   3|   5|
| bbb|   3|   4| bbb|   3|   5|
|null|null|null| ddd|   3|   5|
| bbb|   4|   6| bbb|   4|   6|
| bbb|   3|   4| bbb|   4|   6|
|null|null|null| eee|   1|   2|
| aaa|   1|   2| aaa|   1|   5|
|null|null|null| fff|   5|   6|
+----+----+----+----+----+----+

/*
Left semi join
*/
scala> val df8 = df1.join(df2,df1("key1")===df2("key1"), "leftsemi")
df8: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df8.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| bbb|   4|   6|
+----+----+----+

The Cartesian join is not used very often; the tables people throw at Spark these days are huge, and a full cross join is just too expensive.

/*
Cartesian (cross) join
*/
scala> val df9 = df1.crossJoin(df2)
df9: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df9.count
res17: Long = 28

/* Just show the first 10 rows */
scala> df9.show(10)
+----+----+----+----+----+----+
|key1|key2|key3|key1|key2|key4|
+----+----+----+----+----+----+
| aaa|   1|   2| aaa|   2|   2|
| aaa|   1|   2| bbb|   3|   5|
| aaa|   1|   2| ddd|   3|   5|
| aaa|   1|   2| bbb|   4|   6|
| aaa|   1|   2| eee|   1|   2|
| aaa|   1|   2| aaa|   1|   5|
| aaa|   1|   2| fff|   5|   6|
| bbb|   3|   4| aaa|   2|   2|
| bbb|   3|   4| bbb|   3|   5|
| bbb|   3|   4| ddd|   3|   5|
+----+----+----+----+----+----+
only showing top 10 rows

The next example is still an equi-join; the difference from the earlier equi-joins is that the duplicate columns shared by the two tables are merged, just like in a natural join:

/*
Equi-join on the two common columns key1 and key2
*/
scala> val df10 = df1.join(df2, Seq("key1","key2"))
df10: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 2 more fields]

scala> df10.show
+----+----+----+----+
|key1|key2|key3|key4|
+----+----+----+----+
| aaa|   1|   2|   5|
| bbb|   3|   4|   5|
| bbb|   4|   6|   6|
+----+----+----+----+

Conditional joins were, if I remember correctly, not supported in older Spark versions, but they work fine now:

/*
select df1.*,df2.* from df1 join df2 
on df1.key1=df2.key1 and df1.key2>df2.key2
*/
scala> val df11 = df1.join(df2, df1("key1")===df2("key1") && df1("key2")>df2("key2"))
df11: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df11.show
+----+----+----+----+----+----+
|key1|key2|key3|key1|key2|key4|
+----+----+----+----+----+----+
| bbb|   4|   6| bbb|   3|   5|
+----+----+----+----+----+----+
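Finally, when both sides keep same-named columns after a join (key1 and key2 above), aliasing the DataFrames makes it easy to refer to one side explicitly; a minimal sketch:

// alias the two sides as "a" and "b", then pick columns explicitly
val joined = df1.as("a").join(df2.as("b"),
  $"a.key1" === $"b.key1" && $"a.key2" > $"b.key2")
joined.select($"a.key1", $"a.key2", $"b.key4").show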