1. Introduction

This article covers several useful Spark SQL functions, available since Spark 2.0, that help with complex nested JSON data such as maps and nested structures. As of Spark 2.1, these functions can also be used in Structured Streaming.

The following methods are the focus of this article:

A) get_json_object()

B) from_json()

C) to_json()

D) explode()

E) selectExpr()

2. Setup

First, create a JSON schema without any nesting:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val jsonSchema = new StructType()
  .add("battery_level", LongType)
  .add("c02_level", LongType)
  .add("cca3", StringType)
  .add("cn", StringType)
  .add("device_id", LongType)
  .add("device_type", StringType)
  .add("signal", LongType)
  .add("ip", StringType)
  .add("temp", LongType)
  .add("timestamp", TimestampType)
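
A quick way to double-check the field names and types is to print the schema as a tree (printTreeString is a standard StructType method):

// Print the schema in tree form
jsonSchema.printTreeString()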

Using the schema above, I create a Dataset here with a Scala case class and generate some JSON-formatted data. In production, of course, this data could also come from Kafka. The case class has two fields: an integer (the device id) and a string (the JSON structure representing the device event).

// define a case class
case class DeviceData(id: Int, device: String)

// create some sample data
val eventsDS = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp": 1475600496 }"""),
  (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp": 1475600498 }"""),
  (2, """{"device_id": 2, "device_type": "sensor-ipad", "ip": "88.36.5.1", "cca3": "ITA", "cn": "Italy", "temp": 18, "signal": 25, "battery_level": 5, "c02_level": 1372, "timestamp": 1475600500 }"""),
  (3, """{"device_id": 3, "device_type": "sensor-inest", "ip": "66.39.173.154", "cca3": "USA", "cn": "United States", "temp": 47, "signal": 12, "battery_level": 1, "c02_level": 1447, "timestamp": 1475600502 }"""),
  (4, """{"device_id": 4, "device_type": "sensor-ipad", "ip": "203.82.41.9", "cca3": "PHL", "cn": "Philippines", "temp": 29, "signal": 11, "battery_level": 0, "c02_level": 983, "timestamp": 1475600504 }"""),
  (5, """{"device_id": 5, "device_type": "sensor-istick", "ip": "204.116.105.67", "cca3": "USA", "cn": "United States", "temp": 50, "signal": 16, "battery_level": 8, "c02_level": 1574, "timestamp": 1475600506 }"""),
  (6, """{"device_id": 6, "device_type": "sensor-ipad", "ip": "220.173.179.1", "cca3": "CHN", "cn": "China", "temp": 21, "signal": 18, "battery_level": 9, "c02_level": 1249, "timestamp": 1475600508 }"""),
  (7, """{"device_id": 7, "device_type": "sensor-ipad", "ip": "118.23.68.227", "cca3": "JPN", "cn": "Japan", "temp": 27, "signal": 15, "battery_level": 0, "c02_level": 1531, "timestamp": 1475600512 }"""),
  (8, """{"device_id": 8, "device_type": "sensor-inest", "ip": "208.109.163.218", "cca3": "USA", "cn": "United States", "temp": 40, "signal": 16, "battery_level": 9, "c02_level": 1208, "timestamp": 1475600514 }"""),
  (9, """{"device_id": 9, "device_type": "sensor-ipad", "ip": "88.213.191.34", "cca3": "ITA", "cn": "Italy", "temp": 19, "signal": 11, "battery_level": 0, "c02_level": 1171, "timestamp": 1475600516 }"""),
  (10, """{"device_id": 10, "device_type": "sensor-igauge", "ip": "68.28.91.22", "cca3": "USA", "cn": "United States", "temp": 32, "signal": 26, "battery_level": 7, "c02_level": 886, "timestamp": 1475600518 }"""),
  (11, """{"device_id": 11, "device_type": "sensor-ipad", "ip": "59.144.114.250", "cca3": "IND", "cn": "India", "temp": 46, "signal": 25, "battery_level": 4, "c02_level": 863, "timestamp": 1475600520 }"""),
  (12, """{"device_id": 12, "device_type": "sensor-igauge", "ip": "193.156.90.200", "cca3": "NOR", "cn": "Norway", "temp": 18, "signal": 26, "battery_level": 8, "c02_level": 1220, "timestamp": 1475600522 }"""),
  (13, """{"device_id": 13, "device_type": "sensor-ipad", "ip": "67.185.72.1", "cca3": "USA", "cn": "United States", "temp": 34, "signal": 20, "battery_level": 8, "c02_level": 1504, "timestamp": 1475600524 }"""),
  (14, """{"device_id": 14, "device_type": "sensor-inest", "ip": "68.85.85.106", "cca3": "USA", "cn": "United States", "temp": 39, "signal": 17, "battery_level": 8, "c02_level": 831, "timestamp": 1475600526 }"""),
  (15, """{"device_id": 15, "device_type": "sensor-ipad", "ip": "161.188.212.254", "cca3": "USA", "cn": "United States", "temp": 27, "signal": 26, "battery_level": 5, "c02_level": 1378, "timestamp": 1475600528 }"""),
  (16, """{"device_id": 16, "device_type": "sensor-igauge", "ip": "221.3.128.242", "cca3": "CHN", "cn": "China", "temp": 10, "signal": 24, "battery_level": 6, "c02_level": 1423, "timestamp": 1475600530 }"""),
  (17, """{"device_id": 17, "device_type": "sensor-ipad", "ip": "64.124.180.215", "cca3": "USA", "cn": "United States", "temp": 38, "signal": 17, "battery_level": 9, "c02_level": 1304, "timestamp": 1475600532 }"""),
  (18, """{"device_id": 18, "device_type": "sensor-igauge", "ip": "66.153.162.66", "cca3": "USA", "cn": "United States", "temp": 26, "signal": 10, "battery_level": 0, "c02_level": 902, "timestamp": 1475600534 }"""),
  (19, """{"device_id": 19, "device_type": "sensor-ipad", "ip": "193.200.142.254", "cca3": "AUT", "cn": "Austria", "temp": 32, "signal": 27, "battery_level": 5, "c02_level": 1282, "timestamp": 1475600536 }"""))
  .toDF("id", "device").as[DeviceData]

3. How to use get_json_object()

This method has been available since Spark 1.6. It extracts a JSON object from a JSON string at the specified JSON path. Here we take part of the dataset above and pull out a few fields to assemble a new result; for example, we extract only: id, device_type, ip, and cca3.

val eventsFromJSONDF = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp": 1475600496 }"""),
  (1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp": 1475600498 }"""),
  (2, """{"device_id": 2, "device_type": "sensor-ipad", "ip": "88.36.5.1", "cca3": "ITA", "cn": "Italy", "temp": 18, "signal": 25, "battery_level": 5, "c02_level": 1372, "timestamp": 1475600500 }"""),
  (3, """{"device_id": 3, "device_type": "sensor-inest", "ip": "66.39.173.154", "cca3": "USA", "cn": "United States", "temp": 47, "signal": 12, "battery_level": 1, "c02_level": 1447, "timestamp": 1475600502 }"""),
  (4, """{"device_id": 4, "device_type": "sensor-ipad", "ip": "203.82.41.9", "cca3": "PHL", "cn": "Philippines", "temp": 29, "signal": 11, "battery_level": 0, "c02_level": 983, "timestamp": 1475600504 }"""),
  (5, """{"device_id": 5, "device_type": "sensor-istick", "ip": "204.116.105.67", "cca3": "USA", "cn": "United States", "temp": 50, "signal": 16, "battery_level": 8, "c02_level": 1574, "timestamp": 1475600506 }"""),
  (6, """{"device_id": 6, "device_type": "sensor-ipad", "ip": "220.173.179.1", "cca3": "CHN", "cn": "China", "temp": 21, "signal": 18, "battery_level": 9, "c02_level": 1249, "timestamp": 1475600508 }"""),
  (7, """{"device_id": 7, "device_type": "sensor-ipad", "ip": "118.23.68.227", "cca3": "JPN", "cn": "Japan", "temp": 27, "signal": 15, "battery_level": 0, "c02_level": 1531, "timestamp": 1475600512 }"""),
  (8, """{"device_id": 8, "device_type": "sensor-inest", "ip": "208.109.163.218", "cca3": "USA", "cn": "United States", "temp": 40, "signal": 16, "battery_level": 9, "c02_level": 1208, "timestamp": 1475600514 }"""),
  (9, """{"device_id": 9, "device_type": "sensor-ipad", "ip": "88.213.191.34", "cca3": "ITA", "cn": "Italy", "temp": 19, "signal": 11, "battery_level": 0, "c02_level": 1171, "timestamp": 1475600516 }"""))
  .toDF("id", "json")

Test and output:

val jsDF = eventsFromJSONDF.select(
  $"id",
  get_json_object($"json", "$.device_type").alias("device_type"),
  get_json_object($"json", "$.ip").alias("ip"),
  get_json_object($"json", "$.cca3").alias("cca3"))

jsDF.printSchema
jsDF.show
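
Since get_json_object() is also exposed as a SQL function, the same extraction can be written against a temporary view (the view name eventsTable below is arbitrary); a minimal equivalent sketch:

// Same extraction in SQL: get_json_object takes a column and a JSON path
eventsFromJSONDF.createOrReplaceTempView("eventsTable")
spark.sql("select id, get_json_object(json, '$.device_type') as device_type, get_json_object(json, '$.ip') as ip, get_json_object(json, '$.cca3') as cca3 from eventsTable").show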

4. How to use from_json()

Unlike get_json_object, this method uses a schema to extract individual columns. By calling from_json() inside a Dataset's select, I can extract the fields of a JSON string into DataFrame columns according to the specified schema. Moreover, we can treat all the attributes and values in the JSON as one devices entity: we can not only fetch a specific value with devices.attribute, but also use the * wildcard.

The example below does the following:

A) Use the schema above to extract the attributes and values from the JSON string and treat them as individual columns of devices.

B) Select all columns.

C) Use the dot notation to fetch particular columns and filter on them.

val devicesDF = eventsDS
  .select(from_json($"device", jsonSchema) as "devices")
  .filter($"devices.temp" > 10 and $"devices.signal" > 15)
  .select($"devices.*")
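
If only a few attributes are needed rather than the * wildcard, the devices.attribute notation mentioned above also works directly in select; a minimal sketch:

// Select individual nested attributes instead of flattening everything
val subsetDF = eventsDS
  .select(from_json($"device", jsonSchema) as "devices")
  .select($"devices.device_type", $"devices.ip", $"devices.cca3")
subsetDF.show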

5. How to use to_json()

Next, use to_json() to turn the extracted data back into JSON, so the result can be written back to Kafka or saved as Parquet files.

val stringJsonDF = eventsDS.select(to_json(struct($"*"))).toDF("devices")
stringJsonDF.show

Save the data to Kafka:

stringJsonDF.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "iot-devices")
  .save()

Note the required dependency:

groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.11
version = 2.1.0
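
To close the loop with the Structured Streaming support mentioned in the introduction, the same topic can be read back as a stream and parsed with from_json(). A minimal sketch, assuming the same local broker and the jsonSchema defined earlier:

// Kafka delivers value as binary, so cast it to string before parsing
val kafkaStreamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "iot-devices")
  .load()

val parsedDF = kafkaStreamDF
  .select(from_json($"value".cast("string"), jsonSchema) as "device")
  .select($"device.*")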

6. How to use selectExpr()

Another way of converting columns is the selectExpr() utility function. For example, we can cast the id column to INT and the device column (the JSON string) to STRING:

val stringsDF = eventsDS.selectExpr("CAST(id AS INT)", "CAST(device AS STRING)")
stringsDF.show

Another use of selectExpr() is to pass expressions as arguments, which it evaluates into the specified columns. For example:

devicesDF.selectExpr("c02_level", "round(c02_level/temp) as ratio_c02_temperature")
  .orderBy($"ratio_c02_temperature".desc)
  .show

The same query is just as easy to write as a Spark SQL statement: first register a temporary view, then write the SQL.

devicesDF.createOrReplaceTempView("devicesDFT")
spark.sql("select c02_level, round(c02_level/temp) as ratio_c02_temperature from devicesDFT order by ratio_c02_temperature desc").show

7. Verification

To verify that converting our DataFrame to JSON strings succeeded, we write the result to local disk.

stringJsonDF.write.mode("overwrite").format("parquet").save("file:///opt/jules")

Read it back:

val parquetDF = spark.read.parquet("file:///opt/jules")
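
A quick look at the schema and contents confirms the round trip:

// The devices column should contain the original JSON strings
parquetDF.printSchema
parquetDF.show(false)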

1. Setup

Maps and nested JSON inside a JSON document are entirely normal. This second part of the article shows by example how to use Spark to parse data containing such complex nested structures and maps. The real-world scenario: device monitoring events, CO2 safety levels, high-temperature readings, and so on, generated continuously and requiring timely alerting.

1) Define the schema

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("dc_id", StringType)  // data center where data was posted to Kafka cluster
  .add("source",             // info about the source of alarm
    MapType(                 // define this as a Map(key -> value)
      StringType,
      new StructType()
        .add("description", StringType)
        .add("ip", StringType)
        .add("id", LongType)
        .add("temp", LongType)
        .add("c02_level", LongType)
        .add("geo",
          new StructType()
            .add("lat", DoubleType)
            .add("long", DoubleType))))

2) Prepare the data

val dataDS = Seq("""
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": { "lat": 38.00, "long": 97.00 }
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": { "lat": 47.41, "long": -122.00 }
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": { "lat": 33.61, "long": -111.89 }
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": { "lat": 35.93, "long": -85.46 }
    }
  }
}""").toDS()

// should only be one item
dataDS.count()

3) Process the data

val df = spark.read.schema(schema).json(dataDS.rdd)

Look at the schema:

df.printSchema

2. How to use explode()

The explode() method has been around since Spark 1.3. Here it is used to extract nested data structures; in some scenarios it is combined with to_json and from_json.

explode creates a new row for every element of the given map. In the data prepared above, source is a map, so each of its key/value pairs becomes a separate row.

val explodedDF = df.select($"dc_id", explode($"source"))
explodedDF.printSchema

Take a look at the schema after the operation, then extract the inner data:

case class DeviceAlert(dcId: String, deviceType: String, ip: String, deviceId: Long, temp: Long, c02_level: Long, lat: Double, lon: Double)

val notifydevicesDS = explodedDF.select(
    $"dc_id" as "dcId",
    $"key" as "deviceType",
    'value.getItem("ip") as 'ip,
    'value.getItem("id") as 'deviceId,
    'value.getItem("c02_level") as 'c02_level,
    'value.getItem("temp") as 'temp,
    'value.getItem("geo").getItem("lat") as 'lat, // note: embedded fields require yet another level of fetching
    'value.getItem("geo").getItem("long") as 'lon)
  .as[DeviceAlert] // return as a Dataset

Look at the schema:

notifydevicesDS.printSchema
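
With a typed Dataset of DeviceAlert in hand, the alerting scenario from the introduction reduces to a filter; a sketch, where 1400 is a hypothetical CO2 threshold:

// Flag devices whose CO2 level exceeds a (hypothetical) threshold of 1400
val alertsDS = notifydevicesDS.filter($"c02_level" > 1400)
alertsDS.show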

3. Getting more complex

In IoT scenarios, devices typically send many JSON events to a collector. The collector can be a nearby data center, a nearby aggregator, or a device installed in the home that periodically ships its data over the encrypted internet to a remote data center. Put plainly: the data format gets more complex.

The data below contains three maps: thermostats, cameras, and smoke/CO alarms.

import org.apache.spark.sql.types._

// a bit longish, nested, and convoluted JSON schema
val nestSchema2 = new StructType()
  .add("devices",
    new StructType()
      .add("thermostats", MapType(StringType,
        new StructType()
          .add("device_id", StringType)
          .add("locale", StringType)
          .add("software_version", StringType)
          .add("structure_id", StringType)
          .add("where_name", StringType)
          .add("last_connection", StringType)
          .add("is_online", BooleanType)
          .add("can_cool", BooleanType)
          .add("can_heat", BooleanType)
          .add("is_using_emergency_heat", BooleanType)
          .add("has_fan", BooleanType)
          .add("fan_timer_active", BooleanType)
          .add("fan_timer_timeout", StringType)
          .add("temperature_scale", StringType)
          .add("target_temperature_f", DoubleType)
          .add("target_temperature_high_f", DoubleType)
          .add("target_temperature_low_f", DoubleType)
          .add("eco_temperature_high_f", DoubleType)
          .add("eco_temperature_low_f", DoubleType)
          .add("away_temperature_high_f", DoubleType)
          .add("away_temperature_low_f", DoubleType)
          .add("hvac_mode", StringType)
          .add("humidity", LongType)
          .add("hvac_state", StringType)
          .add("is_locked", StringType)
          .add("locked_temp_min_f", DoubleType)
          .add("locked_temp_max_f", DoubleType)))
      .add("smoke_co_alarms", MapType(StringType,
        new StructType()
          .add("device_id", StringType)
          .add("locale", StringType)
          .add("software_version", StringType)
          .add("structure_id", StringType)
          .add("where_name", StringType)
          .add("last_connection", StringType)
          .add("is_online", BooleanType)
          .add("battery_health", StringType)
          .add("co_alarm_state", StringType)
          .add("smoke_alarm_state", StringType)
          .add("is_manual_test_active", BooleanType)
          .add("last_manual_test_time", StringType)
          .add("ui_color_state", StringType)))
      .add("cameras", MapType(StringType,
        new StructType()
          .add("device_id", StringType)
          .add("software_version", StringType)
          .add("structure_id", StringType)
          .add("where_name", StringType)
          .add("is_online", BooleanType)
          .add("is_streaming", BooleanType)
          .add("is_audio_input_enabled", BooleanType)
          .add("last_is_online_change", StringType)
          .add("is_video_history_enabled", BooleanType)
          .add("web_url", StringType)
          .add("app_url", StringType)
          .add("is_public_share_enabled", BooleanType)
          .add("activity_zones",
            new StructType()
              .add("name", StringType)
              .add("id", LongType))
          .add("last_event", StringType))))

The corresponding data:

val nestDataDS2 = Seq("""{
  "devices": {
    "thermostats": {
      "peyiJNo0IldT2YlIVtYaGQ": {
        "device_id": "peyiJNo0IldT2YlIVtYaGQ",
        "locale": "en-US",
        "software_version": "4.0",
        "structure_id": "VqFabWH21nwVyd4RWgJgNb292wa7hG_dUwo2i2SG7j3-BOLY0BA4sw",
        "where_name": "Hallway Upstairs",
        "last_connection": "2016-10-31T23:59:59.000Z",
        "is_online": true,
        "can_cool": true,
        "can_heat": true,
        "is_using_emergency_heat": true,
        "has_fan": true,
        "fan_timer_active": true,
        "fan_timer_timeout": "2016-10-31T23:59:59.000Z",
        "temperature_scale": "F",
        "target_temperature_f": 72,
        "target_temperature_high_f": 80,
        "target_temperature_low_f": 65,
        "eco_temperature_high_f": 80,
        "eco_temperature_low_f": 65,
        "away_temperature_high_f": 80,
        "away_temperature_low_f": 65,
        "hvac_mode": "heat",
        "humidity": 40,
        "hvac_state": "heating",
        "is_locked": true,
        "locked_temp_min_f": 65,
        "locked_temp_max_f": 80
      }
    },
    "smoke_co_alarms": {
      "RTMTKxsQTCxzVcsySOHPxKoF4OyCifrs": {
        "device_id": "RTMTKxsQTCxzVcsySOHPxKoF4OyCifrs",
        "locale": "en-US",
        "software_version": "1.01",
        "structure_id": "VqFabWH21nwVyd4RWgJgNb292wa7hG_dUwo2i2SG7j3-BOLY0BA4sw",
        "where_name": "Jane's Room",
        "last_connection": "2016-10-31T23:59:59.000Z",
        "is_online": true,
        "battery_health": "ok",
        "co_alarm_state": "ok",
        "smoke_alarm_state": "ok",
        "is_manual_test_active": true,
        "last_manual_test_time": "2016-10-31T23:59:59.000Z",
        "ui_color_state": "gray"
      }
    },
    "cameras": {
      "awJo6rH0IldT2YlIVtYaGQ": {
        "device_id": "awJo6rH",
        "software_version": "4.0",
        "structure_id": "VqFabWH21nwVyd4RWgJgNb292wa7hG_dUwo2i2SG7j3-BOLY0BA4sw",
        "where_name": "Foyer",
        "is_online": true,
        "is_streaming": true,
        "is_audio_input_enabled": true,
        "last_is_online_change": "2016-12-29T18:42:00.000Z",
        "is_video_history_enabled": true,
        "web_url": "https://home.nest.com/cameras/device_id?auth=access_token",
        "app_url": "nestmobile://cameras/device_id?auth=access_token",
        "is_public_share_enabled": true,
        "activity_zones": { "name": "Walkway", "id": 244083 },
        "last_event": "2016-10-31T23:59:59.000Z"
      }
    }
  }
}""").toDS

Having created a simple Dataset, we can use all of the Dataset methods for ETL, such as from_json(), to_json(), explode() and selectExpr().

1) Read the data using the schema:

val nestDF2 = spark     // spark session
  .read                 // get DataFrameReader
  .schema(nestSchema2)  // use the schema defined above and read format as JSON
  .json(nestDataDS2.rdd)

2) Convert the whole JSON object into a JSON string:

val stringJsonDF = nestDF2.select(to_json(struct($"*"))).toDF("nestDevice")
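
Peek at the single serialized row (truncation disabled so the full string is visible):

stringJsonDF.show(false)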

3) Convert the three maps of JSON objects into three separate map columns; explode() can then be used to access their attributes.

val mapColumnsDF = nestDF2.select(
  $"devices".getItem("smoke_co_alarms").alias("smoke_alarms"),
  $"devices".getItem("cameras").alias("cameras"),
  $"devices".getItem("thermostats").alias("thermostats"))

Explode them into three DataFrames:

val explodedThermostatsDF = mapColumnsDF.select(explode($"thermostats"))
val explodedCamerasDF = mapColumnsDF.select(explode($"cameras"))
// or you could use the original nestDF2 and the devices.X notation
val explodedSmokedAlarmsDF = nestDF2.select(explode($"devices.smoke_co_alarms"))

Look at the schema:

explodedThermostatsDF.printSchema

Access the elements inside the three maps:

val thermostateDF = explodedThermostatsDF.select(
  $"value".getItem("device_id").alias("device_id"),
  $"value".getItem("locale").alias("locale"),
  $"value".getItem("where_name").alias("location"),
  $"value".getItem("last_connection").alias("last_connected"),
  $"value".getItem("humidity").alias("humidity"),
  $"value".getItem("target_temperature_f").alias("target_temperature_f"),
  $"value".getItem("hvac_mode").alias("mode"),
  $"value".getItem("software_version").alias("version"))

val cameraDF = explodedCamerasDF.select(
  $"value".getItem("device_id").alias("device_id"),
  $"value".getItem("where_name").alias("location"),
  $"value".getItem("software_version").alias("version"),
  $"value".getItem("activity_zones").getItem("name").alias("name"),
  $"value".getItem("activity_zones").getItem("id").alias("id"))

val smokedAlarmsDF = explodedSmokedAlarmsDF.select(
  $"value".getItem("device_id").alias("device_id"),
  $"value".getItem("where_name").alias("location"),
  $"value".getItem("software_version").alias("version"),
  $"value".getItem("last_connection").alias("last_connected"),
  $"value".getItem("battery_health").alias("battery_health"))

Look at the data:

cameraDF.show

Join on version:

val joineDFs = thermostateDF.join(cameraDF, "version")
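
In the sample data the thermostat and the camera both report software_version 4.0, so the join should yield one combined row:

joineDFs.printSchema
joineDFs.show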

4. Summary

The point of this article is to introduce a few handy tools for getting at complex, nested JSON data. Once the nested data has been flattened, accessing it is no different from working with any ordinary data format.