前言
在数据存储时研发端经常为了方便很多字段为了冗余存储了文本格式:json或者固定分隔符,但是数仓和数据分析在处理的时候又要进行取出单独的某个字段,进行使用,本文会详解下平时在工作中遇到的问题与处理办法。
字符串固定分隔符处理
案例:
字符串以:|*|分隔,k-v以:分隔,取出其中scene_ids字段值
字符串:
|*|cityId:201|*|qId:4890f057-8c8d-4fd6-8029-9ec8a374ab3b|*|st:poi|*|sId:62778388|*|scene_ids:6,0|*|pos:0|*|hId:92091235|*|Tp:default|*|page_index:0|*|page_size:20|*
测试表:
with tab as (
select
'|*|cityId:201|*|qId:4890f057-8c8d-4fd6-8029-9ec8a374ab3b|*|st:poi|*|sId:62778388|*|scene_ids:6,0|*|pos:0|*|hId:92091235|*|Tp:default|*|page_index:0|*|page_size:20|*|' as cl_name
)
方法一:正则表达处理
regexp_extract(str, regexp[, idx])
str是被解析的字符串或字段名
regexp 是正则表达式
idx是返回结果 取表达式的哪一部分 默认值为1。
0表示把整个正则表达式对应的结果全部返回
1表示返回正则表达式中第一个() 对应的结果 以此类推
注:使用标识符是需要\转义,比如正则数字\d需要使用\d
select cl_name,regexp_extract(cl_name,'scene_ids:(\\d+)',1) as scene_ids from tab
结果:
scene_ids
6
方法二:先转化成map再取key
str_to_map(text[, delimiter1, delimiter2])
Delimiter1将文本分成K-V对,Delimiter2分割每个K-V对。
对于delimiter1默认分隔符是’,’,对于delimiter2默认分隔符是’=’
select cl_name,str_to_map(cl_name,'\\|*\\|',':')['scene_ids'] as scene_ids from tab
结果:
scene_ids
6,0
方法三:使用split分隔函数
split(string str, string pat)
返回值: array 说明: 按照pat字符串分割str,会返回分割后的字符串数组
select cl_name,split(cl_name,'\\\|\\\*\\\|')[5] as scene_ids from tab
结果:
scene_ids
scene_ids:6,0
上面结果可以再使用split或str_to_map处理下
方法四:字段截取再分隔 split+substr+instr/locate
select cl_name,split(substr(cl_name,instr(cl_name,'scene_ids'),length(cl_name)),'\\\|\\\*\\\|')[0] as scene_ids from tab
结果:
scene_ids
scene_ids:6,0
json数组的解析
针对json可以使用 get_json_object、json_tuple、lateral view、explode
核心函数:get_json_object
get_json_object解析:
A limited version of JSONPath is supported: json解析支持操作
- $ : Root object 根对象
- . : Child operator 子操作符
- [] : Subscript operator for array 数组下标操作符
- : Wildcard for [] 数组通配符
Syntax not supported that’s worth noticing: 不支持语法
- : Zero length string as key 空字符串为key
- … : Recursive descent 递归下降
- @ : Current object/element 当前对象/元素
- () : Script expression 脚本表达式
- ?() : Filter (script) expression. 筛选脚本表达式
- [,] : Union operator 联合操作符
- [start:end.step] : array slice operator 数组切片操作
json数组get_json_object使用
{
"base_1":[
{
"penalty_1":0,
"mile_1":0,
"penalty_A_1":0,
"exchange_1":1,
"TypeDesc":"免费取消",
"beginTime":"1970-01-01 00:00:00",
"endTime":"2021-06-04 20:00:00",
"Type_s":0
},
{
"penalty_1":79,
"mile_1":0,
"penalty_A_1":79,
"exchange_1":1,
"TypeDesc":"不可取消",
"beginTime":"2021-06-04 20:00:00",
"endTime":"2099-12-31 00:00:00",
"Type_s":2
}
],
"base_2":[
{
"penalty_1":0,
"mile_1":0,
"penalty_A_1":0,
"exchange_1":1,
"TypeDesc":"免费取消",
"beginTime":"1970-01-01 00:00:00",
"endTime":"2021-06-04 20:00:00",
"Type_s":0
},
{
"penalty_1":71.28,
"mile_1":0,
"penalty_A_1":71.28,
"exchange_1":1,
"TypeDesc":"不可取消",
"beginTime":"2021-06-04 20:00:00",
"endTime":"2099-12-31 00:00:00",
"Type_s":2
}
]
}
生成测试表
with tab as
(select '''{"base_1":[{"penalty_1":0,"mile_1":0,"penalty_A_1":0,"exchange_1":1,"TypeDesc":"免费取消","beginTime":"1970-01-01 00:00:00","endTime":"2021-06-04 20:00:00","Type_s":0},{"penalty_1":79,"mile_1":0,"penalty_A_1":79,"exchange_1":1,"TypeDesc":"不可取消","beginTime":"2021-06-04 20:00:00","endTime":"2099-12-31 00:00:00","Type_s":2}],"base_2":[{"penalty_1":0,"mile_1":0,"penalty_A_1":0,"exchange_1":1,"TypeDesc":"免费取消","beginTime":"1970-01-01 00:00:00","endTime":"2021-06-04 20:00:00","Type_s":0},{"penalty_1":71.28,"mile_1":0,"penalty_A_1":71.28,"exchange_1":1,"TypeDesc":"不可取消","beginTime":"2021-06-04 20:00:00","endTime":"2099-12-31 00:00:00","Type_s":2}]}'''
as json_col
)
查询代码
select json_col
,get_json_object(json_col,'$.base_1') as col -- 解析json数组内字段
,get_json_object(json_col,'$.base_1.TypeDesc') as col_1 -- 解析json内嵌套json字段
,get_json_object(json_col,'$.base_1.TypeDesc\[0]') as col_2 -- 解析json数组数据
,col_3 -- 解析嵌套json内数组字段表处理
from tab
LATERAL VIEW explode(split(regexp_extract(get_json_object(json_col,'$.base_1.TypeDesc'),'^\\["(.*)\\"]$',1),'","')) tab as col_3
-- regexp_extract 正则字段内容去除引号
结果呈现:
json嵌套数组解析指定字段所有值
数据还是使用上述案例,查询所有TypeDesc字段,解析思路:找出所有对应的"key":“value”
- 将对应的字段 “key”:“value"中“key"作为分隔符split成数组,行成value”…的数据
- 再用数组lateral view列转行后,利用正则取value”…结果中的value值
- 确定数组的[0]无对应的value值,可以使用 where 条件 like '"%'解决
查询代码
select
json_col
,TypeDesc as mid_step -- 用字段名做key:value转化
,regexp_extract(TypeDesc,'^"(.*?(?="))',1) TypeDesc -- 获取字段值引号之间数据
from tab
LATERAL view explode(split(json_col,'TypeDesc":')) t1 as TypeDesc
where TypeDesc like '"%'
结果展示
数组json解析
数组json串:[{"a_key":6,"c_key":"v1"},{"a_key":6,"c_key":"v1"}]
数组json解析,两个思路,一种方法将数组json转换成json数组,另外一种将数组按照分隔符转换成数数组每条json解析。
测试表:
with tab as (select '[{"a_key":6,"c_key":"v1"},{"a_key":6,"c_key":"v1"}]' as cl_name)
查询语句:
select explode( -- array 行转多列
split( -- 分隔成array
regexp_replace(
regexp_replace(cl_name,'\\}\\,\\{','\\}\\;\\{') -- 将json数组},{中的,转化成};{
,'\\[|\\]','') -- 将json数组转化成json,把[]去除
,'\\;') -- split按照;分隔成array
) as cl_name from tab
结果展示
string转array
很多函数可能是针对array进行的操作比如explode,但是呢array数据同步到hdfs很多往往都是存的string的格式,比如[“我是谁?”,“我再哪?”],导致了一系列的尴尬,存在引号问题,转换成array则方便进行转行和数组函数。
查询语句
with tab as (
select '["我是谁?","我再哪?"]' as string_array
)
select split(regexp_extract(string_array,'^\\["(.*)\\"]$',1),'","') as string_array
,split(regexp_extract(string_array,'^\\["(.*)\\"]$',1),'","')[0] as col_1
,split(regexp_extract(string_array,'^\\["(.*)\\"]$',1),'","')[1] as col_2
from tab
结果展示