本文总结一些常用的字符串函数。还是在databricks社区版。
- 字符串截取函数:substr \ substring
- 字符串的长度函数 len \ length
- 字符串定位函数 instr
- 字符串分割函数 split \ split_part
- 字符串去空格函数:trim \ ltrim \ rtrim
- 字符串补足函数:lpad \ rpad
- 字符串拼接函数: concat \ concat_ ws
- 字符串替换函数: replace \ regexp_replace
- 正则表达式相关函数:regexp
- 字符串模糊匹配函数: like \ rlike
- 字符串转大小写函数: lower Icase \ upper \ ucase
- json解析函数get json_object
- 重复字符串函数:repeat
- URL解析函数:parse url
- 集合查找函数:find_in_set
- 字符串反转函数:reverse
CREATE TABLE temp (id int,name string,email string,phone string)
INSERT INTO temp VALUES
(1, 'John Doe', 'john.doe@example.com', '123-456-7890'),
(2, 'Jane Smith', 'jane.smith@example.com', '555-555-5555'),
(3, 'Bob Johnson', 'bob.johnson@example.com', '555-123-4567'),
(4, 'Alice Brown', 'alice.brown@example.com', '555-987-6543'),
(5, 'Charlie Davis', 'charlie.davis@example.com', '555-111-2222');
id | name | phone | |
1 | John Doe | john.doe@example.com | 123-456-7890 |
2 | Jane Smith | jane.smith@example.com | 555-555-5555 |
3 | Bob Johnson | bob.johnson@example.com | 555-123-4567 |
4 | Alice Brown | alice.brown@example.com | 555-987-6543 |
5 | Charlie Davis | charlie.davis@example.com | 555-111-2222 |
字符串截取函数:substr \ substring
语法1: substr(string A, int start),substring(string A, int start)
返回值: string
**说明:**返回字符串A从start位置到结尾的字符串
语法2: substr(string A, int start, int len),substring(string A, int start, int len)
返回值: string
**说明:**返回字符串A从start位置开始,长度为len的字符串
select t1.*,substr(t1.phone,1,3) as area_number
from temp as t1
id | name | phone | area_number | |
1 | John Doe | john.doe@example.com | 123-456-7890 | 123 |
2 | Jane Smith | jane.smith@example.com | 555-555-5555 | 555 |
3 | Bob Johnson | bob.johnson@example.com | 555-123-4567 | 555 |
4 | Alice Brown | alice.brown@example.com | 555-987-6543 | 555 |
5 | Charlie Davis | charlie.davis@example.com | 555-111-2222 | 555 |
substring
substring(str, pos[, len]) - Returns the substring of
str
that starts atpos
and is of lengthlen
, or the slice of byte array that starts atpos
and is of lengthlen
.substring(str FROM pos[ FOR len]]) - Returns the substring of
str
that starts atpos
and is of lengthlen
, or the slice of byte array that starts atpos
and is of lengthlen
.substring(str, pos[, len])-返回从pos开始且长度为Len的str子字符串,或从pos开始且长度为len的字节数组切片。
substring(str FROM pos[ FOR len]])-返回str的子字符串,从pos开始,是长度len,或从pos开始且长度为en的字节数组切片
select
t1.phone
,substring(t1.phone,1,3) as area_num
,substring(t1.phone,9)as tail_num
,substring(t1.phone,-4) as tail_num
,substring(t1.phone from -4) as tail_num
,substring(t1.phone from 5 for 1) as mid_num -- 从第5位开始的第一个
,substring(t1.phone from 5 for 3) as mid_num
from temp as t1
phone | area_num | tail_num | tail_num | tail_num | mid_num | mid_num |
123-456-7890 | 123 | 7890 | 7890 | 7890 | 4 | 456 |
555-555-5555 | 555 | 5555 | 5555 | 5555 | 5 | 555 |
555-123-4567 | 555 | 4567 | 4567 | 4567 | 1 | 123 |
555-987-6543 | 555 | 6543 | 6543 | 6543 | 9 | 987 |
555-111-2222 | 555 | 2222 | 2222 | 2222 | 1 | 111 |
substring_index
substring_index(str, delim, count) - Returns the substring from
str
beforecount
occurrences of the delimiterdelim
. Ifcount
is positive, everything to the left of the final delimiter (counting from the left) is returned. Ifcount
is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching fordelim
.substring_index(str, Delim,count)-返回str中的子串,然后才出现分隔符的count。如果count为正,则返回分隔符左侧的所有内容(从左侧开始计数)。如果count为负,则返回最终分隔符右侧的所有内容(从右侧开始计数)。该函数substring_index在搜索Delim时执行区分大小写的匹配。
select
t1.email
,substring_index(t1.email,'.',1) as name
,substring_index(t1.email,'.',-1) as com
,substring_index(t1.email,'@',1) as before
,substring_index(t1.email,'@',-1) as after
from temp as t1
name | com | before | after | |
john.doe@example.com | john | com | john.doe | |
jane.smith@example.com | jane | com | jane.smith | |
bob.johnson@example.com | bob | com | bob.johnson | |
alice.brown@example.com | alice | com | alice.brown | |
charlie.davis@example.com | charlie | com | charlie.davis |
SELECT substring_index('www.apache.org', '.', 2)as wangzhi
-- www.apache
字符串的长度函数 len \ length
select
name
,len(name) as l_name
,length(name) as l_name
from temp
name | l_name | l_name |
John Doe | 8 | 8 |
Jane Smith | 10 | 10 |
Bob Johnson | 11 | 11 |
Alice Brown | 11 | 11 |
Charlie Davis | 13 | 13 |
字符串定位函数 instr
-- 注意细节
select
email
,instr(email,'@') as in_a
,len(email) as l_email
,substr(email,instr(email,'@')) as e_before
,substr(email,instr(email,'@'),len(email)) as e_before
,substr(email,instr(email,'@')+1,len(email)) as e_com
,substr(email,1,instr(email,'@')-1) as e_after
from temp
in_a | l_email | e_before | e_before | e_com | e_after | |
john.doe@example.com | 9 | 20 | @example.com | @example.com | john.doe | |
jane.smith@example.com | 11 | 22 | @example.com | @example.com | jane.smith | |
bob.johnson@example.com | 12 | 23 | @example.com | @example.com | bob.johnson | |
alice.brown@example.com | 12 | 23 | @example.com | @example.com | alice.brown | |
charlie.davis@example.com | 14 | 25 | @example.com | @example.com | charlie.davis |
字符串分割 split \ split_part
split(str, regex, limit) - Splits
str
around occurrences that matchregex
and returns an array with a length of at mostlimit
Arguments:
- str - a string expression to split.
- regex - a string representing a regular expression. The regex string should be a Java regular expression.
- limit - an integer expression which controls the number of times the regex is applied.
- limit > 0: The resulting array’s length will not be more than
limit
, and the resulting array’s last entry will contain all input beyond the last matched regex.- limit <= 0:
regex
will be applied as many times as possible, and the resulting array can be of any size.拆分(str, regex,limited)-将str拆分为与regex匹配的事件,并返回一个长度最多为限制的数组
论据:
- str-要拆分的字符串表达式。
- 正则表达式-表示正则表达式的字符串。正则表达式字符串应该是一个Java的正则表达式。
- 限制-控制应用正则表达式的次数的整数表达式。
- 限制>0:结果数组的长度不会超过限制,并且结果数组的最后一个条目将包含超过最后一个匹配正则表达式的所有输入。
- limited<=0:正则表达式将被应用尽可能多的次数,并且生成的数组可以是任何大小。
split
select
email
,split(email,'example.') as e_before
,split(email,'@') as e_before
from temp12
e_before | e_before | |
john.doe@example.com | [“john.doe@”,“com”] | [“john.doe”,“example.com”] |
jane.smith@example.com | [“jane.smith@”,“com”] | [“jane.smith”,“example.com”] |
bob.johnson@example.com | [“bob.johnson@”,“com”] | [“bob.johnson”,“example.com”] |
alice.brown@example.com | [“alice.brown@”,“com”] | [“alice.brown”,“example.com”] |
charlie.davis@example.com | [“charlie.davis@”,“com”] | [“charlie.davis”,“example.com”] |
select
split('001_prod_7532_20230510','_',1) as e_1
,split('001_prod_7532_20230510','_',2) as e_2
,split('001_prod_7532_20230510','_',3) as e_3
-- 但是不是我想要的形式
e_1 | e_2 | e_3 |
[“001_prod_7532_20230510”] | [“001”,“prod_7532_20230510”] | [“001”,“prod”,“7532_20230510”] |
split_part
split_part(str, delimiter, partNum) - Splits str by delimiter and return requested part of the split (1-based). If any input is null, returns null. if partNum is out of range of split parts, returns empty string. If partNum is 0, throws an error. If partNum is negative, the parts are counted backward from the end of the string. If the delimiter is an empty string, the str is not split.
split_part(str, delimiter, partNum) -按分隔符拆分str并返回拆分的请求部分(基于1)。如果任何输入为空,则返回空。如果partNum超出分割部分的范围,则返回空字符串。如果partNum为0,则抛出错误。如果partNum为负数,则从字符串末尾开始向后计数。如果分隔符是空字符串,则不拆分该str。
-- 这下就对了
select
email
,split_part(email,'.',1) as e_before
,split_part(email,'.',2) as e_before
from temp12
e_before | e_before | |
john.doe@example.com | john | doe@example |
jane.smith@example.com | jane | smith@example |
bob.johnson@example.com | bob | johnson@example |
alice.brown@example.com | alice | brown@example |
charlie.davis@example.com | charlie | davis@example |
select
split_part('001_prod_7532_20230510','_',1) as e_1
,split_part('001_prod_7532_20230510','_',2) as e_2
,split_part('001_prod_7532_20230510','_',3) as e_3
e_1 | e_2 | e_3 |
001 | prod | 7532 |
去空格函数:trim \ ltrim \ rtrim
trim(str) - Removes the leading and trailing space characters from
str
.trim(BOTH FROM str) - Removes the leading and trailing space characters from
str
.trim(LEADING FROM str) - Removes the leading space characters from
str
.trim(TRAILING FROM str) - Removes the trailing space characters from
str
.trim(trimStr FROM str) - Remove the leading and trailing
trimStr
characters fromstr
.trim(BOTH trimStr FROM str) - Remove the leading and trailing
trimStr
characters fromstr
.trim(LEADING trimStr FROM str) - Remove the leading
trimStr
characters fromstr
.trim(TRAILING trimStr FROM str) - Remove the trailing
trimStr
characters fromstr
.Arguments:
- str - a string expression
- trimStr - the trim string characters to trim, the default value is a single space
- BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
- LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
- TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string
trim(str) -从str中删除前后空格字符
trim(BOTH FROM str) -从str中删除前导和尾随空格字符。
trim(LEADING FROM str) -从str中删除前导空格字符
trim(TRAILING FROM str) -从str中删除尾随空格字符
trim(trimStr FROM str) -从str中删除前后的trimStr字符
trim(BOTH trimStr FROM str) -从str中删除开头和结尾的trinstr字符。
trim(LEADING trimStr FROM str) -从str中删除前导trimStr字符
trim(TRAILING trimStr FROM str) -从str中删除尾随的trinstr字符
参数:
•STR—字符串表达式
•trim Str -要修剪的字符串字符,默认值是一个空格•BOTH, FROM -这些关键字用于指定从字符串的两端修剪字符串字符
•LEADING, FROM -这些关键字用于指定从字符串的左端修剪字符串字符
•TRAILING, FROM -这些关键字用于指定从字符串的右端修剪字符串字符
trim
select
lizi
,length(lizi) as l_lizi
,trim(lizi) as lizi1
,ltrim(lizi) as l_trim
,rtrim(lizi) as r_trim
,len(ltrim(lizi)) as l_trim_l
,len(rtrim(lizi)) as r_trim_l
,trim(both from lizi) as lizi2
,trim(LEADING from lizi) as lizi3
,trim(TRAILING from lizi) as lizi4
from
(
select ' SparkSQLHive ' as lizi
union all
select ' SparkSQLHive ' as lizi
union all
select ' SparkSQLHive ' as lizi
union all
select ' SparkSQLHive ' as lizi
)
lizi | l_lizi | lizi1 | l_trim | r_trim | l_trim_l | r_trim_l | lizi2 | lizi3 | lizi4 |
SparkSQLHive | 15 | SparkSQLHive | SparkSQLHive | SparkSQLHive | 14 | 13 | SparkSQLHive | SparkSQLHive | SparkSQLHive |
SparkSQLHive | 15 | SparkSQLHive | SparkSQLHive | SparkSQLHive | 14 | 13 | SparkSQLHive | SparkSQLHive | SparkSQLHive |
SparkSQLHive | 15 | SparkSQLHive | SparkSQLHive | SparkSQLHive | 14 | 13 | SparkSQLHive | SparkSQLHive | SparkSQLHive |
SparkSQLHive | 15 | SparkSQLHive | SparkSQLHive | SparkSQLHive | 14 | 13 | SparkSQLHive | SparkSQLHive | SparkSQLHive |
select
lizi
,trim('SL' from lizi) as lizi5
,trim(LEADING 'SL' from lizi) as lizi6
,trim(TRAILING 'SL' from lizi) as lizi7
from
(
select 'SSparkSQLS' as lizi
)
lizi | lizi5 | lizi6 | lizi7 |
SSparkSQLS | parkSQ | parkSQLS | SSparkSQ |
补足函数:lpad \ rpad
lpad(str, len[, pad]) - Returns
str
, left-padded withpad
to a length oflen
. Ifstr
is longer thanlen
, the return value is shortened tolen
characters or bytes. Ifpad
is not specified,str
will be padded to the left with space characters if it is a character string, and with zeros if it is a byte sequence.lpad(str, lenl,padl)-返回str,用pad左填充到len的长度如果str比len长,则返回值缩短为len字符或字节。如果未指定pad,如果是字符串,str将在左侧填充空格字符,如果是字节序列,则填充零。
select
name
,lpad(name,20,'20230520') as l_name
,rpad(name,20,'20230520') as r_name
,len(lpad(name,20,'20230520')) as len_name
from
temp12
拼接函数 concat \ concat_ws
concat
concat(col1, col2, …, colN) - Returns the concatenation of col1, col2, …, colN.
Concat (col1,col2,. a,colN)-返回 col1,col2,. ,CoIN 的串联。
concat_ws
concat_ws(sep[, str | array(str)]+) - Returns the concatenation of the strings separated by
sep
.Concat _ ws (sepl,str l array (str) J +)-返回由 sep 分隔的字符串的串联
select
id,name
,concat(id,name) as id_name
,concat(id,'_',name) as id_name2
,concat_ws(id,'_',name) as id_name3
,concat_ws('_',id,name) as id_name4
,id||'_'||name as id_name5
from
temp12
id | name | id_name | id_name2 | id_name3 | id_name4 | id_name5 |
1 | John Doe | 1John Doe | 1_John Doe | _1John Doe | 1_John Doe | 1_John Doe |
2 | Jane Smith | 2Jane Smith | 2_Jane Smith | _2Jane Smith | 2_Jane Smith | 2_Jane Smith |
3 | Bob Johnson | 3Bob Johnson | 3_Bob Johnson | _3Bob Johnson | 3_Bob Johnson | 3_Bob Johnson |
4 | Alice Brown | 4Alice Brown | 4_Alice Brown | _4Alice Brown | 4_Alice Brown | 4_Alice Brown |
5 | Charlie Davis | 5Charlie Davis | 5_Charlie Davis | _5Charlie Davis | 5_Charlie Davis | 5_Charlie Davis |
替换函数 replace \ regexp_replace
replace(str, search[, replace]) - Replaces all occurrences of
search
withreplace
.Arguments:
- str - a string expression
- search - a string expression. If
search
is not found instr
,str
is returned unchanged.- replace - a string expression. If
replace
is not specified or is an empty string, nothing replaces the string that is removed fromstr
.replace(str, search[, replace]) -用replace替换所有搜索
参数:
•str -一个字符串表达式
•search—字符串表达式。如果在str中没有找到搜索,则返回不变的str。
•replace -一个字符串表达式。如果replace未指定或为空字符串,则不会替换从str中删除的字符串
select
email
,replace(email,'@') as e_12
,replace(email,'@','') as e_12
,replace(email,'@','#') as e_13
,regexp_replace(email,'@','#') as e_14
from
temp12
e_12 | e_12 | e_13 | e_14 | |
john.doe@example.com | john.doe#example.com | john.doe#example.com | ||
jane.smith@example.com | jane.smith#example.com | jane.smith#example.com | ||
bob.johnson@example.com | bob.johnson#example.com | bob.johnson#example.com | ||
alice.brown@example.com | alice.brown#example.com | alice.brown#example.com | ||
charlie.davis@example.com | charlie.davis#example.com | charlie.davis#example.com |
正则表达式相关函数:regexp
CREATE TABLE example_data (
id INT,
name VARCHAR(50),
qq_email VARCHAR(50),
website VARCHAR(50)
);
INSERT INTO example_data VALUES
(1, 'John Smith', 'john.smith@qq.com', 'www.example.com'),
(2, 'Jane Doe', 'john.smith@qq.com', 'www.example2.com'),
(3, 'Bob Johnson', 'bob.johnson@qq.com', 'www.example2.com'),
(4, 'Alice Brown', 'alice.brown@qq.com', 'www.example.com'),
(5, 'Charlie Green', 'alice.brown@qq.com', 'www.example.com'),
(6, 'David Lee', 'david.lee@qq.com', 'www.example2.com');
select * from example_data
id | name | qq_email | website |
1 | John Smith | john.smith@qq.com | |
2 | Jane Doe | john.smith@qq.com | |
3 | Bob Johnson | bob.johnson@qq.com | |
4 | Alice Brown | alice.brown@qq.com | |
5 | Charlie Green | alice.brown@qq.com | |
6 | David Lee | david.lee@qq.com |
regexp
SELECT * FROM example_data
WHERE regexp(website, '.*example2.*');
id | name | qq_email | website |
2 | Jane Doe | john.smith@qq.com | |
3 | Bob Johnson | bob.johnson@qq.com | |
6 | David Lee | david.lee@qq.com |
SELECT * FROM example_data
WHERE regexp(name, '^J.*');
id | name | qq_email | website |
1 | John Smith | john.smith@qq.com | |
2 | Jane Doe | john.smith@qq.com |
正则表达式替换函数:regexp_replace
语法: regexp_replace(string A, string B, string C)
返回值: string
**说明:**将字符串A中的符合java正则表达式B的部分替换为C。注意,在有些情况下要使用转义字符,类似oracle中的regexp_replace函数
regexp_count
SELECT id, qq_email, regexp_count(qq_email, '.*qq.*') AS count
FROM example_data
WHERE regexp(qq_email, '.*qq.*')
ORDER BY count DESC;
id | qq_email | count |
1 | john.smith@qq.com | 1 |
2 | john.smith@qq.com | 1 |
3 | bob.johnson@qq.com | 1 |
4 | alice.brown@qq.com | 1 |
5 | alice.brown@qq.com | 1 |
6 | david.lee@qq.com | 1 |
正则表达式解析函数:regexp_extract
SELECT qq_email,
regexp_extract(qq_email, '^(.+)@', 1) AS username,
regexp_extract(qq_email, '@(.+)$', 1) AS domain
FROM example_data;
qq_email | username | domain |
john.smith@qq.com | john.smith | |
john.smith@qq.com | john.smith | |
bob.johnson@qq.com | bob.johnson | |
alice.brown@qq.com | alice.brown | |
alice.brown@qq.com | alice.brown | |
david.lee@qq.com | david.lee |
SELECT qq_email,
regexp_instr(qq_email, 'o') AS position
FROM example_data
qq_email | position |
john.smith@qq.com | 2 |
john.smith@qq.com | 2 |
bob.johnson@qq.com | 2 |
alice.brown@qq.com | 9 |
alice.brown@qq.com | 9 |
david.lee@qq.com | 15 |
SELECT qq_email
FROM example_data
WHERE regexp_like(qq_email, 'smith');
qq_email |
john.smith@qq.com |
john.smith@qq.com |
模糊匹配函数 like / rlike
SELECT *
FROM example_data
WHERE qq_email LIKE '%smith@qq.com';
id | name | qq_email | website |
1 | John Smith | john.smith@qq.com | |
2 | Jane Doe | john.smith@qq.com |
SELECT *
FROM example_data
WHERE qq_email RLIKE 'brown';
id | name | qq_email | website |
4 | Alice Brown | alice.brown@qq.com | |
5 | Charlie Green | alice.brown@qq.com |
字符串转大小写函数 lower \ lcase \ upper \ ucase
select
name
,lower(name) as l_name
,lcase(name) as l_name
,upper(name) as u_name
,ucase(name) as u_name
from example_data
name | l_name | l_name | u_name | u_name |
John Smith | john smith | john smith | JOHN SMITH | JOHN SMITH |
Jane Doe | jane doe | jane doe | JANE DOE | JANE DOE |
Bob Johnson | bob johnson | bob johnson | BOB JOHNSON | BOB JOHNSON |
Alice Brown | alice brown | alice brown | ALICE BROWN | ALICE BROWN |
Charlie Green | charlie green | charlie green | CHARLIE GREEN | CHARLIE GREEN |
David Lee | david lee | david lee | DAVID LEE | DAVID LEE |
json解析函数get_json_object
get_json_object(json_txt, path) - Extracts a json object from
path
.get_json_object(json_txt, path) -从path中提取一个json对象
SELECT get_json_object('{"name": "John", "age": 30}', '$.name') as name;
name |
John |
重复字符串函数:repeat
select name,repeat(name,2) as name2
from temp12
name | name2 |
John Doe | John DoeJohn Doe |
Jane Smith | Jane SmithJane Smith |
Bob Johnson | Bob JohnsonBob Johnson |
Alice Brown | Alice BrownAlice Brown |
Charlie Davis | Charlie DavisCharlie Davis |
URL解析函数:parse_url
https://learn.microsoft.com/zh-cn/azure/databricks/sql/language-manual/functions/parse_url
> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
spark.apache.org
> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');
query=1
> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');
1
> SELECT parse_url('http://spark. apache.org/path?query=1', 'QUERY', 'query');
Error: Illegal argument
集合查找函数: find_in_set
find_in_set(str, str_array) - Returns the index (1-based) of the given string (
str
) in the comma-delimited list (str_array
). Returns 0, if the string was not found or if the given string (str
) contains a comma.find_in_set(str,str_array)-返回逗号分隔列表(str_array)中给定字符串(str)的索引(基于1)。如果未找到字符串或给定字符串(str)包含逗号,则返回O。
语法: find_in_set(string str, string strList)
返回值: int
说明: 返回str在strlist第一次出现的位置,strlist是用逗号分割的字符串。如果没有找该str字符,则返回0
**使用:**select find_in_set(‘ab’,‘ef,ab,de’) from student limit 1;
2
> SELECT find_in_set(‘ab’,‘abc,b,ab,c,def’);
3
字符串反转函数:reverse
select name,reverse(name) as r_name
from temp12
name | r_name |
John Doe | eoD nhoJ |
Jane Smith | htimS enaJ |
Bob Johnson | nosnhoJ boB |
Alice Brown | nworB ecilA |
Charlie Davis | sivaD eilrahC |