本文总结一些常用的字符串函数。还是在databricks社区版。

  • 字符串截取函数:substr \ substring
  • 字符串的长度函数 len \ length
  • 字符串定位函数 instr
  • 字符串分割函数 split \ split_part
  • 字符串去空格函数:trim \ ltrim \ rtrim
  • 字符串补足函数:lpad \ rpad
  • 字符串拼接函数: concat \ concat_ ws
  • 字符串替换函数: replace \ regexp_replace
  • 正则表达式相关函数:regexp
  • 字符串模糊匹配函数: like \ rlike
  • 字符串转大小写函数: lower Icase \ upper \ ucase
  • json解析函数get json_object
  • 重复字符串函数:repeat
  • URL解析函数:parse url
  • 集合查找函数:find_in_set
  • 字符串反转函数:reverse
CREATE TABLE temp (id int,name string,email string,phone string)
INSERT INTO temp VALUES
  (1, 'John Doe', 'john.doe@example.com', '123-456-7890'),
  (2, 'Jane Smith', 'jane.smith@example.com', '555-555-5555'),
  (3, 'Bob Johnson', 'bob.johnson@example.com', '555-123-4567'),
  (4, 'Alice Brown', 'alice.brown@example.com', '555-987-6543'),
  (5, 'Charlie Davis', 'charlie.davis@example.com', '555-111-2222');

id

name

email

phone

1

John Doe

john.doe@example.com

123-456-7890

2

Jane Smith

jane.smith@example.com

555-555-5555

3

Bob Johnson

bob.johnson@example.com

555-123-4567

4

Alice Brown

alice.brown@example.com

555-987-6543

5

Charlie Davis

charlie.davis@example.com

555-111-2222

字符串截取函数:substr \ substring

语法1: substr(string A, int start),substring(string A, int start)

返回值: string

**说明:**返回字符串A从start位置到结尾的字符串

语法2: substr(string A, int start, int len),substring(string A, int start, int len)

返回值: string

**说明:**返回字符串A从start位置开始,长度为len的字符串

select t1.*,substr(t1.phone,1,3) as area_number
from temp as t1

id

name

email

phone

area_number

1

John Doe

john.doe@example.com

123-456-7890

123

2

Jane Smith

jane.smith@example.com

555-555-5555

555

3

Bob Johnson

bob.johnson@example.com

555-123-4567

555

4

Alice Brown

alice.brown@example.com

555-987-6543

555

5

Charlie Davis

charlie.davis@example.com

555-111-2222

555

substring

substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

substring(str FROM pos[ FOR len]]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len.

substring(str, pos[, len])-返回从pos开始且长度为Len的str子字符串,或从pos开始且长度为len的字节数组切片。

substring(str FROM pos[ FOR len]])-返回str的子字符串,从pos开始,是长度len,或从pos开始且长度为en的字节数组切片

select 
   t1.phone
  ,substring(t1.phone,1,3) as area_num
  ,substring(t1.phone,9)as tail_num
  ,substring(t1.phone,-4) as tail_num
  ,substring(t1.phone from -4) as tail_num
  ,substring(t1.phone from 5 for 1) as mid_num -- 从第5位开始的第一个
  ,substring(t1.phone from 5 for 3) as mid_num 
from temp as t1

phone

area_num

tail_num

tail_num

tail_num

mid_num

mid_num

123-456-7890

123

7890

7890

7890

4

456

555-555-5555

555

5555

5555

5555

5

555

555-123-4567

555

4567

4567

4567

1

123

555-987-6543

555

6543

6543

6543

9

987

555-111-2222

555

2222

2222

2222

1

111

substring_index

substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for delim.

substring_index(str, Delim,count)-返回str中的子串,然后才出现分隔符的count。如果count为正,则返回分隔符左侧的所有内容(从左侧开始计数)。如果count为负,则返回最终分隔符右侧的所有内容(从右侧开始计数)。该函数substring_index在搜索Delim时执行区分大小写的匹配。

select 
   t1.email
  ,substring_index(t1.email,'.',1) as name
  ,substring_index(t1.email,'.',-1) as com
  ,substring_index(t1.email,'@',1) as before
  ,substring_index(t1.email,'@',-1) as after
from temp as t1

email

name

com

before

after

john.doe@example.com

john

com

john.doe

example.com

jane.smith@example.com

jane

com

jane.smith

example.com

bob.johnson@example.com

bob

com

bob.johnson

example.com

alice.brown@example.com

alice

com

alice.brown

example.com

charlie.davis@example.com

charlie

com

charlie.davis

example.com

SELECT substring_index('www.apache.org', '.', 2)as wangzhi
-- www.apache

字符串的长度函数 len \ length

select 
   name
  ,len(name) as l_name
  ,length(name) as l_name
from temp

name

l_name

l_name

John Doe

8

8

Jane Smith

10

10

Bob Johnson

11

11

Alice Brown

11

11

Charlie Davis

13

13

字符串定位函数 instr

-- 注意细节
select 
    email
   ,instr(email,'@') as in_a
   ,len(email) as l_email
   ,substr(email,instr(email,'@')) as e_before
   ,substr(email,instr(email,'@'),len(email)) as e_before
   ,substr(email,instr(email,'@')+1,len(email)) as e_com
   ,substr(email,1,instr(email,'@')-1) as e_after
from temp

email

in_a

l_email

e_before

e_before

e_com

e_after

john.doe@example.com

9

20

@example.com

@example.com

example.com

john.doe

jane.smith@example.com

11

22

@example.com

@example.com

example.com

jane.smith

bob.johnson@example.com

12

23

@example.com

@example.com

example.com

bob.johnson

alice.brown@example.com

12

23

@example.com

@example.com

example.com

alice.brown

charlie.davis@example.com

14

25

@example.com

@example.com

example.com

charlie.davis

字符串分割 split \ split_part

split(str, regex, limit) - Splits str around occurrences that match regex and returns an array with a length of at most limit

Arguments:

  • str - a string expression to split.
  • regex - a string representing a regular expression. The regex string should be a Java regular expression.
  • limit - an integer expression which controls the number of times the regex is applied.
  • limit > 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will contain all input beyond the last matched regex.
  • limit <= 0: regex will be applied as many times as possible, and the resulting array can be of any size.

拆分(str, regex,limited)-将str拆分为与regex匹配的事件,并返回一个长度最多为限制的数组

论据:

  • str-要拆分的字符串表达式。
  • 正则表达式-表示正则表达式的字符串。正则表达式字符串应该是一个Java的正则表达式。
  • 限制-控制应用正则表达式的次数的整数表达式。
  • 限制>0:结果数组的长度不会超过限制,并且结果数组的最后一个条目将包含超过最后一个匹配正则表达式的所有输入。
  • limited<=0:正则表达式将被应用尽可能多的次数,并且生成的数组可以是任何大小。
split
select 
    email
   ,split(email,'example.') as e_before
   ,split(email,'@') as e_before
from temp12

email

e_before

e_before

john.doe@example.com

[“john.doe@”,“com”]

[“john.doe”,“example.com”]

jane.smith@example.com

[“jane.smith@”,“com”]

[“jane.smith”,“example.com”]

bob.johnson@example.com

[“bob.johnson@”,“com”]

[“bob.johnson”,“example.com”]

alice.brown@example.com

[“alice.brown@”,“com”]

[“alice.brown”,“example.com”]

charlie.davis@example.com

[“charlie.davis@”,“com”]

[“charlie.davis”,“example.com”]

select 
    split('001_prod_7532_20230510','_',1) as e_1
   ,split('001_prod_7532_20230510','_',2) as e_2
   ,split('001_prod_7532_20230510','_',3) as e_3
 -- 但是不是我想要的形式

e_1

e_2

e_3

[“001_prod_7532_20230510”]

[“001”,“prod_7532_20230510”]

[“001”,“prod”,“7532_20230510”]

split_part

split_part(str, delimiter, partNum) - Splits str by delimiter and return requested part of the split (1-based). If any input is null, returns null. if partNum is out of range of split parts, returns empty string. If partNum is 0, throws an error. If partNum is negative, the parts are counted backward from the end of the string. If the delimiter is an empty string, the str is not split.

split_part(str, delimiter, partNum) -按分隔符拆分str并返回拆分的请求部分(基于1)。如果任何输入为空,则返回空。如果partNum超出分割部分的范围,则返回空字符串。如果partNum为0,则抛出错误。如果partNum为负数,则从字符串末尾开始向后计数。如果分隔符是空字符串,则不拆分该str。

-- 这下就对了
select 
    email
   ,split_part(email,'.',1) as e_before
   ,split_part(email,'.',2) as e_before
from temp12

email

e_before

e_before

john.doe@example.com

john

doe@example

jane.smith@example.com

jane

smith@example

bob.johnson@example.com

bob

johnson@example

alice.brown@example.com

alice

brown@example

charlie.davis@example.com

charlie

davis@example

select 
    split_part('001_prod_7532_20230510','_',1) as e_1
   ,split_part('001_prod_7532_20230510','_',2) as e_2
   ,split_part('001_prod_7532_20230510','_',3) as e_3

e_1

e_2

e_3

001

prod

7532

去空格函数:trim \ ltrim \ rtrim

trim(str) - Removes the leading and trailing space characters from str.

trim(BOTH FROM str) - Removes the leading and trailing space characters from str.

trim(LEADING FROM str) - Removes the leading space characters from str.

trim(TRAILING FROM str) - Removes the trailing space characters from str.

trim(trimStr FROM str) - Remove the leading and trailing trimStr characters from str.

trim(BOTH trimStr FROM str) - Remove the leading and trailing trimStr characters from str.

trim(LEADING trimStr FROM str) - Remove the leading trimStr characters from str.

trim(TRAILING trimStr FROM str) - Remove the trailing trimStr characters from str.

Arguments:

  • str - a string expression
  • trimStr - the trim string characters to trim, the default value is a single space
  • BOTH, FROM - these are keywords to specify trimming string characters from both ends of the string
  • LEADING, FROM - these are keywords to specify trimming string characters from the left end of the string
  • TRAILING, FROM - these are keywords to specify trimming string characters from the right end of the string

trim(str) -从str中删除前后空格字符
trim(BOTH FROM str) -从str中删除前导和尾随空格字符。
trim(LEADING FROM str) -从str中删除前导空格字符
trim(TRAILING FROM str) -从str中删除尾随空格字符
trim(trimStr FROM str) -从str中删除前后的trimStr字符
trim(BOTH trimStr FROM str) -从str中删除开头和结尾的trinstr字符。
trim(LEADING trimStr FROM str) -从str中删除前导trimStr字符
trim(TRAILING trimStr FROM str) -从str中删除尾随的trinstr字符
参数:
•STR—字符串表达式
•trim Str -要修剪的字符串字符,默认值是一个空格

•BOTH, FROM -这些关键字用于指定从字符串的两端修剪字符串字符
•LEADING, FROM -这些关键字用于指定从字符串的左端修剪字符串字符
•TRAILING, FROM -这些关键字用于指定从字符串的右端修剪字符串字符

trim
select 
   lizi
  ,length(lizi) as l_lizi
  ,trim(lizi) as lizi1
  ,ltrim(lizi) as l_trim
  ,rtrim(lizi) as r_trim
  ,len(ltrim(lizi)) as l_trim_l
  ,len(rtrim(lizi)) as r_trim_l
  ,trim(both from lizi) as lizi2
  ,trim(LEADING from lizi) as lizi3
  ,trim(TRAILING from lizi) as lizi4
from 
(
  select ' SparkSQLHive  ' as lizi
  union all 
  select ' SparkSQLHive  ' as lizi
  union all 
  select ' SparkSQLHive  ' as lizi
  union all 
  select ' SparkSQLHive  ' as lizi
)

lizi

l_lizi

lizi1

l_trim

r_trim

l_trim_l

r_trim_l

lizi2

lizi3

lizi4

SparkSQLHive

15

SparkSQLHive

SparkSQLHive

SparkSQLHive

14

13

SparkSQLHive

SparkSQLHive

SparkSQLHive

SparkSQLHive

15

SparkSQLHive

SparkSQLHive

SparkSQLHive

14

13

SparkSQLHive

SparkSQLHive

SparkSQLHive

SparkSQLHive

15

SparkSQLHive

SparkSQLHive

SparkSQLHive

14

13

SparkSQLHive

SparkSQLHive

SparkSQLHive

SparkSQLHive

15

SparkSQLHive

SparkSQLHive

SparkSQLHive

14

13

SparkSQLHive

SparkSQLHive

SparkSQLHive

select 
   lizi
  ,trim('SL' from lizi) as lizi5
  ,trim(LEADING 'SL' from lizi) as lizi6
  ,trim(TRAILING 'SL' from lizi) as lizi7
from
  (
    select 'SSparkSQLS' as lizi
  )

lizi

lizi5

lizi6

lizi7

SSparkSQLS

parkSQ

parkSQLS

SSparkSQ

补足函数:lpad \ rpad

lpad(str, len[, pad]) - Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters or bytes. If pad is not specified, str will be padded to the left with space characters if it is a character string, and with zeros if it is a byte sequence.

lpad(str, lenl,padl)-返回str,用pad左填充到len的长度如果str比len长,则返回值缩短为len字符或字节。如果未指定pad,如果是字符串,str将在左侧填充空格字符,如果是字节序列,则填充零。

select 
   name
  ,lpad(name,20,'20230520') as l_name
  ,rpad(name,20,'20230520') as r_name
  ,len(lpad(name,20,'20230520')) as len_name
from 
  temp12

拼接函数 concat \ concat_ws

concat

concat(col1, col2, …, colN) - Returns the concatenation of col1, col2, …, colN.

Concat (col1,col2,. a,colN)-返回 col1,col2,. ,CoIN 的串联。

concat_ws

concat_ws(sep[, str | array(str)]+) - Returns the concatenation of the strings separated by sep.

Concat _ ws (sepl,str l array (str) J +)-返回由 sep 分隔的字符串的串联

select 
  id,name
  ,concat(id,name) as id_name
  ,concat(id,'_',name) as id_name2
  ,concat_ws(id,'_',name) as id_name3
  ,concat_ws('_',id,name) as id_name4
  ,id||'_'||name as id_name5
from 
  temp12

id

name

id_name

id_name2

id_name3

id_name4

id_name5

1

John Doe

1John Doe

1_John Doe

_1John Doe

1_John Doe

1_John Doe

2

Jane Smith

2Jane Smith

2_Jane Smith

_2Jane Smith

2_Jane Smith

2_Jane Smith

3

Bob Johnson

3Bob Johnson

3_Bob Johnson

_3Bob Johnson

3_Bob Johnson

3_Bob Johnson

4

Alice Brown

4Alice Brown

4_Alice Brown

_4Alice Brown

4_Alice Brown

4_Alice Brown

5

Charlie Davis

5Charlie Davis

5_Charlie Davis

_5Charlie Davis

5_Charlie Davis

5_Charlie Davis

替换函数 replace \ regexp_replace

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Arguments:

  • str - a string expression
  • search - a string expression. If search is not found in str, str is returned unchanged.
  • replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

replace(str, search[, replace]) -用replace替换所有搜索
参数:
•str -一个字符串表达式
•search—字符串表达式。如果在str中没有找到搜索,则返回不变的str。
•replace -一个字符串表达式。如果replace未指定或为空字符串,则不会替换从str中删除的字符串

select 
   email
  ,replace(email,'@') as e_12
  ,replace(email,'@','') as e_12
  ,replace(email,'@','#') as e_13
  ,regexp_replace(email,'@','#') as e_14
from 
  temp12

email

e_12

e_12

e_13

e_14

john.doe@example.com

john.doeexample.com

john.doeexample.com

john.doe#example.com

john.doe#example.com

jane.smith@example.com

jane.smithexample.com

jane.smithexample.com

jane.smith#example.com

jane.smith#example.com

bob.johnson@example.com

bob.johnsonexample.com

bob.johnsonexample.com

bob.johnson#example.com

bob.johnson#example.com

alice.brown@example.com

alice.brownexample.com

alice.brownexample.com

alice.brown#example.com

alice.brown#example.com

charlie.davis@example.com

charlie.davisexample.com

charlie.davisexample.com

charlie.davis#example.com

charlie.davis#example.com

正则表达式相关函数:regexp

CREATE TABLE example_data (
  id INT,
  name VARCHAR(50),
  qq_email VARCHAR(50),
  website VARCHAR(50)
);

INSERT INTO example_data VALUES
(1, 'John Smith', 'john.smith@qq.com', 'www.example.com'),
(2, 'Jane Doe', 'john.smith@qq.com', 'www.example2.com'),
(3, 'Bob Johnson', 'bob.johnson@qq.com', 'www.example2.com'),
(4, 'Alice Brown', 'alice.brown@qq.com', 'www.example.com'),
(5, 'Charlie Green', 'alice.brown@qq.com', 'www.example.com'),
(6, 'David Lee', 'david.lee@qq.com', 'www.example2.com');
select * from example_data

id

name

qq_email

website

1

John Smith

john.smith@qq.com

www.example.com

2

Jane Doe

john.smith@qq.com

www.example2.com

3

Bob Johnson

bob.johnson@qq.com

www.example2.com

4

Alice Brown

alice.brown@qq.com

www.example.com

5

Charlie Green

alice.brown@qq.com

www.example.com

6

David Lee

david.lee@qq.com

www.example2.com

regexp
SELECT * FROM example_data 
WHERE regexp(website, '.*example2.*');

id

name

qq_email

website

2

Jane Doe

john.smith@qq.com

www.example2.com

3

Bob Johnson

bob.johnson@qq.com

www.example2.com

6

David Lee

david.lee@qq.com

www.example2.com

SELECT * FROM example_data 
WHERE regexp(name, '^J.*');

id

name

qq_email

website

1

John Smith

john.smith@qq.com

www.example.com

2

Jane Doe

john.smith@qq.com

www.example2.com

正则表达式替换函数:regexp_replace

语法: regexp_replace(string A, string B, string C)

返回值: string

**说明:**将字符串A中的符合java正则表达式B的部分替换为C。注意,在有些情况下要使用转义字符,类似oracle中的regexp_replace函数

regexp_count
SELECT id, qq_email, regexp_count(qq_email, '.*qq.*') AS count
FROM example_data
WHERE regexp(qq_email, '.*qq.*')
ORDER BY count DESC;

id

qq_email

count

1

john.smith@qq.com

1

2

john.smith@qq.com

1

3

bob.johnson@qq.com

1

4

alice.brown@qq.com

1

5

alice.brown@qq.com

1

6

david.lee@qq.com

1

正则表达式解析函数:regexp_extract
SELECT qq_email, 
       regexp_extract(qq_email, '^(.+)@', 1) AS username, 
       regexp_extract(qq_email, '@(.+)$', 1) AS domain 
FROM example_data;

qq_email

username

domain

john.smith@qq.com

john.smith

qq.com

john.smith@qq.com

john.smith

qq.com

bob.johnson@qq.com

bob.johnson

qq.com

alice.brown@qq.com

alice.brown

qq.com

alice.brown@qq.com

alice.brown

qq.com

david.lee@qq.com

david.lee

qq.com

SELECT qq_email, 
       regexp_instr(qq_email, 'o') AS position 
FROM example_data

qq_email

position

john.smith@qq.com

2

john.smith@qq.com

2

bob.johnson@qq.com

2

alice.brown@qq.com

9

alice.brown@qq.com

9

david.lee@qq.com

15

SELECT qq_email 
FROM example_data
WHERE regexp_like(qq_email, 'smith');

qq_email

john.smith@qq.com

john.smith@qq.com

模糊匹配函数 like / rlike

SELECT *
FROM example_data
WHERE qq_email LIKE '%smith@qq.com';

id

name

qq_email

website

1

John Smith

john.smith@qq.com

www.example.com

2

Jane Doe

john.smith@qq.com

www.example2.com

SELECT *
FROM example_data
WHERE qq_email RLIKE 'brown';

id

name

qq_email

website

4

Alice Brown

alice.brown@qq.com

www.example.com

5

Charlie Green

alice.brown@qq.com

www.example.com

字符串转大小写函数 lower \ lcase \ upper \ ucase

select 
    name
   ,lower(name) as l_name
   ,lcase(name) as l_name
   ,upper(name) as u_name
   ,ucase(name) as u_name
from example_data

name

l_name

l_name

u_name

u_name

John Smith

john smith

john smith

JOHN SMITH

JOHN SMITH

Jane Doe

jane doe

jane doe

JANE DOE

JANE DOE

Bob Johnson

bob johnson

bob johnson

BOB JOHNSON

BOB JOHNSON

Alice Brown

alice brown

alice brown

ALICE BROWN

ALICE BROWN

Charlie Green

charlie green

charlie green

CHARLIE GREEN

CHARLIE GREEN

David Lee

david lee

david lee

DAVID LEE

DAVID LEE

json解析函数get_json_object

get_json_object(json_txt, path) - Extracts a json object from path.

get_json_object(json_txt, path) -从path中提取一个json对象

SELECT get_json_object('{"name": "John", "age": 30}', '$.name') as name;

name

John

重复字符串函数:repeat

select name,repeat(name,2) as name2
from temp12

name

name2

John Doe

John DoeJohn Doe

Jane Smith

Jane SmithJane Smith

Bob Johnson

Bob JohnsonBob Johnson

Alice Brown

Alice BrownAlice Brown

Charlie Davis

Charlie DavisCharlie Davis

URL解析函数:parse_url

https://learn.microsoft.com/zh-cn/azure/databricks/sql/language-manual/functions/parse_url

> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
 spark.apache.org

> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');
 query=1

> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');
 1

> SELECT parse_url('http://spark.  apache.org/path?query=1', 'QUERY', 'query');
Error: Illegal argument

集合查找函数: find_in_set

find_in_set(str, str_array) - Returns the index (1-based) of the given string (str) in the comma-delimited list (str_array). Returns 0, if the string was not found or if the given string (str) contains a comma.

find_in_set(str,str_array)-返回逗号分隔列表(str_array)中给定字符串(str)的索引(基于1)。如果未找到字符串或给定字符串(str)包含逗号,则返回O。

语法: find_in_set(string str, string strList)

返回值: int

说明: 返回str在strlist第一次出现的位置,strlist是用逗号分割的字符串。如果没有找该str字符,则返回0

**使用:**select find_in_set(‘ab’,‘ef,ab,de’) from student limit 1;

2

> SELECT find_in_set(‘ab’,‘abc,b,ab,c,def’);

3

字符串反转函数:reverse

select name,reverse(name) as r_name
from temp12

name

r_name

John Doe

eoD nhoJ

Jane Smith

htimS enaJ

Bob Johnson

nosnhoJ boB

Alice Brown

nworB ecilA

Charlie Davis

sivaD eilrahC