1. clickhouse中的窗口函数
1.1. 功能实现情况
Window Functions 在clickhouse的需求和呼声很高,早期的版本需要借助array类函数【如groupArray
, arrayJoin
, groupArrayMovingSum
, groupArrayMovingAvg functions
】实现,在v21.3版本进行了开窗函数的初步支持,但只是实验版本,在v21.9版本正式支持,可以在生产中使用。
- ClickHouse release v21.9, 2021-09-09
Mark window functions as ready for general use. Remove the allow_experimental_window_functions setting. #27184 (Alexander Kuzmenkov). - ClickHouse release v21.3 (LTS), 2021-03-12
Introduce experimental support for window functions, enabled with allow_experimental_window_functions = 1. This is a preliminary, alpha-quality implementation that is not suitable for production use and will change in backward-incompatible ways in future releases. Please see the documentation for the list of supported features. #20337 (Alexander Kuzmenkov).
目前支持或不支持的功能,详见clickhouse window-functions
1.2. 如何设置
在版本v21.9前,默认没有开启针对窗口函数的支持
1.2.1. 临时修改
使用clickhouse-client进入交互模式,执行如下命令:set allow_experimental_window_functions = 1
1.2.2. 永久修改
在users.xml配置文件中增加如下配置:<allow_experimental_window_functions>1</allow_experimental_window_functions>
通过如下SQL可以查询当前的配置select * FROM
system.settings s WHERE name like 'allow_experimental_window_functions'
2. 窗口函数
2.1.概念
窗口函数允许用户根据数据行与所谓窗口【so-called window】中的当前行之间的某种关系对数据行执行计算,所以使用窗口函数时,必须始终记住当前行。
PARTITION BY从句: 根据指定的列,将数据拆分不同的窗口进行处理
Order by 从句: 对每个窗口中的行进行排序
Frame从句:表示窗口函数对每一行计算相应的Frame,Frame总是与窗口内的当前行一起滑动,Frame由BETWEEN…AND…进行定义,依据当前行 或 与当前行的Range关系进行定义
peers【同伴】:在使用 RANGE frames时,
- 如果指定了ORDER BY子句,则当前行的peers是属于同一排序桶的行,示例见
移动求和的计算
- 如果没有指定ORDER BY子句,那么窗口分区中的所有行都被认为是当前行的peers
2.2.简单使用示例
- 移动平均线的计算: 一个序列中最后4个值的平均值 即当前行与前面3行
SELECT number, avg(number) OVER (ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mv4 FROM values('number Int8', 1, 2, 3, 4, 5, 6, 7, 8)
- 移动求和的计算:
SELECT number, sum(number) OVER (ORDER BY number ASC RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) as sumVal FROM (select array(1,7,8,4,5,3,2,6) as number) array join number
- 如果不指定order by,则窗口分区中的所有行都被认为是当前行的peers,即每行对应的Frame包含窗口中的所有行
SELECT number, sum(number) OVER (RANGE BETWEEN CURRENT ROW AND CURRENT ROW) as sumVal FROM (select array(1,7,8,4,5,3,2,6) as number) array join number
2.3.定义语法
2.3.1.窗口定义部分
OVER ()|(window_spec)|named_window
WINDOW window_name AS (window_spec)
[, window_name AS (window_spec)] ..
window_spec: [partition_clause] [order_clause] [frame_clause]
partition_clause: PARTITION BY expr [, expr] ...
order_clause: ORDER BY expr [ASC|DESC] [, expr [ASC|DESC]] ...
frame_clause: {ROWS | RANGE } frame_extent
frame_extent: {frame_start | frame_between}
frame_start: {
CURRENT ROW
| UNBOUNDED PRECEDING
| UNBOUNDED FOLLOWING
| expr PRECEDING
| expr FOLLOWING
}
frame_between: BETWEEN frame_start AND frame_end
frame_end: {
CURRENT ROW
| UNBOUNDED PRECEDING
| UNBOUNDED FOLLOWING
| expr PRECEDING
| expr FOLLOWING
}
2.3.2.函数定义部分
在定义了一个或多个窗口之后,可以在它们上面应用一个函数。
可以将两种类型的函数应用于窗口:
- so-called window native functions
- row_number: 为各组内值生成连续排序序号,其中,相同的值其序号也不相同。
- rank: 各组内值生成跳跃排序序号,其中,相同的值具有相同序号。
- dense_rank: 为各组内值生成连续排序序号,其中,相同的值具有相同序号。
- first_value: 取各组内的第一个值作为返回结果
- last_value: 取各组内的最后一个值作为返回结果
- lead: 暂不支持
lead(value any [, offset integer [, default any ]])
为各组内对应值生成提前值。即当前值对应的行数向后偏移offset位后所得行的value值作为序号。若经过向后偏移后行数超过当前组内的总行数,则对应结果取为default值。若无指定,在默认情况下,offset取为1,default值取为NULL。
- 变通实现:any(value) over (… rows between following and following )
- lag: 暂不支持
lag(value any [, offset integer [, default any ]])
为各组内对应值生成滞后值。即当前值对应的行数往前偏移offset位后所得行的value值作为序号。若经过偏移后行数不存在,则对应结果取为default值。若无指定,在默认情况下,offset取为1,default值取为NULL。
- any(value) over (… rows between preceding and preceding)
- 常规聚合函数:
- 支持在Windows上应用所有的标准聚合函数,如min, max, avg等
- 支持在Windows上应用特定的聚合函数,如argMin, argMax, avgWeighted等。
- 注意: initializeAggregation, aggregate function combinators, and parametric aggregate functions 这类特殊函数还不支持
2.4.潜在问题
对于分布式表,当前实现强制在查询节点上执行窗口函数计算。这意味着窗口函数的计算不会分散在分片之间,而是每个分片都必须将其数据发送给查询节点来执行计算。如果数据量很大,那么查询节点可能会接收和处理大量数据,并可能耗尽内存。
3.测试示例
3.1.测试数据
测试使用 用户在电商网站行为的事件数据,数据可以从这里下载,共10001条记录clickhouse-client -h ch01 --port 9000 --user default --password 123456 -q 'insert into ods.events format CSV' < events_10K.csv
--定义表结构
DROP TABLE IF EXISTS events;
CREATE TABLE events(
date DateTime,
product_id String,
user_id Int64,
event String,
extra_data String
) ENGINE = MergeTree()
ORDER BY date;
--查询数据
select * from events
3.2.查询示例
3.2.1.累加计算
按天统计购买次数、累加购买次数、销售额、累加销售额
SELECT
toDate(date) AS day,
count() AS purchases,
sum(purchases) OVER w AS cum_purchases,
avg(purchases) OVER w AS avg_purchases,
sum(JSONExtractFloat(extra_data, 'price')) AS revenue,
sum(revenue) OVER w AS cum_revenue
FROM events
WHERE event = 'buy'
--and date between '2017-01-01 00:00:00' and '2017-01-05 00:00:00' and event='buy'
GROUP BY day
WINDOW w AS (ORDER BY day ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY day ASC
LIMIT 10
3.2.2.Moving counts and averages
计算30天的移动平均销售额,以及过去30天的销售额总和。
SELECT
toDate(date) AS d,
sum(JSONExtractFloat(extra_data, 'price') AS price) AS revenue,
sum(revenue) OVER w AS `30d_total_revenue`,
avg(revenue) OVER w AS `30d_avg_revenue`
FROM events
WHERE event = 'buy' and date between '2017-04-01 00:00:00' and '2017-06-01 00:00:00' and event='buy'
GROUP BY d
WINDOW w AS (ORDER BY d ASC ROWS BETWEEN 30 PRECEDING AND CURRENT ROW)
ORDER BY d ASC
LIMIT 10
3.2.3.partition by clause
计算每月的销售额及 当月所属年的总销售额
SELECT
month,
revenue,
revenue_year,
revenue / revenue_year percentage
FROM (
SELECT
toYYYYMM(date) month
, toStartOfMonth(date) start_month_date
, sum(JSONExtractFloat(extra_data, 'price')) AS revenue
, sum(revenue) OVER (PARTITION BY toYear(start_month_date)) AS revenue_year
FROM events
WHERE event = 'buy' and date between '2017-01-01 00:00:00' and '2019-01-01 00:00:00' and event='buy'
GROUP BY month, start_month_date
) data_per_month
ORDER BY month ASC
LIMIT 24
3.2.4.RANGE clause
计算过去1个小时的总销售额
SELECT
date,
count() OVER w AS purchases_past_hour,
JSONExtractFloat(extra_data, 'price') AS revenue_event,
sum(revenue_event) OVER w AS revenue_past_hour
FROM events
WHERE event = 'buy'
WINDOW w AS (ORDER BY date ASC RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW)
ORDER BY date ASC
LIMIT 40
3.2.5.RANK, ROW_NUMBER and DENSE_RANK
Clickhouse支持这些函数,它们的行为方式与Postgres相同。以下按每天进行分区,按所花费的钱进行排名(rank)
SELECT
date,
product_id,
user_id,
JSONExtractFloat(extra_data, 'price') AS price,
rank() OVER w AS rank,
dense_rank() OVER w AS dense_rank,
row_number() OVER w AS row_number
FROM events
WHERE event = 'buy'
WINDOW w AS (PARTITION BY toDate(date) ORDER BY price DESC)
ORDER BY date ASC
LIMIT 10
3.2.6. lead & lag
lag(上一个), lead(下一个)在clickhouse开窗中尚未支持,可通过间接方式实现,通过指定计算的行范围来实现
select toDate(date) AS day,
date,
JSONExtractFloat(extra_data, 'price') as price,
any(price) OVER lagW as `lag上一个`,
any(price) OVER leadW as `lead下一个`
from events
where event='buy' and date between '2017-01-01 00:00:00' and '2017-01-05 00:00:00'
WINDOW lagW as (ORDER BY price DESC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING),
leadW as (ORDER BY price DESC ROWS BETWEEN 1 following AND 1 following)
order by price desc
参考
clickhouse-window-functions-current-state-of-the-artclickhouse window-functionspostgresql SQL-WINDOWpostgresql SYNTAX-WINDOW-FUNCTIONmysql window-functionmysql window-functions-usage