1. clickhouse中的窗口函数

1.1. 功能实现情况

Window Functions 在clickhouse的需求和呼声很高,早期的版本需要借助array类函数【如groupArray, arrayJoin, groupArrayMovingSumgroupArrayMovingAvg functions】实现,在v21.3版本进行了开窗函数的初步支持,但只是实验版本,在v21.9版本正式支持,可以在生产中使用。

  • ClickHouse release v21.9, 2021-09-09
    Mark window functions as ready for general use. Remove the allow_experimental_window_functions setting. #27184 (Alexander Kuzmenkov).
  • ClickHouse release v21.3 (LTS), 2021-03-12
    Introduce experimental support for window functions, enabled with allow_experimental_window_functions = 1. This is a preliminary, alpha-quality implementation that is not suitable for production use and will change in backward-incompatible ways in future releases. Please see the documentation for the list of supported features. #20337 (Alexander Kuzmenkov).

目前支持或不支持的功能,详见clickhouse window-functions

1.2. 如何设置

在版本v21.9前,默认没有开启针对窗口函数的支持

1.2.1. 临时修改

使用clickhouse-client进入交互模式,执行如下命令:
set allow_experimental_window_functions = 1

1.2.2. 永久修改

在users.xml配置文件中增加如下配置:
<allow_experimental_window_functions>1</allow_experimental_window_functions>

通过如下SQL可以查询当前的配置
select * FROMsystem.settings s WHERE name like 'allow_experimental_window_functions'

2. 窗口函数

2.1.概念

窗口函数允许用户根据数据行与所谓窗口【so-called window】中的当前行之间的某种关系对数据行执行计算,所以使用窗口函数时,必须始终记住当前行。

PARTITION BY从句: 根据指定的列,将数据拆分不同的窗口进行处理

Order by 从句: 对每个窗口中的行进行排序

Frame从句:表示窗口函数对每一行计算相应的Frame,Frame总是与窗口内的当前行一起滑动,Frame由BETWEEN…AND…进行定义,依据当前行 或 与当前行的Range关系进行定义

peers【同伴】:在使用 RANGE frames时,

  • 如果指定了ORDER BY子句,则当前行的peers是属于同一排序桶的行,示例见移动求和的计算
  • 如果没有指定ORDER BY子句,那么窗口分区中的所有行都被认为是当前行的peers

2.2.简单使用示例

  • 移动平均线的计算: 一个序列中最后4个值的平均值 即当前行与前面3行 SELECT number, avg(number) OVER (ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS mv4 FROM values('number Int8', 1, 2, 3, 4, 5, 6, 7, 8)
  • 移动求和的计算:SELECT number, sum(number) OVER (ORDER BY number ASC RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) as sumVal FROM (select array(1,7,8,4,5,3,2,6) as number) array join number
  • 如果不指定order by,则窗口分区中的所有行都被认为是当前行的peers,即每行对应的Frame包含窗口中的所有行 SELECT number, sum(number) OVER (RANGE BETWEEN CURRENT ROW AND CURRENT ROW) as sumVal FROM (select array(1,7,8,4,5,3,2,6) as number) array join number
  • centos7安装clickhouse 单节点rpm方式 clickhouse windowfunnel_Window Function


2.3.定义语法

2.3.1.窗口定义部分

OVER ()|(window_spec)|named_window 

WINDOW window_name AS (window_spec)
    [, window_name AS (window_spec)] ..

window_spec:  [partition_clause] [order_clause] [frame_clause]
partition_clause: PARTITION BY expr [, expr] ...
order_clause: ORDER BY expr [ASC|DESC] [, expr [ASC|DESC]] ...
frame_clause: {ROWS | RANGE } frame_extent
frame_extent: {frame_start | frame_between}
frame_start: {
    CURRENT ROW
  | UNBOUNDED PRECEDING
  | UNBOUNDED FOLLOWING
  | expr PRECEDING
  | expr FOLLOWING
}
frame_between: BETWEEN frame_start AND frame_end
frame_end: {
    CURRENT ROW
  | UNBOUNDED PRECEDING
  | UNBOUNDED FOLLOWING
  | expr PRECEDING
  | expr FOLLOWING
}

2.3.2.函数定义部分

在定义了一个或多个窗口之后,可以在它们上面应用一个函数。
可以将两种类型的函数应用于窗口:

  • so-called window native functions
  • row_number: 为各组内值生成连续排序序号,其中,相同的值其序号也不相同。
  • rank: 各组内值生成跳跃排序序号,其中,相同的值具有相同序号。
  • dense_rank: 为各组内值生成连续排序序号,其中,相同的值具有相同序号。
  • first_value: 取各组内的第一个值作为返回结果
  • last_value: 取各组内的最后一个值作为返回结果
  • lead: 暂不支持lead(value any [, offset integer [, default any ]])为各组内对应值生成提前值。即当前值对应的行数向后偏移offset位后所得行的value值作为序号。若经过向后偏移后行数超过当前组内的总行数,则对应结果取为default值。若无指定,在默认情况下,offset取为1,default值取为NULL。
  • 变通实现:any(value) over (… rows between following and following )
  • lag: 暂不支持lag(value any [, offset integer [, default any ]])为各组内对应值生成滞后值。即当前值对应的行数往前偏移offset位后所得行的value值作为序号。若经过偏移后行数不存在,则对应结果取为default值。若无指定,在默认情况下,offset取为1,default值取为NULL。
  • any(value) over (… rows between preceding and preceding)
  • 常规聚合函数
  • 支持在Windows上应用所有的标准聚合函数,如min, max, avg等
  • 支持在Windows上应用特定的聚合函数,如argMin, argMax, avgWeighted等。
  • 注意: initializeAggregation, aggregate function combinators, and parametric aggregate functions 这类特殊函数还不支持

2.4.潜在问题

对于分布式表,当前实现强制在查询节点上执行窗口函数计算。这意味着窗口函数的计算不会分散在分片之间,而是每个分片都必须将其数据发送给查询节点来执行计算。如果数据量很大,那么查询节点可能会接收和处理大量数据,并可能耗尽内存。

3.测试示例

3.1.测试数据

测试使用 用户在电商网站行为的事件数据,数据可以从这里下载,共10001条记录
clickhouse-client -h ch01 --port 9000 --user default --password 123456 -q 'insert into ods.events format CSV' < events_10K.csv

--定义表结构
DROP TABLE IF EXISTS events;
CREATE TABLE events(
    date DateTime, 
    product_id String, 
    user_id Int64, 
    event String, 
    extra_data String
) ENGINE = MergeTree()
ORDER BY date;

--查询数据
select * from events

3.2.查询示例

3.2.1.累加计算

按天统计购买次数、累加购买次数、销售额、累加销售额

SELECT
    toDate(date) AS day,
    count() AS purchases,
    sum(purchases) OVER w AS cum_purchases,
    avg(purchases) OVER w AS avg_purchases,
    sum(JSONExtractFloat(extra_data, 'price')) AS revenue,
    sum(revenue) OVER w AS cum_revenue
FROM events
WHERE event = 'buy'
--and date between '2017-01-01 00:00:00' and '2017-01-05 00:00:00' and event='buy'
GROUP BY day
WINDOW  w AS (ORDER BY day ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY day ASC
LIMIT 10

centos7安装clickhouse 单节点rpm方式 clickhouse windowfunnel_Window Function_02

3.2.2.Moving counts and averages

计算30天的移动平均销售额,以及过去30天的销售额总和。

SELECT
    toDate(date) AS d,
    sum(JSONExtractFloat(extra_data, 'price') AS price) AS revenue,
    sum(revenue) OVER w AS `30d_total_revenue`,
    avg(revenue) OVER w AS `30d_avg_revenue`
FROM events
WHERE event = 'buy' and date between '2017-04-01 00:00:00' and '2017-06-01 00:00:00' and event='buy'
GROUP BY d
WINDOW  w AS (ORDER BY  d ASC ROWS BETWEEN 30 PRECEDING AND CURRENT ROW)
ORDER BY d ASC
LIMIT 10

3.2.3.partition by clause

计算每月的销售额及 当月所属年的总销售额

SELECT 
    month,
    revenue,
    revenue_year,
    revenue / revenue_year percentage
FROM (
    SELECT 
        toYYYYMM(date) month
        , toStartOfMonth(date) start_month_date
        , sum(JSONExtractFloat(extra_data, 'price')) AS revenue
        , sum(revenue) OVER (PARTITION BY toYear(start_month_date)) AS revenue_year
    FROM events
    WHERE event = 'buy' and date between '2017-01-01 00:00:00' and '2019-01-01 00:00:00' and event='buy'
    GROUP BY month, start_month_date
) data_per_month
ORDER BY month ASC
LIMIT 24

3.2.4.RANGE clause

计算过去1个小时的总销售额

SELECT
    date,
    count() OVER w AS purchases_past_hour,
    JSONExtractFloat(extra_data, 'price') AS revenue_event,
    sum(revenue_event) OVER w AS revenue_past_hour
FROM events
WHERE event = 'buy'
WINDOW  w AS (ORDER BY  date ASC RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW)
ORDER BY date ASC
LIMIT 40

3.2.5.RANK, ROW_NUMBER and DENSE_RANK

Clickhouse支持这些函数,它们的行为方式与Postgres相同。以下按每天进行分区,按所花费的钱进行排名(rank)

SELECT
    date,
    product_id,
    user_id,
    JSONExtractFloat(extra_data, 'price') AS price,
    rank() OVER w AS rank,
    dense_rank() OVER w AS dense_rank,
    row_number() OVER w AS row_number
FROM events
WHERE event = 'buy'
WINDOW  w AS (PARTITION BY  toDate(date) ORDER BY  price DESC)
ORDER BY date ASC
LIMIT 10

3.2.6. lead & lag

lag(上一个), lead(下一个)在clickhouse开窗中尚未支持,可通过间接方式实现,通过指定计算的行范围来实现

select toDate(date) AS day,
	date,
	JSONExtractFloat(extra_data, 'price') as price,
	any(price) OVER lagW as `lag上一个`,
	any(price) OVER leadW as `lead下一个`
from events 
where event='buy' and date between '2017-01-01 00:00:00' and '2017-01-05 00:00:00' 
WINDOW lagW as (ORDER BY price DESC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING),
	   leadW as (ORDER BY price DESC ROWS BETWEEN 1 following  AND 1 following)
order by price desc

centos7安装clickhouse 单节点rpm方式 clickhouse windowfunnel_窗口函数_03

参考

clickhouse-window-functions-current-state-of-the-artclickhouse window-functionspostgresql SQL-WINDOWpostgresql SYNTAX-WINDOW-FUNCTIONmysql window-functionmysql window-functions-usage