最近我们要根据 平时的 投放日志 统计出 每天的 活跃用户,留存 等指标,为此我进行了细致的调研。

我们的留存规则如下: 对于第1天的新增用户,如果第i 天,该用户有访问行为,则认为该用户为留存用户。

 

为此,我们简化为如下3张表的一个统计需求。

日志表,全量用户表,每天去重用户表

 

表结构介绍:

 

这几张表的字段如下:

 

日志表

日志表主要有以下几个字段:

id  用户id

action 用户的具体行为

day  产生日志的时间 (按天分区)

 

日志表存储了每天的全量日志,不会对用户维度的数据进行去重,结构和数据如下:

0: jdbc:hive2://cdh-manager:10000> show create table org_log;
INFO  : Compiling command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea): show create table org_log
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea); Time taken: 0.053 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea): show create table org_log
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20190411041523_d77fcf8d-74d4-4d56-a70e-47464e48beea); Time taken: 0.044 seconds
INFO  : OK
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `org_log`(                            |
|   `id` int,                                        |
|   `action` string)                                 |
| PARTITIONED BY (                                   |
|   `day` date)                                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/org_log' |
| TBLPROPERTIES (                                    |
|   'last_modified_by'='hive',                       |
|   'last_modified_time'='1554816805',               |
|   'transient_lastDdlTime'='1554816805')            |
+----------------------------------------------------+

 

日志表内的数据:

0: jdbc:hive2://cdh-manager:10000> select * from org_log;
INFO  : Compiling command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5): select * from org_log
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:org_log.id, type:int, comment:null), FieldSchema(name:org_log.action, type:string, comment:null), FieldSchema(name:org_log.day, type:date, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5); Time taken: 0.623 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5): select * from org_log
INFO  : Completed executing command(queryId=hive_20190411041725_54407313-8d5f-499a-ad9e-33f77e6acba5); Time taken: 0.001 seconds
INFO  : OK
+-------------+-----------------+--------------+
| org_log.id  | org_log.action  | org_log.day  |
+-------------+-----------------+--------------+
| 1           | find            | 2019-04-09   |
| 2           | touch           | 2019-04-09   |
| 1           | touch           | 2019-04-09   |
| 3           | touch           | 2019-04-09   |
| 4           | touch           | 2019-04-10   |
| 1           | touch           | 2019-04-10   |
| 2           | find            | 2019-04-10   |
| 2           | key             | 2019-04-10   |
| 5           | touch           | 2019-04-10   |
| 1           | touch           | 2019-04-11   |
| 2           | touch           | 2019-04-11   |
| 3           | touch           | 2019-04-11   |
| 1           | touch           | 2019-04-12   |
| 2           | touch           | 2019-04-12   |
| 3           | touch           | 2019-04-12   |
+-------------+-----------------+--------------+
15 rows selected (1.16 seconds)

 

 

----------------------------------------------------------------

 

 

全量用户表

全量用户表 ,存储了所有去重后的用户

表内主要有2个字段 : 

id   用户id

create_day  用户第一次有记录的时间

0: jdbc:hive2://cdh-manager:10000> show create table all_user;
INFO  : Compiling command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02): show create table all_user
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02); Time taken: 0.035 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02): show create table all_user
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20190411041955_b65eedab-5205-4443-9ea2-8e6f1cf09a02); Time taken: 0.016 seconds
INFO  : OK
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `all_user`(                           |
|   `id` int,                                        |
|   `create_day` date)                               |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/all_user' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1554819131')            |
+----------------------------------------------------+
13 rows selected (0.091 seconds)

 

内部数据:

0: jdbc:hive2://cdh-manager:10000> select * from all_user;
INFO  : Compiling command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac): select * from all_user
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:all_user.id, type:int, comment:null), FieldSchema(name:all_user.create_day, type:date, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac); Time taken: 0.086 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac): select * from all_user
INFO  : Completed executing command(queryId=hive_20190411045422_3b3e77eb-f023-4a31-a683-1ae2b9de81ac); Time taken: 0.001 seconds
INFO  : OK
+--------------+----------------------+
| all_user.id  | all_user.create_day  |
+--------------+----------------------+
| 1            | 2019-04-09           |
| 2            | 2019-04-09           |
| 3            | 2019-04-09           |
| 4            | 2019-04-10           |
| 5            | 2019-04-10           |
+--------------+----------------------+
5 rows selected (0.158 seconds)

 

插入数据脚本:

insert into org_log partition(day='2019-04-11') values (1,'touch'),(2,'touch'),(3,'touch');

 

 

---------------------------------------------

每天去重用户表

每天去重用户表,存储了每天去重后的用户 : 

表内主要有以下3个字段:

id  用户id

day 日期(按天分区)

0: jdbc:hive2://cdh-manager:10000> show create table daily_active;
INFO  : Compiling command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5): show create table daily_active
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5); Time taken: 0.037 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5): show create table daily_active
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20190411045050_933fd2c2-e4fd-4aff-8943-f1f5747120b5); Time taken: 0.02 seconds
INFO  : OK
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `daily_active`(                       |
|   `id` int)                                        |
| PARTITIONED BY (                                   |
|   `day` date)                                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/keep_active_test.db/daily_active' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1554866026')            |
+----------------------------------------------------+
14 rows selected (0.105 seconds)

内部数据:

0: jdbc:hive2://cdh-manager:10000> select * from daily_active;
INFO  : Compiling command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c): select * from daily_active
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:daily_active.id, type:int, comment:null), FieldSchema(name:daily_active.day, type:date, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c); Time taken: 0.126 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c): select * from daily_active
INFO  : Completed executing command(queryId=hive_20190411045310_215d6bfd-ece8-49e0-8ebb-80e03765922c); Time taken: 0.001 seconds
INFO  : OK
+------------------+-------------------+
| daily_active.id  | daily_active.day  |
+------------------+-------------------+
| 1                | 2019-04-09        |
| 2                | 2019-04-09        |
| 3                | 2019-04-09        |
| 1                | 2019-04-10        |
| 2                | 2019-04-10        |
| 4                | 2019-04-10        |
| 5                | 2019-04-10        |
| 1                | 2019-04-11        |
| 2                | 2019-04-11        |
| 3                | 2019-04-11        |
| 1                | 2019-04-12        |
| 2                | 2019-04-12        |
| 3                | 2019-04-12        |
+------------------+-------------------+
13 rows selected (0.259 seconds)

 

 

--------------------------------------------

 

 

统计流程 及 HQL 脚本:

流程如下:

第一步. 得出每天的去重用户

第二步. 根据当前的去重用户,更新全量用户表

第三步. 选取当天 之前的 1-30 day 用户,计算留存

 

计算当天的去重用户的 HQL 以及脚本:

HQL : 

[root@cdh-manager active_test]# cat insert_daily_active.hql 
use keep_active_test;

INSERT OVERWRITE table  daily_active partition(day = '${cal_day}')  SELECT DISTINCT ID FROM org_log WHERE day = '${cal_day}';

脚本:

[root@cdh-manager active_test]# cat insert_daily_active.sh 
#!/bin/bash

beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_daily_active.hql --hivevar cal_day=$1

 

更新全量用户的HQL 以及脚本:

HQL :

[root@cdh-manager active_test]# cat insert_distinct_alluser.hql 

use keep_active_test;

INSERT INTO table all_user (id, create_day)
SELECT  tmp.id, '${cal_day}' AS create_day FROM ( 
	SELECT DISTINCT id FROM org_log WHERE day = '${cal_day}'
) AS tmp
WHERE tmp.id NOT IN (SELECT id FROM all_user);

脚本:

[root@cdh-manager active_test]# cat insert_distinct_alluser.sh 
#!/bin/bash

echo $1
beeline -u jdbc:hive2://cdh-manager:10000 -n hive -f insert_distinct_alluser.hql --hivevar cal_day=$1

 

计算留存:

这里描述下主要的思路:

通过 全量用户表 选出 30天的增量用户, 与每天去重的用户表 JOIN , JOIN 的字段是用户ID ,  计算时间差值,按照时间差值分组统计。

 

 

有两个脚本,一个较为基础,

 

即原始脚本。帮助理解,无分组函数:

[root@cdh-manager active_test]# cat cal_daily_active2.hql 

use keep_active_test;

SELECT all_user.id, all_user.create_day, daily_active.day, datediff(daily_active.day, all_user.create_day )   
FROM all_user INNER JOIN daily_active
ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09';

结果:

+--------------+----------------------+-------------------+------+
| all_user.id  | all_user.create_day  | daily_active.day  | _c3  |
+--------------+----------------------+-------------------+------+
| 1            | 2019-04-09           | 2019-04-11        | 2    |
| 1            | 2019-04-09           | 2019-04-09        | 0    |
| 1            | 2019-04-09           | 2019-04-12        | 3    |
| 1            | 2019-04-09           | 2019-04-10        | 1    |
| 2            | 2019-04-09           | 2019-04-11        | 2    |
| 2            | 2019-04-09           | 2019-04-09        | 0    |
| 2            | 2019-04-09           | 2019-04-12        | 3    |
| 2            | 2019-04-09           | 2019-04-10        | 1    |
| 3            | 2019-04-09           | 2019-04-11        | 2    |
| 3            | 2019-04-09           | 2019-04-09        | 0    |
| 3            | 2019-04-09           | 2019-04-12        | 3    |
+--------------+----------------------+-------------------+------+
11 rows selected (54.082 seconds)

 

 

最后是实际统计函数:

root@cdh-manager active_test]# cat cal_daily_active.hql 

use keep_active_test;


SELECT datediff(daily_active.day, all_user.create_day) AS day , COUNT(daily_active.id)  AS total 
FROM all_user INNER JOIN daily_active
ON all_user.id = daily_active.id
WHERE all_user.create_day = '2019-04-09'
GROUP BY  datediff(daily_active.day, all_user.create_day);

结果:

+------+--------+
| day  | total  |
+------+--------+
| 0    | 3      |
| 1    | 2      |
| 2    | 3      |
| 3    | 3      |
+------+--------+

 

第一行 : 留存天数

第二行: 留存人数