一、背景
某电商平台为了合理的投入人力物力创造更大的销售利润,现对已有的销售数据进行用户分析,提出合理的促销计划。
围绕产品和用户两大方面展开为电商平台制定策略提供分析及建议。
二、需求
用户分析:从性别、年龄、 职业、城市、居住年限,婚姻状况等维度找到高质量用户,并查看高质量用户人群的占比,为其提供高价值消费品 (定位高价值消费品以销售金额评估)。针对其他的用户,主要引导用户进行购买,多推荐一些热销的商品(定位热销产品)
产品分析:从销量、销售额都高的产品并以二八法则找到高贡献的一级产品类目
三、数据介绍
假定每条记录为一单
--建表
create table model2_datas
(User_ID int,
Product_ID string,
Gender string,
Age string,
Occupation int,
City_Category string,
Stay_In_Current_City_Years string,
Marital_Status int,
Product_Category_1 int,
Product_Category_2 int,
Product_Category_3 int,
Purchase double
)
row format delimited fields terminated by ',' --指定分隔符csv 为逗号分割
tblproperties(
"skip.header.line.count"="1" --跳过⽂件⾏⾸1⾏
);
--装载数据
load data local inpath '/home/hadoop/datas/model2_datas.csv' overwrite into
table model2_datas
四、 使用方法
本案例是在Hadoop分布式集群环境中完成,主要用到hive基础函数和开窗函数。
Hadoop一般有三个重要部分组成,分别是分布式文件系统(HDFS)、分布式计算(MapReduce)、数据仓库工具(Hive)。
- HDFS(Hadoop Distributed Fill System) 是Hadoop 项目的子项目,使用多台计算机存储文件,并且提供统一的访问接口(NameNode),像是访问一个普通文件系统一样使用分布式文件系统。
- MapReduce 是一种分布式并行编程框架
MapReduce 策略: 分而治之
MapReduce 理念: 计算向数据靠拢而不是数据向计算靠拢
Mapreduce工作流程 - Hive是Facebook为了解决海量日志数据的统计分析而开发的基于Hadoop的一个数据仓库工具(后来开源给了Apache软件基金会),可以将结构化的数据文件映射为一张数据库表,并提供类SQL查询功能:HQL。
Hive特点:
Hive 本身并不支持数据存储和处理,只是一个面向用户的编程接口
Hive 依赖分布式文件系统HDFS存储数据
Hive 依赖分布式并行计算模型MapReduce 处理数据
Hive架构:
1.用户接口:Client CLI(hive shell)、JDBC/ODBC(java 访问 hive)、WEBUI(浏览器访问 hive)
2.元数据:Metastore 元数据包括:表名、表所属的数据库(默认是 default)、表的拥有者、列/分区字段、表的类型(是否是外部表)、表的数据所在目录等;
3.Hadoop 使用 HDFS 进行存储,使用 MapReduce 进行计算。
4.驱动器:Driver
(1)解析器(SQL Parser):将 SQL 字符串转换成抽象语法树 AST,这一步一般都用第三方工具库完成,比如 antlr;对 AST 进行语法分析,比如表是否存在、字段是否存在、SQL语义是否有误。
(2)编译器(Physical Plan):将 AST 编译生成逻辑执行计划。
(3)优化器(Query Optimizer):对逻辑执行计划进行优化。
(4)执行器(Execution):把逻辑执行计划转换成可以运行的物理计划。对于 Hive 来说,就是MR/Spark。
Hive 通过给用户提供的一系列交互接口,接收到用户的指令(SQL),使用自己的 Driver,结合元数据(MetaStore),将这些指令翻译成 MapReduce,提交到 Hadoop 中执行,最后,将执行返回的结果输出到用户交互接口。
Hive 的数据模型主要有以下四种:
五、 需求实现
- 查询订单整体的消费情况(包括:总销售额、人均消费、平均每单消费)
select
round(sum(purchase),2) totle_sales,
round(sum(purchase)/count(distinct user_id),2) per_cost,
round(sum(purchase)/count(*),2) avg_unit_cost
from
model2_datas;
==>
totle_sales per_cost avg_unit_cost
5.017668378E9 851751.55 9333.86
- 用户分析(找到高质量人群)
统计各性别消费情况(字段包含性别、人数、人数占比、人均消费、消费金额、消费占比) 并以消费占比降序
select
*,
concat(round((num_gender/sum(num_gender) over())*100,2),"%") as num_rate,
concat(round((cost_gender/sum(cost_gender) over())*100,2),"%") as cost_rate
from
(
select
gender,
count(distinct user_id) num_gender,
round(sum(purchase)/count(distinct user_id),2) avg_cost_gender,
round(sum(purchase),2) cost_gender
from
model2_datas
group by
gender
) a
order by
cost_rate desc;
==>:
a.gendera.num_gender a.avg_cost_gender a.cost_gender num_ratecost_rate
M 4225 911963.16 3.853044357E9 71.72% 76.79%
F 1666 699054.03 1.164624021E9 28.28% 23.21%
统计各年龄段消费情况(字段包含年龄段 、 人数、 人数占比、消费金额、 人均消费、消费占比)并以消费占比降序
select
*,
concat(round((num_age/sum(num_age) over())*100,2),"%") as num_rate,
concat(round((cost_age/sum(cost_age) over())*100,2),"%") as cost_rate
from
(
select
age,
count(distinct user_id) num_age,
round(sum(purchase)/count(distinct user_id),2) avg_cost_age,
round(sum(purchase),2) cost_age
from
model2_datas
group by
age
) a
order by
cost_rate desc;
==>:
a.age a.num_age a.avg_cost_age a.cost_age num_rate cost_rate
46-50 531 778565.39 4.13418223E8 9.01% 8.24%
51-55 481 752408.22 3.61908356E8 8.16% 7.21%
26-35 2053 974061.91 1.999749106E9 34.85% 39.85%
55+ 372 531222.69 1.97614842E8 6.31% 3.94%
36-45 1167 866023.62 1.010649565E9 19.81% 20.14%
0-17 218 608527.55 1.32659006E8 3.7% 2.64%
18-25 1069 843469.86 9.0166928E818.15% 17.97%
统计各职业消费情况(字段包含职业 、人数、人数占比、消费金额、人均消费、消费占比) 并以消费占比降序
select
*,
concat(round((num_o/sum(num_o) over())*100,2),"%") as num_rate,
concat(round((cost_o/sum(cost_o) over())*100,2),"%") as cost_rate
from
(
select
occupation,
count(distinct user_id) num_o,
round(sum(purchase)/count(distinct user_id),2) avg_cost_o,
round(sum(purchase),2) cost_o
from
model2_datas
group by
occupation
) a
sort by
cost_rate desc;
==>:
a.occupation a.num_oa.avg_cost_oa.cost_o num_rate cost_rate
1 517 801843.0 4.14552829E8 8.78% 8.26%
17 491 788676.89 3.87240355E8 8.33% 7.72%
12 376 799659.85 3.00672105E8 6.38% 5.99%
20 273 1070611.67 2.92276985E8 4.63% 5.82%
14 294 869369.88 2.55594745E8 4.99% 5.09%
16 235 997626.94 2.3444233E83.99% 4.67%
2 256 911232.0 2.33275393E8 4.35% 4.65%
6 228 811691.65 1.85065697E8 3.87% 3.69%
3 170 943696.76 1.6042845E82.89% 3.2%
15 140 832428.76 1.16540026E8 2.38% 2.32%
10 192 595176.84 1.14273954E8 3.26% 2.28%
5 111 1013741.94 1.12525355E8 1.88% 2.24%
11 128 823729.37 1.05437359E8 2.17% 2.1%
4 740 888554.59 6.57530393E8 12.56% 13.1%
0 688 909614.55 6.25814811E8 11.68% 12.47%
7 669 821050.44 5.49282744E8 11.36% 10.95%
19 71 1029795.62 7.3115489E71.21% 1.46%
13 140 508112.46 7.1135744E72.38% 1.42%
18 67 899249.34 6.0249706E71.14% 1.2%
9 88 609310.33 5.3619309E71.49% 1.07%
8 17 858505.82 1.4594599E70.29% 0.29%
统计各婚姻状况消费情况(字段包含婚姻状况 、人数、人数占比、消费金额、人均消费、消费占比) 并以消费占比降序
select
*,
concat(round((num_m/sum(num_m) over())*100,2),"%") as num_rate,
concat(round((cost_m/sum(cost_m) over())*100,2),"%") as cost_rate
from
(
select
marital_status,
count(distinct user_id) num_m,
round(sum(purchase)/count(distinct user_id),2) avg_cost_m,
round(sum(purchase),2) cost_m
from
model2_datas
group by
marital_status
) a
sort by
cost_rate desc;
==>:
a.marital_status a.num_m a.avg_cost_m a.cost_m num_rate cost_rate
0 3417 868097.6 2.9662895E958.0% 59.12%
1 2474 829174.97 2.051378878E9 42.0% 40.88%
依据以上查询结果找到高质量人群:
性别为男, 年龄段为 26-35, 职业为 0,4,7, 婚姻状况为 0
- 产品分析
查询出订单量TOP10的产品 (包含字段 排名编号、商品ID、订单量、订单量占比 )按订单量降序显示
select
rank() over(order by sales desc) as rank,
*,
concat(round((sales/sum(sales) over())*100,2),"%") as sales_rate
from
(
select
product_id,
count(*) as sales
from
model2_datas
group by
product_id
) a
order by
sales desc
limit 10;
==>:
ranka.product_id a.sales sales_rate
1 P00265242 1858 0.35%
2 P00110742 1591 0.3%
3 P00025442 1586 0.3%
4 P00112142 1539 0.29%
5 P00057642 1430 0.27%
6 P00184942 1424 0.26%
7 P00046742 1417 0.26%
8 P00058042 1396 0.26%9 P00145042 1384 0.26%
9 P00059442 1384 0.26%
查询出销售额TOP10的产品 (包含字段 排名编号、商品ID、销售额、销售额占比 )
select
rank() over(order by cost desc) as rank,
*,
concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate
from
(
select
product_id,
sum(purchase) as cost
from
model2_datas
group by
product_id
) a
order by
cost desc
limit 10;
==>:
ranka.product_id a.cost cost_rate
1 P00025442 2.7532426E70.55%
2 P00110742 2.6382569E70.53%
3 P00255842 2.4652442E70.49%
4 P00184942 2.4060871E70.48%
5 P00059442 2.3948299E70.48%
6 P00112142 2.3882624E70.48%
7 P00110942 2.3232538E70.46%
8 P00237542 2.3096487E70.46%
9 P00057642 2.249369E7 0.45%
10 P00010742 2.1865042E70.44%
统计各一级产品类目的订单量、销售额、订单量占比、销售额占比、累计销售额占比 (以销售额占比降序),并根据查询结果找到累计销售额达到20%的几个一级产品类目
select
*,
concat(round((Accu_cost/sum(cost) over())*100,2),"%") as Accu_cost_rate
from
(
select
*,
concat(round((sales/sum(sales) over())*100,2),"%") as sales_rate,
concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate,
sum(cost) over(order by cost desc rows between unbounded preceding and current row) Accu_cost
from
(
select
Product_Category_1,
count(*) as sales,
sum(purchase) as cost
from
model2_datas
group by
Product_Category_1
) as a
) as b
group by
Product_Category_1
order by
cost_rate desc
==>:
a.product_category_1 a.sales a.cost sales_rate cost_rate accu_cost
6 20164 3.19355286E8 3.75% 6.36% 3.969632502E9
2 23499 2.64497242E8 4.37% 5.27% 4.234129744E9
1 138353 1.882666325E9 25.74% 37.52% 1.882666325E9
3 19849 2.00412211E8 3.69% 3.99% 4.434541955E9
16 9697 1.43168035E8 1.8% 2.85% 4.57770999E911 23960 1.12203088E8 4.46% 2.24% 4.689913078E9
5 148592 9.26917497E8 27.64% 18.47% 2.809583822E9
8 112132 8.40693394E8 20.86% 16.75% 3.650277216E9
10 5032 9.9029631E70.94% 1.97% 4.788942709E9
15 6203 9.1658147E71.15% 1.83% 4.880600856E9
7 3668 6.0059209E70.68% 1.2% 4.940660065E9
4 11567 2.6937957E72.15% 0.54% 4.967598022E9
14 1500 1.9718178E70.28% 0.39% 4.9873162E9
18 3075 9149071.0 0.57% 0.18% 4.996465271E9
9 404 6277472.0 0.08% 0.13% 5.002742743E9
17 567 5758702.0 0.11% 0.11% 5.008501445E9
12 3875 5235883.0 0.72% 0.1% 5.013737328E9
13 5440 3931050.0 1.01% 0.08% 5.017668378E9
- 细化分析
查询出各性别销售额TOP10 产品 ( 字段包含商品ID、订单量、销售额、销售额占比、类别1、类别2、类别3)
select
*
from
(
select
*,
rank() over(partition by gender order by cost desc) as rank,
concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate
from
(
select
gender,
product_id,
Product_Category_1,
Product_Category_2,
Product_Category_3,
count(*) sales,
sum(purchase) as cost
from model2_datas
group by
gender,
product_id,
Product_Category_1,
Product_Category_2,
Product_Category_3
) a
)b
where rank<11
order by
gender,
cost desc;
==>:
b.genderb.product_id b.product_category_1 b.product_category_2
b.product_category_3b.sales b.cost b.rank b.cost_rate
F P00255842 16 NULL NULL 366 6690088.0 1 0.13%
F P00059442 6 8 16 350 6007826.0 2 0.12%
F P00110842 1 2 5 351 5933348.0 3 0.12%
F P00025442 1 2 9 341 5763524.0 4 0.11%
F P00110742 1 2 8 357 5632357.0 5 0.11%
F P00110942 1 2 NULL 305 5066142.0 6 0.1%
F P00148642 6 10 13 300 5049905.0 7 0.1%
F P00112142 1 2 14 332 4901047.0 8 0.1%
F P00028842 6 8 NULL 289 4867128.0 9 0.1%
F P00184942 1 8 17 293 4723224.0 10 0.09%
M P00025442 1 2 9 1245 2.1768902E71 0.43%
M P00110742 1 2 8 1234 2.0750212E72 0.41%
M P00184942 1 8 17 1131 1.9337647E73 0.39%
M P00112142 1 2 14 1207 1.8981577E74 0.38%
M P00057642 1 15 16 1174 1.872036E7 5 0.37%
M P00237542 1 15 16 1092 1.8562039E76 0.37%
M P00110942 1 2 NULL 1031 1.8166396E77 0.36%
M P00255842 16 NULL NULL 988 1.7962354E78 0.36%
M P00059442 6 8 16 1034 1.7940473E79 0.36%
M P00010742 1 8 17 1056 1.7517618E710 0.35%
查询高质量用户群年龄段的订单量TOP10产品( 字段包含商品ID、订单量、销售额、订单量占比、销售额占比、类别1、类别2、类别3)
select
*
from
(
select
rank() over(order by sales desc) rank,
*,
concat(round((sales/sum(sales) over())*100,2),"%") as sales_rate,
concat(round((cost/sum(cost) over())*100,2),"%") as cost_rate
from
(
select
age,
product_id,
count(*) sales,
sum(purchase) cost,
Product_Category_1,
Product_Category_2,
Product_Category_3
from
model2_datas
where
age="26-35"
group by
age,
product_id,
Product_Category_1,
Product_Category_2,
Product_Category_3
) a
order by
sales desc
) b
where
rank < 11;
==>:
b.rank b.age b.product_id b.sales b.cost b.product_category_1
b.product_category_2 b.product_category_3 b.sales_rate b.cost_rate
10 26-35 P00255842 537 9860878.0 16 NULL NULL 0.250.49%
9 26-35 P00046742 544 7886516.0 1 2 15 0.250.39%
8 26-35 P00237542 568 9697110.0 1 15 16 0.260.48%
7 26-35 P00184942 570 9493975.0 1 8 17 0.270.47%
6 26-35 P00057642 581 9110947.0 1 15 16 0.270.46%
5 26-35 P00058042 584 5116374.0 8 16 NULL 0.270.26%
4 26-35 P00025442 595 1.0594786E71 2 9 0.280.53%
3 26-35 P00112142 597 9258356.0 1 2 14 0.280.46%
2 26-35 P00110742 627 1.0605442E71 2 8 0.290.53%
1 26-35 P00265242 732 5546869.0 5 8 NULL 0.340.28%