背景
我们的数据挖掘平台对数据统计有比较迫切的需求,而Spark本身对数据统计已经做了一些工作,希望梳理一下Spark已经支持的数据统计功能,后期再进行扩展。
准备数据
在参考文献6中下载鸢尾花数据,此处格式为iris.data格式,先将data后缀改为csv后缀(不影响使用,只是为了保证后续操作不需要修改)。
数据格式如下:
SepalLength | SepalWidth | PetalLength | PetalWidth | Name |
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.9 | 3 | 1.4 | 0.2 | Iris-setosa |
4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
5 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
数据说明见附录中的鸢尾花数据
。
我们先把数据放到Spark sql数仓中
CREATE TABLE IF NOT EXISTS iris ( SepalLength FLOAT , SepalWidth FLOAT
, PetalLength FLOAT , PetalWidth FLOAT
, Species VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/mnt/disk1/starqiu/iris';
表的分析与统计
Analyze Table语法如下:
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [analyze_option]
Collect statistics about the table that can be used by the query optimizer to find a better plan.
可以看到Spark表的分析可以为spark sql做查询优化,以便得到更好的查询性能。Spark Sql默认使用CBO(基于代价的优化),这在多表join查询时尤其有用。
此处的analyze_option
参数主要分为两类,表统计和列统计。
表统计
表的基本统计信息一般包括记录总数和所占空间。
Table statistics用法如下:
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]
Collect only basic statistics for the table (number of rows, size in bytes).
NOSCAN
Collect only statistics that do not require scanning the whole table (that is, size in bytes).
运行命令ANALYZE TABLE iris COMPUTE STATISTICS;
可以得到表的记录总数和所占空间大小。如果不想全表扫描,加上NOSCAN
关键字,不会全表扫描,但只能得到所占空间大小。
表统计信息的描述命令语法如下:
DESCRIBE [EXTENDED] [db_name.]table_name
Return the metadata of an existing table (column names, data types, and comments). If the table does not exist, an exception is thrown.
EXTENDED
Display detailed information about the table, including parent database, table type, storage information, and properties.
Describe Partition
运行DESCRIBE EXTENDED iris;
,结果如下:
spark-sql> DESCRIBE EXTENDED iris;
SepalLength float NULL
SepalWidth float NULL
PetalLength float NULL
PetalWidth float NULL
Species string NULL
# Detailed Table Information CatalogTable(
Table: `default`.`iris`
Owner: root
Created: Sat Feb 16 17:24:32 CST 2019
Last Access: Thu Jan 01 08:00:00 CST 1970
Type: EXTERNAL
Schema: [StructField(SepalLength,FloatType,true), StructField(SepalWidth,FloatType,true), StructField(PetalLength,FloatType,true), StructField(PetalWidth,FloatType,true), StructField(Species,StringType,true)]
Provider: hive
Properties: [rawDataSize=-1, numFiles=0, transient_lastDdlTime=1550311815, totalSize=0, COLUMN_STATS_ACCURATE=false, numRows=-1]
Statistics: sizeInBytes=3808, rowCount=150, isBroadcastable=false
Storage(Location: hdfs://data126:8020/mnt/disk1/starqiu/iris, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [field.delim=,, serialization.format=,])
Partition Provider: Catalog)
Time taken: 0.112 seconds, Fetched 7 row(s)
通过Statistics:
可以看到表的记录总数是150条,所占空间3808B,约4KB。
列统计
Column statistics用法如下:
ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]
Collect column statistics for the specified columns in addition to table statistics.
Tip
Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure to collect statistics for all columns used by the query.
列统计的描述命令语法如下:
DESCRIBE [EXTENDED][db_name.]table_name column_name
New in version runtime-3.3.
EXTENDED
Display detailed information about the specified columns, including the column statistics collected by the command ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS columnname [column_name, ...].
需要注意的是这个功能在runtime-3.3版本才有的特性,而runtime-3.3封装的是Spark 2.2,会详见文末附录的databricks Runtime版本与Spark版本的对应关系
运行命令ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;
计算指定多列的统计信息,
运行DESCRIBE EXTENDED iris SepalLength;
获取指定一列的统计信息,结果如下:
spark-sql> ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;
Time taken: 4.45 seconds
spark-sql> DESCRIBE EXTENDED iris PetalWidth;
col_name PetalWidth
data_type float
comment NULL
min 0.10000000149011612
max 2.5
num_nulls 0
distinct_count 21
avg_col_len 4
max_col_len 4
histogram NULL
Time taken: 0.104 seconds, Fetched 10 row(s)
目前测试Spark2.2.2不支持该语句,但是Spark2.4.0支持。如果不支持,则可以通过访问hive的元数据库也可以得到这些信息,sql语句如下:
select param_key, param_value
from TABLE_PARAMS tp, TBLS t
where tp.tbl_id=t.tbl_id and tbl_name = 'iris'
and param_key like 'spark.sql.stat%';
以下是PetalWidth列的统计结果,可以看到包含不重复的记录数,空值数,最大值、最小值,平均长度以及最大长度
param_key | param_value |
spark.sql.statistics.colStats.PetalWidth.avgLen | 4 |
spark.sql.statistics.colStats.PetalWidth.distinctCount | 21 |
spark.sql.statistics.colStats.PetalWidth.max | 2.5 |
spark.sql.statistics.colStats.PetalWidth.maxLen | 4 |
spark.sql.statistics.colStats.PetalWidth.min | 0.10000000149011612 |
spark.sql.statistics.colStats.PetalWidth.nullCount | 0 |
spark.sql.statistics.colStats.PetalWidth.version | 1 |
总结
可以看到这些统计信息不仅对了解数据质量非常有用,对使用Spark sql进行查询也能得到优化,进一步提升速度。后续再写一篇CBO如何利用这些信息进行优化。
目前还不清楚Runtime中的Spark功能和开源版的有无差异,但Spark2.4支持表的分析统计操作,建议平台后续项目升级到Spark2.4 。
附录
鸢尾花数据说明
Iris数据集是常用的分类实验数据集,由Fisher, 1936收集整理。Iris也称鸢尾花卉数据集,是一类多重变量分析的数据集。数据集包含150个数据集,分为3类,每类50个数据,每个数据包含4个属性。iris以鸢尾花的特征作为数据来源,常用在分类操作中。该数据集由3种不同类型的鸢尾花的50个样本数据构成。其中的一个种类与另外两个种类是线性可分离的,后两个种类是非线性可分离的。
四个属性:
Sepal.Length(花萼长度),单位是cm;
Sepal.Width(花萼宽度),单位是cm;
Petal.Length(花瓣长度),单位是cm;
Petal.Width(花瓣宽度),单位是cm;
三个种类:
Iris Setosa(山鸢尾);
Iris Versicolour(杂色鸢尾);
Iris Virginica(维吉尼亚鸢尾)。
databricks Runtime
Runtime是databricks 统一分析平台的一部分,官网描述如下:
Accelerate innovation by unifying data science, engineering and business, with the Databricks Unified Analytics Platform, from the original creators of Apache Spark™.
Runtime的描述如下:
Simplify operations and get up to 50x better performance with cloud-optimized Apache Spark™.
可以看到主要是基于云优化来简化操作并提升50倍以上的性能。
id6)
Current Releases
Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
5.2 | Spark 2.4 | Jan 24, 2019 | May 27, 2019 | Sep 30, 2019 |
5.1 | Spark 2.4 | Dec 18, 2018 | Apr 18, 2019 | Aug 19, 2019 |
5.0 | Spark 2.4 | Nov 08, 2018 | Mar 08, 2019 | Jul 08, 2019 |
4.3 | Spark 2.3 | Aug 10, 2018 | Dec 09, 2018 | Apr 09, 2019 |
4.2 | Spark 2.3 | Jul 09, 2018 | Nov 05, 2018 | Mar 05, 2019 |
3.5-LTS | Spark 2.2 | Dec 21, 2017 | Jan 02, 2019 | Jan 02, 2020 |
Marked for Deprecation
Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
4.3 | Spark 2.3 | Aug 10, 2018 | Dec 09, 2018 | Apr 09, 2019 |
4.2 | Spark 2.3 | Jul 09, 2018 | Nov 05, 2018 | Mar 05, 2019 |
3.5-LTS | Spark 2.2 | Dec 21, 2017 | Jan 02, 2019 | Jan 02, 2020 |
Deprecated Releases
Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
4.1 | Spark 2.3 | May 17, 2018 | Sep 17, 2018 | Jan 17, 2019 |
4.0 | Spark 2.3 | Mar 01, 2018 | Jul 01, 2018 | Nov 01, 2018 |
3.4 | Spark 2.2 | Nov 20, 2017 | Mar 31, 2018 | Jul 30, 2018 |
3.3 | Spark 2.2 | Oct 04, 2017 | Mar 31, 2018 | Jul 30, 2018 |
3.2 | Spark 2.2 | Sep 05, 2017 | Jan 30, 2018 | Apr 30, 2018 |
3.1 | Spark 2.2 | Aug 04, 2017 | – | Oct 30, 2017 |
3.0 | Spark 2.2 | Jul 11, 2017 | – | Sep 05, 2017 |
Spark 2.1 (Auto Updating) | Spark 2.1 | Dec 22, 2016 | Mar 31, 2018 | Jul 30, 2018 |
Spark 2.1.1-db6 | Spark 2.1 | Aug 03, 2017 | Mar 31, 2018 | Jul 30, 2018 |
Spark 2.1.1-db5 | Spark 2.1 | May 31, 2017 | – | Aug 03, 2017 |
Spark 2.1.1-db4 | Spark 2.1 | Apr 25, 2017 | Mar 31, 2018 | Jul 30, 2018 |
Spark 2.0 (Auto Updating) | Spark 2.0 | Jul 26, 2016 | Jan 30, 2018 | Apr 30, 2018 |
Spark 2.0.2-db4 | Spark 2.0 | Mar 24, 2017 | Jan 30, 2018 | Apr 30, 2018 |
Spark 1.6.3-db2 | Spark 1.6 | Mar 24, 2017 | Jan 30, 2018 | Jun 30, 2018 |