Background

Our data mining platform has a pressing need for data statistics, and Spark itself already does some of this work. This article surveys the statistics functionality Spark supports today, as a basis for later extensions.

Preparing the Data

Download the iris dataset from reference 6. The file comes as iris.data; rename the .data suffix to .csv (this does not affect its use, it just keeps the later steps unchanged).

The data looks like this:

| SepalLength | SepalWidth | PetalLength | PetalWidth | Name |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 4.9 | 3 | 1.4 | 0.2 | Iris-setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |

See the iris dataset description in the appendix for details.

First, load the data into the Spark SQL warehouse:

CREATE TABLE IF NOT EXISTS iris (
  SepalLength FLOAT,
  SepalWidth  FLOAT,
  PetalLength FLOAT,
  PetalWidth  FLOAT,
  Species     VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/mnt/disk1/starqiu/iris';

Table Analysis and Statistics

The ANALYZE TABLE syntax is:

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [analyze_option]

Collect statistics about the table that can be used by the query optimizer to find a better plan.

As the quote says, analyzing a table gives Spark SQL's query optimizer the information it needs to find a better plan and thus better query performance. These statistics feed the cost-based optimizer (CBO; note that it is controlled by spark.sql.cbo.enabled and is off by default in Spark 2.x), which is especially useful for multi-table join queries.

The analyze_option parameter falls into two categories: table statistics and column statistics.

Table Statistics

Basic table statistics generally include the total record count and the space the table occupies.

Table statistics usage:

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]

Collect only basic statistics for the table (number of rows, size in bytes).

NOSCAN

Collect only statistics that do not require scanning the whole table (that is, size in bytes).

Running ANALYZE TABLE iris COMPUTE STATISTICS; collects both the table's record count and its size. If you want to avoid a full table scan, add the NOSCAN keyword; no scan is performed, but then only the size in bytes is collected.
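The difference between the two statistics explains why NOSCAN can skip the scan: the size in bytes is file metadata, while the row count requires reading every record. A plain-Python sketch (not Spark's implementation) of that distinction, using a temporary CSV file standing in for the table's storage:

```python
import os
import tempfile

# Sample rows standing in for the iris table's underlying CSV storage.
rows = [
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "4.9,3,1.4,0.2,Iris-setosa",
    "4.7,3.2,1.3,0.2,Iris-setosa",
]
path = os.path.join(tempfile.mkdtemp(), "iris.csv")
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")

# NOSCAN-style statistic: size in bytes needs only file metadata.
size_in_bytes = os.path.getsize(path)

# Full-scan statistic: counting rows requires reading every record.
with open(path) as f:
    row_count = sum(1 for _ in f)
```

On a large HDFS-backed table, that full read is exactly the cost the NOSCAN keyword avoids.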

The syntax for describing table statistics is:

DESCRIBE [EXTENDED] [db_name.]table_name

Return the metadata of an existing table (column names, data types, and comments). If the table does not exist, an exception is thrown.

EXTENDED

Display detailed information about the table, including parent database, table type, storage information, and properties.


Running DESCRIBE EXTENDED iris; produces:

spark-sql> DESCRIBE EXTENDED iris;
SepalLength    float    NULL
SepalWidth    float    NULL
PetalLength    float    NULL
PetalWidth    float    NULL
Species    string    NULL
        
# Detailed Table Information    CatalogTable(
    Table: `default`.`iris`
    Owner: root
    Created: Sat Feb 16 17:24:32 CST 2019
    Last Access: Thu Jan 01 08:00:00 CST 1970
    Type: EXTERNAL
    Schema: [StructField(SepalLength,FloatType,true), StructField(SepalWidth,FloatType,true), StructField(PetalLength,FloatType,true), StructField(PetalWidth,FloatType,true), StructField(Species,StringType,true)]
    Provider: hive
    Properties: [rawDataSize=-1, numFiles=0, transient_lastDdlTime=1550311815, totalSize=0, COLUMN_STATS_ACCURATE=false, numRows=-1]
    Statistics: sizeInBytes=3808, rowCount=150, isBroadcastable=false
    Storage(Location: hdfs://data126:8020/mnt/disk1/starqiu/iris, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [field.delim=,, serialization.format=,])
    Partition Provider: Catalog)    
Time taken: 0.112 seconds, Fetched 7 row(s)

The Statistics line shows the table has 150 rows and occupies 3808 bytes, about 4 KB.

Column Statistics

Column statistics usage:

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]

Collect column statistics for the specified columns in addition to table statistics.

Tip

Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure to collect statistics for all columns used by the query.

The syntax for describing column statistics is:

DESCRIBE [EXTENDED] [db_name.]table_name column_name

New in version runtime-3.3.

EXTENDED

Display detailed information about the specified columns, including the column statistics collected by the command ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS columnname [column_name, ...].

Note that this feature first appeared in Databricks Runtime 3.3, which packages Spark 2.2; see the mapping between Databricks Runtime versions and Spark versions in the appendix at the end of this article.

Run ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species; to compute statistics for the listed columns, then DESCRIBE EXTENDED iris PetalWidth; to fetch the statistics of a single column:

spark-sql> ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;
Time taken: 4.45 seconds
spark-sql> DESCRIBE EXTENDED iris PetalWidth;
col_name    PetalWidth
data_type    float
comment    NULL
min    0.10000000149011612
max    2.5
num_nulls    0
distinct_count    21
avg_col_len    4
max_col_len    4
histogram    NULL
Time taken: 0.104 seconds, Fetched 10 row(s)
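The reported min of 0.10000000149011612 rather than 0.1 is not a statistics bug: PetalWidth is declared FLOAT, i.e. 32-bit, and 0.1 has no exact binary representation, so the nearest 32-bit value is stored and later widened back to a double for display. A quick pure-Python check of that round trip:

```python
import struct

# Round-trip 0.1 through a 32-bit float, as the FLOAT column does on write.
as_float32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]
print(as_float32)  # 0.10000000149011612
```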

In our tests, Spark 2.2.2 does not support this statement, while Spark 2.4.0 does. Where it is unsupported, the same information can be read directly from the Hive metastore database with the following SQL:

SELECT param_key, param_value
FROM TABLE_PARAMS tp, TBLS t
WHERE tp.tbl_id = t.tbl_id
  AND tbl_name = 'iris'
  AND param_key LIKE 'spark.sql.stat%';
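To show what that metastore query does, here is a self-contained sketch against an in-memory SQLite database. The TBLS and TABLE_PARAMS stand-ins below are deliberately minimal (the real Hive metastore schema has many more columns); the column and table names follow the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal stand-ins for the Hive metastore tables used by the query.
cur.execute("CREATE TABLE TBLS (tbl_id INTEGER, tbl_name TEXT)")
cur.execute("CREATE TABLE TABLE_PARAMS "
            "(tbl_id INTEGER, param_key TEXT, param_value TEXT)")
cur.execute("INSERT INTO TBLS VALUES (1, 'iris')")
cur.executemany(
    "INSERT INTO TABLE_PARAMS VALUES (1, ?, ?)",
    [
        ("spark.sql.statistics.colStats.PetalWidth.max", "2.5"),
        ("spark.sql.statistics.colStats.PetalWidth.nullCount", "0"),
        ("transient_lastDdlTime", "1550311815"),  # filtered out by the LIKE
    ],
)

# The same join-and-filter query as above: only spark.sql.stat* keys survive.
rows = cur.execute(
    """
    SELECT param_key, param_value
    FROM TABLE_PARAMS tp, TBLS t
    WHERE tp.tbl_id = t.tbl_id
      AND tbl_name = 'iris'
      AND param_key LIKE 'spark.sql.stat%'
    """
).fetchall()
```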

Below are the statistics stored for the PetalWidth column. They include the distinct count, null count, maximum, minimum, average length, and maximum length:

| param_key | param_value |
|---|---|
| spark.sql.statistics.colStats.PetalWidth.avgLen | 4 |
| spark.sql.statistics.colStats.PetalWidth.distinctCount | 21 |
| spark.sql.statistics.colStats.PetalWidth.max | 2.5 |
| spark.sql.statistics.colStats.PetalWidth.maxLen | 4 |
| spark.sql.statistics.colStats.PetalWidth.min | 0.10000000149011612 |
| spark.sql.statistics.colStats.PetalWidth.nullCount | 0 |
| spark.sql.statistics.colStats.PetalWidth.version | 1 |
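The measures themselves are straightforward. The sketch below is a simplified pure-Python illustration of what the column statistics mean, not Spark's implementation (Spark, for instance, estimates distinct counts approximately on large data). For a fixed-width numeric type such as FLOAT, avgLen and maxLen are simply the type's byte width, 4, which matches the output above:

```python
def column_stats(values, type_size=4):
    """Illustrative column statistics, in the spirit of
    ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS.

    values: the column's values, with None marking SQL NULLs.
    type_size: byte width of the column type (4 for FLOAT), which is
    what avg_col_len / max_col_len report for fixed-width columns.
    """
    non_null = [v for v in values if v is not None]
    return {
        "min": min(non_null),
        "max": max(non_null),
        "num_nulls": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "avg_col_len": type_size,
        "max_col_len": type_size,
    }

# A hypothetical sample column with one NULL.
stats = column_stats([0.2, 0.2, 0.4, None, 2.5, 0.1])
```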

Summary

These statistics are useful not only for understanding data quality; they also let Spark SQL optimize queries and run faster. A follow-up article will cover how the CBO uses this information.

It is still unclear whether the Spark bundled in Databricks Runtime differs from the open-source releases, but since open-source Spark 2.4 supports these analysis and statistics operations, we recommend that the platform's future projects upgrade to Spark 2.4.

Appendix

The Iris Dataset

The Iris dataset, collected by Fisher in 1936, is a classic classification benchmark and a standard multivariate analysis dataset. It contains 150 records in 3 classes, 50 records per class, with 4 attributes per record, describing 50 samples from each of 3 iris species. One of the species is linearly separable from the other two; those two are not linearly separable from each other.

The four attributes:

Sepal.Length (sepal length), in cm;

Sepal.Width (sepal width), in cm;

Petal.Length (petal length), in cm;

Petal.Width (petal width), in cm.

The three species:

Iris Setosa;

Iris Versicolour;

Iris Virginica.

databricks Runtime


Databricks Runtime is part of the Databricks Unified Analytics Platform, described on its site as:

Accelerate innovation by unifying data science, engineering and business, with the Databricks Unified Analytics Platform, from the original creators of Apache Spark™.

And Runtime itself as:

Simplify operations and get up to 50x better performance with cloud-optimized Apache Spark™.

In short: cloud-optimized Apache Spark that simplifies operations and claims up to 50x better performance.



Current Releases

| Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
|---|---|---|---|---|
| 5.2 | Spark 2.4 | Jan 24, 2019 | May 27, 2019 | Sep 30, 2019 |
| 5.1 | Spark 2.4 | Dec 18, 2018 | Apr 18, 2019 | Aug 19, 2019 |
| 5.0 | Spark 2.4 | Nov 08, 2018 | Mar 08, 2019 | Jul 08, 2019 |
| 4.3 | Spark 2.3 | Aug 10, 2018 | Dec 09, 2018 | Apr 09, 2019 |
| 4.2 | Spark 2.3 | Jul 09, 2018 | Nov 05, 2018 | Mar 05, 2019 |
| 3.5-LTS | Spark 2.2 | Dec 21, 2017 | Jan 02, 2019 | Jan 02, 2020 |

Marked for Deprecation

| Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
|---|---|---|---|---|
| 4.3 | Spark 2.3 | Aug 10, 2018 | Dec 09, 2018 | Apr 09, 2019 |
| 4.2 | Spark 2.3 | Jul 09, 2018 | Nov 05, 2018 | Mar 05, 2019 |
| 3.5-LTS | Spark 2.2 | Dec 21, 2017 | Jan 02, 2019 | Jan 02, 2020 |

Deprecated Releases

| Version | Spark Version | Release Date | Deprecation Announcement | Deprecation Date |
|---|---|---|---|---|
| 4.1 | Spark 2.3 | May 17, 2018 | Sep 17, 2018 | Jan 17, 2019 |
| 4.0 | Spark 2.3 | Mar 01, 2018 | Jul 01, 2018 | Nov 01, 2018 |
| 3.4 | Spark 2.2 | Nov 20, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| 3.3 | Spark 2.2 | Oct 04, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| 3.2 | Spark 2.2 | Sep 05, 2017 | Jan 30, 2018 | Apr 30, 2018 |
| 3.1 | Spark 2.2 | Aug 04, 2017 |  | Oct 30, 2017 |
| 3.0 | Spark 2.2 | Jul 11, 2017 |  | Sep 05, 2017 |
| Spark 2.1 (Auto Updating) | Spark 2.1 | Dec 22, 2016 | Mar 31, 2018 | Jul 30, 2018 |
| Spark 2.1.1-db6 | Spark 2.1 | Aug 03, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| Spark 2.1.1-db5 | Spark 2.1 | May 31, 2017 |  | Aug 03, 2017 |
| Spark 2.1.1-db4 | Spark 2.1 | Apr 25, 2017 | Mar 31, 2018 | Jul 30, 2018 |
| Spark 2.0 (Auto Updating) | Spark 2.0 | Jul 26, 2016 | Jan 30, 2018 | Apr 30, 2018 |
| Spark 2.0.2-db4 | Spark 2.0 | Mar 24, 2017 | Jan 30, 2018 | Apr 30, 2018 |
| Spark 1.6.3-db2 | Spark 1.6 | Mar 24, 2017 | Jan 30, 2018 | Jun 30, 2018 |