Spark-Benchmark

Benchmark Data

col_name            	data_type           
username            	string
name                	string
blood_group         	string
company             	string
birthdate           	string
sex                 	string
job                 	string
ssn                 	string
mail                	string

Test Method:

Fake data is written through Mobius in batches of 100K rows. For each cluster environment we record the write time, the merge time, and the time of common aggregate queries.

Query timing: each query is run 3 times and the results are averaged.

The machine writing the data is on the same network as the cluster, so all data transfer goes over the internal network.
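The fake records can be generated with a small stdlib sketch matching the schema above. This is illustrative only: the field contents, helper names, and batch API below are assumptions; the original setup presumably used a Faker-style generator feeding Mobius.

```python
import random
import string

# Columns of the contact table (all string-typed, per the schema above).
FIELDS = ["username", "name", "blood_group", "company",
          "birthdate", "sex", "job", "ssn", "mail"]
BLOOD_GROUPS = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]

def fake_row(rng):
    """Produce one fake contact record; field contents are placeholders."""
    user = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "username": user,
        "name": user.title(),
        "blood_group": rng.choice(BLOOD_GROUPS),
        "company": "co_" + "".join(rng.choices(string.ascii_lowercase, k=5)),
        "birthdate": f"19{rng.randint(50, 99)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
        "sex": rng.choice(["M", "F"]),
        "job": "job_" + str(rng.randint(0, 999)),
        "ssn": "".join(rng.choices(string.digits, k=9)),
        "mail": user + "@example.com",
    }

def batches(total_rows, batch_size=100_000, seed=0):
    """Yield fake rows in batches of 100K, the write granularity used in the test."""
    rng = random.Random(seed)
    for start in range(0, total_rows, batch_size):
        n = min(batch_size, total_rows - start)
        yield [fake_row(rng) for _ in range(n)]
```

Each yielded batch would then be handed to the Mobius write path; the timing numbers below are for the write, not the generation.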

Key Statements

SQL

GroupBy

select blood_group from contact group by blood_group

Distinct

select distinct(blood_group) from contact

Count(Distinct)

select sex, count(distinct(username)) from contact group by sex

Where

select sex, count(distinct(username)) from contact where blood_group in ('A+', 'B-', 'AB+') group by sex

UDF

select count(merge(blood_group, sex)) from contact
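The merge UDF is project-specific (Mobius); its implementation is not shown in this document. As an illustrative assumption only, a UDF of this shape combines two string columns into a single value:

```python
def merge(blood_group: str, sex: str) -> str:
    """Hypothetical stand-in for the Mobius `merge` UDF: combine two
    string columns into one value (assumed semantics, not the real code)."""
    return f"{blood_group}_{sex}"
```

In Spark 2.x a Python function like this could be registered for SQL use via `spark.udf.register("merge", merge)` before running the query above.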

CTAS 1 (full table copy)

create table contact2 as select * from contact

CTAS 2 (filtered copy)

create table contact2 as select * from contact where sex='M'

Test Results

Offline Cluster

Runtime environment

Driver Mem: 8G, 50 Cores, Executor Mem: 20G, Node Count: 5

Data Operation Performance

| Rows | Table | File size | Write time | Merge | CTAS 1 | CTAS 2 |
|---|---|---|---|---|---|---|
| 10M | contact01 | 1.1G | 5min | 26s | 7.7s | 9.6s |
| 100M | contact | 11.1G | 83.3min | 4.3min | 1.3min | 1.2min |
| 1B | contact10 | 110.7G | 8.3h | 44min | 8.5min | 4.7min |

Query Performance (Spark 1.3.1, 2015-05-18)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 1s | 2s | 2.1s | 6.6s | 5s | 4s |
| 100M | 4.6s | 9.2s | 9.2s | 43s | 36s | 29s |
| 1B | 1.2min | 1.4min | 1.2min | 7.6min | 5.6min | 2.6min |

Query Performance (Spark 1.4.0, 2015-07-03)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.49s | 1.3s | 1.36s | 7.26s | 5.33s | 2.74s |
| 100M | 2.48s | 9.06s | 8.48s | 48.58s | 36.94s | 25.9s |
| 1B | 1.2min | 1.3min | 1.3min | 7.4min | 6min | 4.6min |

Query Performance (Spark 1.4.1, 2015-07-17)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 1.81s | 1.45s | 1.19s | 6.66s | 3.75s | 2.63s |
| 100M | 5.88s | 9.01s | 9.65s | 49.06s | 38.42s | 17.62s |
| 1B | 1.26min | 1.4min | 1.4min | 7.6min | 5.9min | 2.8min |

Parquet format

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.48s | 0.81s | 0.83s | 5.49s | 3.2s | 2.13s |
| 100M | 0.63s | 1.26s | 1.41s | 34.52s | 15.32s | 2.82s |
| 1B | 2.48s | 7.06s | 8.21s | 315.84s | 142.91s | 20.31s |

Query Performance (Spark 1.5.2, 2015-11-26)

Parquet

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.5s | 0.81s | 0.83s | 5.3s | 3.04s | 1.55s |
| 100M | 0.58s | 1.29s | 1.23s | 30.71s | 14.32s | 2.92s |
| 1B | 1.76s | 7.32s | 7.35s | 259.33s | 122.62s | 18.71s |

Tungsten disabled

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.63s | 0.99s | 0.77s | 6.02s | 4.17s | 1.88s |
| 100M | 0.67s | 1.35s | 1.34s | 12.12s | 6.56s | 9.52s |
| 1B | 2.63s | 8.03s | 7.69s | 63.57s | 35.58s | 42.77s |

Query Performance (Spark 1.6.0, 2016-01-07)

Parquet

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.63s | 0.99s | 0.77s | 6.02s | 4.17s | 1.88s |
| 100M | 0.67s | 1.35s | 1.34s | 12.12s | 6.56s | 9.52s |
| 1B | 2.63s | 8.03s | 7.69s | 63.57s | 35.58s | 42.77s |

Query Performance (Spark 1.6.1, 2016-05-19)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.34s | 0.6s | 0.59s | 3.32s | 2.17s | 1.54s |
| 100M | 0.54s | 1.14s | 1.16s | 8.48s | 6.66s | 3.51s |
| 1B | 2.77s | 7.85s | 7.87s | 73.42s | 42.96s | 29.31s |

Query Performance (Spark 1.6.2, 2016-07-09)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.42s | 0.56s | 0.48s | 2.37s | 1.39s | 0.82s |
| 100M | 0.56s | 0.98s | 0.96s | 8.28s | 4.77s | 2.87s |
| 1B | 2.55s | 6.27s | 5.85s | 67.11s | 35.9s | 20.66s |

Query Performance (Spark 1.6.2, 2016-09-27)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.4s | 0.57s | 0.43s | 10.17s | 4.83s | 2.94s |
| 100M | 0.39s | 0.89s | 0.91s | 31.53s | 13.6s | 10.57s |
| 1B | 1.83s | 4.62s | 4.81s | 67.9s | 28.55s | 64.32s |

Query Performance (Spark 2.1.0, 2017-01-02)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.67s | 0.5s | 0.46s | 2.76s | 1.47s | 1.6s |
| 100M | 0.24s | 0.73s | 0.69s | 7.74s | 5.05s | 5.66s |
| 1B | 0.65s | 3.0s | 2.97s | 46.51s | 25.45s | 50.23s |

Query Performance (Spark 2.2.0, 2017-07-09)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.21s | 0.33s | 0.33s | 1.88s | 1.08s | 1.7s |
| 100M | 0.22s | 0.71s | 0.6s | 6.84s | 4.31s | 5.56s |
| 1B | 0.46s | 3.43s | 3.29s | 50.38s | 25.55s | 43.93s |

Offline Cluster

Runtime environment

Driver Mem: 8G, 50 Cores, Executor Mem: 20G, Node Count: 5

Query Performance (Spark 2.2.0, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 5.67s | 2.89s | 0.83s | 8.74s | 3.88s | 5.54s |
| 100M | 1.3s | 0.32s | 0.36s | 0.48s | 0.4s | 0.38s |
| 1B | 1.08s | 3.0s | 2.57s | 250.25s | 116.11s | 48.98s |

Query Performance (Spark 2.3.0, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 3.02s | 1.22s | 0.45s | 8.22s | 3.19s | 2.89s |
| 100M | 0.23s | 0.2s | 0.2s | 0.27s | 0.26s | 0.3s |
| 1B | 0.59s | 3.54s | 3.4s | 270.14s | 111.44s | 49.32s |

Query Performance (Spark 2.4.6, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.41s | 0.51s | 0.48s | 8.29s | 3.07s | 3.3s |
| 100M | 0.2s | 0.21s | 0.22s | 0.25s | 0.24s | 0.23s |
| 1B | 0.44s | 3.26s | 3.43s | 257.84s | 113.04s | 54.31s |

Online Cluster (bdp-192)

Runtime environment

Driver Mem: 25G, 216 Cores, Executor Mem: 18G, Node Count: 5

Query Performance (Spark 2.3.0, 2020-08-04)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.54s | 0.49s | 0.35s | 1.32s | 0.97s | 1.01s |
| 100M | 1.02s | 0.98s | 0.76s | 3.65s | 2.42s | 3.71s |
| 1B | 3.53s | 2.9s | 2.79s | 17.37s | 11.25s | 27.25s |

Query Performance (Spark 2.4.6, 2020-08-04)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.58s | 0.51s | 0.37s | 1.33s | 0.92s | 0.73s |
| 100M | 0.45s | 0.85s | 0.81s | 3.77s | 2.68s | 4.89s |
| 1B | 1.28s | 2.87s | 2.79s | 18.12s | 12.13s | 33.11s |

New Cluster Performance Test

spark-extra.conf

spark.scheduler.mode FAIR

spark.executor.extraJavaOptions -Dtag=mobius.query.spi-wdb8 -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:PermSize=256M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:G1HeapRegionSize=1m -Xloggc:./gc.log -verbose:gc

spark.memory.fraction 0.8

spark.memory.storageFraction 0.2



spark.speculation.quantile 0.92

spark.speculation.multiplier 2

spark.speculation.interval 200ms

spark.executor.cores 36



spark.driver.maxResultSize 2048m

spark.sql.shuffle.partitions 180

spark.speculation true

spark.kryoserializer.buffer.max 512m

spark.shuffle.consolidateFiles true

spark.sql.autoBroadcastJoinThreshold -1

spark.sql.mergeSchema.parallelize 2

spark.locality.wait 100



spark.cleaner.ttl 3600

spark.default.parallelism 180



spark.sql.adaptive.enabled true

spark.sql.adaptive.shuffle.targetPostShuffleInputSize 1024880

spark.sql.adaptive.minNumPostShufflePartitions 4



spark.ui.retainedJobs 1000

spark.ui.retainedStages 2000

spark.sql.crossJoin.enabled true

New machines, Spark 1.6.2: 5 nodes, 180 cores total (36 cores, 18G memory per node); driver: 10 cores, 25G memory (2016-09-30)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.23s | 0.31s | 0.31s | 1.11s | 0.82s | 1.21s |
| 100M | 0.99s | 1.05s | 0.87s | 4.94s | 3.25s | 1.89s |
| 1B | 1.2s | 2.93s | 3.36s | 22.51s | 12.39s | 8.27s |

New machines, Spark 2.2.0: 5 nodes, 180 cores total (36 cores, 18G memory per node); driver: 10 cores, 25G memory (2017-11-01)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.48s | 0.53s | 0.34s | 1.37s | 0.61s | 0.82s |
| 100M | 0.45s | 0.98s | 0.56s | 4.44s | 2.35s | 3.77s |
| 1B | 0.61s | 1.51s | 1.23s | 18.41s | 10.74s | 26.82s |

Online Cluster

Runtime environment

Driver Mem: 25G, 40 Cores, Executor Mem: 4G, Node Count: 13

Data Operation Performance

| Rows | Table | File size | Write time | Merge | CTAS 1 | CTAS 2 |
|---|---|---|---|---|---|---|
| 10M | contact01 | 1.1G | 5min | 27s | 7.2s | 4.4s |
| 100M | contact | 11.1G | 83.3min | 4min | 63s | 40s |
| 1B | contact10 | 110.7G | 8.3h | 54min | 10min | 5min |

Query Performance (Spark 1.3.1, 13 nodes, 2015-05-18)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 1.6s | 1.7s | 1.4s | 5s | 3.8s | 2.1s |
| 100M | 2.6s | 6.7s | 6.8s | 50s | 29s | 12s |
| 1B | 32s | 52s | 51s | 7.4min | 4.5min | 99s |

Query Performance (Spark 1.3.1, 51 nodes, 2015-06-06)

Driver Mem: 25G, 150 Cores, Executor Mem: 6G, Node Count: 51

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.8s | 1.5s | 1.4s | 4.6s | 3.2s | 1.8s |
| 100M | 1.8s | 4s | 3.2s | 35s | 14s | 4.3s |
| 1B | 8s | 16s | 17s | 4.4min | 2.1min | 27s |

Query Performance (Spark 1.6.1, 59 nodes, 2016-05-13)

Driver Mem: 25G, 150 Cores, Executor Mem: 6G

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.62s | 0.75s | 0.78s | 2.5s | 1.88s | 0.92s |
| 100M | 0.66s | 1.52s | 1.32s | 6.2s | 4.85s | 3.57s |
| 1B | 2.5s | 3.64s | 3.26s | 17.75s | 9.67s | 19.15s |

create table contact50w as SELECT * FROM contact01 TABLESAMPLE (500000 ROWS)

Spark Concurrency Test

Test SQL: select blood_group from contact group by blood_group

Run against the online cluster; Spark node count: 13.

Results (times in ms):

| Concurrent queries | Min | Avg | Max |
|---|---|---|---|
| 1 | 7434 | 7499 | 8197 |
| 5 | 7557 | 10698 | 16514 |
| 10 | 18843 | 21328 | 37128 |
| 20 | 35113 | 39487 | 42407 |
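A concurrency run like the one above can be reproduced with a small harness. This is a sketch: `run_query` is a placeholder for submitting the test SQL to the cluster's SQL endpoint, not an actual client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def run_query():
    """Placeholder for issuing the test SQL against the Spark SQL endpoint."""
    time.sleep(0.01)  # stand-in for real query latency

def benchmark(concurrency):
    """Fire `concurrency` queries at once; report min/avg/max latency in ms."""
    def timed():
        start = time.perf_counter()
        run_query()
        return (time.perf_counter() - start) * 1000
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed) for _ in range(concurrency)]
        latencies = [f.result() for f in futures]
    return min(latencies), mean(latencies), max(latencies)

for c in (1, 5, 10, 20):
    lo, avg, hi = benchmark(c)
    print(f"{c:>2} concurrent: min={lo:.0f}ms avg={avg:.0f}ms max={hi:.0f}ms")
```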

Real-Time Join Query Test

Test method: ab -n 20 -c 1 [url], i.e. execute 20 query requests at concurrency 1 and take the average.

Two tables are left-joined and SELECT a,SUM(b) FROM temp GROUP BY a LIMIT 10000 is executed against the result, comparing query performance with and without materializing the joined table.

The row counts below refer to the size of the joined result.
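In plain terms, "materialized" means the left join is computed once and stored, so each query only scans the joined result, while "non-materialized" re-evaluates the join on every request. A minimal pure-Python sketch of the difference (table contents are illustrative):

```python
from collections import defaultdict

left = [("x", 1), ("x", 2), ("y", 3)]   # rows of (a, b)
right = {"x": "meta_x", "y": "meta_y"}  # lookup side of the left join

def left_join():
    """Evaluate the join; non-materialized queries pay this cost per request."""
    return [(a, b, right.get(a)) for a, b in left]

materialized = left_join()  # computed once, reused by every query

def sum_b_group_by_a(rows):
    """SELECT a, SUM(b) FROM temp GROUP BY a"""
    sums = defaultdict(int)
    for a, b, _ in rows:
        sums[a] += b
    return dict(sums)

# Non-materialized: join + aggregate per query.
assert sum_b_group_by_a(left_join()) == {"x": 3, "y": 3}
# Materialized: aggregate only.
assert sum_b_group_by_a(materialized) == {"x": 3, "y": 3}
```

This matches the numbers below: materialization has overhead at small sizes but wins once the join itself dominates query time.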

| Joined rows | Non-materialized | Materialized |
|---|---|---|
| 1,000 | 450.216ms | 531.483ms |
| 200K | 1282.255ms | 674.580ms |
| 10M | 6869.201ms | 2634.941ms |

BDP-Benchmark

Results obtained with the bdp_benchmark.py script under tools in tassadar, sending requests one by one to mobius on bdp-192.
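The benchmark query names map to common BI metrics. Assuming LinkRelativeRatio/DayOnDayBasis denote period-over-period ratios of a daily aggregate (the script's actual SQL is not shown here, so this is an assumption), the computation amounts to:

```python
def day_on_day(daily_sums):
    """Ratio of each day's total to the previous day's total (assumed
    meaning of the DayOnDayBasis / LinkRelativeRatio benchmark queries)."""
    days = sorted(daily_sums)
    return {d2: daily_sums[d2] / daily_sums[d1]
            for d1, d2 in zip(days, days[1:])}

totals = {"2020-08-01": 100, "2020-08-02": 150, "2020-08-03": 120}
print(day_on_day(totals))  # {'2020-08-02': 1.5, '2020-08-03': 0.8}
```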

Spark 2.2.0

| Rows | LinkRelativeRatio | DateDistinct | Filter | Count | SumByDay | DayOnDayBasis |
|---|---|---|---|---|---|---|
| 10M | 1.7s | 0.75s | 1.17s | 0.43s | 1.05s | 1.19s |
| 100M | 3.03s | 2.36s | 2.81s | 0.41s | 2.41s | 2.54s |
| 1B | 17.74s | 15.76s | 16.96s | 1.79s | 16.89s | 17.17s |

Spark 1.6.2

| Rows | LinkRelativeRatio | DateDistinct | Filter | Count | SumByDay | DayOnDayBasis |
|---|---|---|---|---|---|---|
| 10M | 1.7s | 0.86s | 1.6s | 0.49s | 1.08s | 1.5s |
| 100M | 5.49s | 2.29s | 5.49s | 0.82s | 2.36s | 4.12s |
| 1B | 30.85s | 16.74s | 34.36s | 2.52s | 17.84s | 30.35s |