Spark-Benchmark
Benchmark Data
| col_name | data_type |
| --- | --- |
| username | string |
| name | string |
| blood_group | string |
| company | string |
| birthdate | string |
| sex | string |
| job | string |
| ssn | string |
| mail | string |
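All nine columns are plain strings. As a point of reference, here is a minimal sketch of the same schema expressed as a PySpark `StructType`; the real table is created through Mobius, so this is purely illustrative:

```python
# Illustrative only: the benchmark schema expressed as a Spark StructType.
from pyspark.sql.types import StructField, StructType, StringType

CONTACT_COLUMNS = ["username", "name", "blood_group", "company",
                   "birthdate", "sex", "job", "ssn", "mail"]

# Every column in the benchmark table is a string.
contact_schema = StructType([StructField(c, StringType()) for c in CONTACT_COLUMNS])
```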
Test Method:
Fake data is written through Mobius in batches of 100,000 (10W) rows. For each cluster environment we record the write time, the Merge time, and the time of common aggregate queries.
Query time: each query is run 3 times and the times are averaged.
The machine writing the data sits on the same network as the cluster, i.e. all data transfer happens over the internal network.
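A minimal sketch of the data-generation step, assuming the Python Faker library (whose `profile()` provider covers the nine benchmark columns); the real pipeline pushes each batch through Mobius rather than writing it out directly:

```python
# Sketch only: generate one 10W-row batch of fake contacts with Faker.
from faker import Faker

BATCH_SIZE = 100_000  # 10W rows per batch, as described above

def fake_batch(n=BATCH_SIZE):
    fake = Faker()
    rows = []
    for _ in range(n):
        p = fake.profile()  # dict containing username, name, blood_group, company, ...
        rows.append((p["username"], p["name"], p["blood_group"], p["company"],
                     str(p["birthdate"]), p["sex"], p["job"], p["ssn"], p["mail"]))
    return rows
```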
| Statement | SQL |
| --- | --- |
| GroupBy | select blood_group from contact group by blood_group |
| Distinct | select distinct(blood_group) from contact |
| Count(Distinct) | select sex, count(distinct(username)) from contact group by sex |
| Where | select sex, count(distinct(username)) from contact where blood_group in ('A+', 'B-', 'AB+') group by sex |
| UDF | select count(merge(blood_group, sex)) from contact |
| Create Table 1 | create table contact2 as select * from contact |
| Create Table 2 | create table contact2 as select * from contact where sex='M' |
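A minimal sketch of how the query timings could be collected under the method above (3 runs per statement, averaged). The production path issues these statements through Mobius, so the direct `spark.sql` call here is an assumption:

```python
# Sketch only: run a benchmark statement 3 times and average the wall-clock time.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def time_query(sql, runs=3):
    elapsed = []
    for _ in range(runs):
        start = time.monotonic()
        spark.sql(sql).collect()  # collect() forces execution; results are small aggregates
        elapsed.append(time.monotonic() - start)
    return sum(elapsed) / runs

avg_seconds = time_query("select blood_group from contact group by blood_group")
```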
Test Results
Offline Cluster
Runtime Environment
Driver Mem: 8G, 50 Cores, Executor Mem: 20G, Node Count: 5
Data Operation Performance

| Rows | Table | File Size | Write Time | Merge | Create Table 1 | Create Table 2 |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | contact01 | 1.1G | 5min | 26s | 7.7s | 9.6s |
| 100M | contact | 11.1G | 83.3min | 4.3min | 1.3min | 1.2min |
| 1B | contact10 | 110.7G | 8.3h | 44min | 8.5min | 4.7min |
Query Performance (Spark 1.3.1, 2015-05-18)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 1s | 2s | 2.1s | 6.6s | 5s | 4s |
| 100M | 4.6s | 9.2s | 9.2s | 43s | 36s | 29s |
| 1B | 1.2min | 1.4min | 1.2min | 7.6min | 5.6min | 2.6min |
Query Performance (Spark 1.4.0, 2015-07-03)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.49s | 1.3s | 1.36s | 7.26s | 5.33s | 2.74s |
| 100M | 2.48s | 9.06s | 8.48s | 48.58s | 36.94s | 25.9s |
| 1B | 1.2min | 1.3min | 1.3min | 7.4min | 6min | 4.6min |
Query Performance (Spark 1.4.1, 2015-07-17)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 1.81s | 1.45s | 1.19s | 6.66s | 3.75s | 2.63s |
| 100M | 5.88s | 9.01s | 9.65s | 49.06s | 38.42s | 17.62s |
| 1B | 1.26min | 1.4min | 1.4min | 7.6min | 5.9min | 2.8min |
Parquet Format

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.48s | 0.81s | 0.83s | 5.49s | 3.2s | 2.13s |
| 100M | 0.63s | 1.26s | 1.41s | 34.52s | 15.32s | 2.82s |
| 1B | 2.48s | 7.06s | 8.21s | 315.84s | 142.91s | 20.31s |
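A minimal sketch of how a Parquet copy of the benchmark table could be prepared before re-running the same queries; the table name `contact_parquet` is hypothetical, and the actual conversion used for these numbers may have been done differently:

```python
# Sketch only: rewrite the benchmark table as Parquet and query the copy instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.table("contact") \
     .write.mode("overwrite") \
     .format("parquet") \
     .saveAsTable("contact_parquet")  # hypothetical name for the Parquet copy
```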
Query Performance (Spark 1.5.2, 2015-11-26)
Parquet

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.5s | 0.81s | 0.83s | 5.3s | 3.04s | 1.55s |
| 100M | 0.58s | 1.29s | 1.23s | 30.71s | 14.32s | 2.92s |
| 1B | 1.76s | 7.32s | 7.35s | 259.33s | 122.62s | 18.71s |
Tungsten Disabled

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.63s | 0.99s | 0.77s | 6.02s | 4.17s | 1.88s |
| 100M | 0.67s | 1.35s | 1.34s | 12.12s | 6.56s | 9.52s |
| 1B | 2.63s | 8.03s | 7.69s | 63.57s | 35.58s | 42.77s |
Query Performance (Spark 1.6.0, 2016-01-07)
Parquet

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.63s | 0.99s | 0.77s | 6.02s | 4.17s | 1.88s |
| 100M | 0.67s | 1.35s | 1.34s | 12.12s | 6.56s | 9.52s |
| 1B | 2.63s | 8.03s | 7.69s | 63.57s | 35.58s | 42.77s |
Query Performance (Spark 1.6.1, 2016-05-19)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.34s | 0.6s | 0.59s | 3.32s | 2.17s | 1.54s |
| 100M | 0.54s | 1.14s | 1.16s | 8.48s | 6.66s | 3.51s |
| 1B | 2.77s | 7.85s | 7.87s | 73.42s | 42.96s | 29.31s |
Query Performance (Spark 1.6.2, 2016-07-09)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.42s | 0.56s | 0.48s | 2.37s | 1.39s | 0.82s |
| 100M | 0.56s | 0.98s | 0.96s | 8.28s | 4.77s | 2.87s |
| 1B | 2.55s | 6.27s | 5.85s | 67.11s | 35.9s | 20.66s |
Query Performance (Spark 1.6.2, 2016-09-27)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.4s | 0.57s | 0.43s | 10.17s | 4.83s | 2.94s |
| 100M | 0.39s | 0.89s | 0.91s | 31.53s | 13.6s | 10.57s |
| 1B | 1.83s | 4.62s | 4.81s | 67.9s | 28.55s | 64.32s |
Query Performance (Spark 2.1.0, 2017-01-02)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.67s | 0.5s | 0.46s | 2.76s | 1.47s | 1.6s |
| 100M | 0.24s | 0.73s | 0.69s | 7.74s | 5.05s | 5.66s |
| 1B | 0.65s | 3.0s | 2.97s | 46.51s | 25.45s | 50.23s |
Query Performance (Spark 2.2.0, 2017-07-09)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.21s | 0.33s | 0.33s | 1.88s | 1.08s | 1.7s |
| 100M | 0.22s | 0.71s | 0.6s | 6.84s | 4.31s | 5.56s |
| 1B | 0.46s | 3.43s | 3.29s | 50.38s | 25.55s | 43.93s |
Offline Cluster
Runtime Environment
Driver Mem: 8G, 50 Cores, Executor Mem: 20G, Node Count: 5
Query Performance (Spark 2.2.0, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 5.67s | 2.89s | 0.83s | 8.74s | 3.88s | 5.54s |
| 100M | 1.3s | 0.32s | 0.36s | 0.48s | 0.4s | 0.38s |
| 1B | 1.08s | 3.0s | 2.57s | 250.25s | 116.11s | 48.98s |
Query Performance (Spark 2.3.0, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 3.02s | 1.22s | 0.45s | 8.22s | 3.19s | 2.89s |
| 100M | 0.23s | 0.2s | 0.2s | 0.27s | 0.26s | 0.3s |
| 1B | 0.59s | 3.54s | 3.4s | 270.14s | 111.44s | 49.32s |
Query Performance (Spark 2.4.6, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.41s | 0.51s | 0.48s | 8.29s | 3.07s | 3.3s |
| 100M | 0.2s | 0.21s | 0.22s | 0.25s | 0.24s | 0.23s |
| 1B | 0.44s | 3.26s | 3.43s | 257.84s | 113.04s | 54.31s |
Online Cluster (bdp-192)
Runtime Environment
Driver Mem: 25G, 216 Cores, Executor Mem: 18G, Node Count: 5
Query Performance (Spark 2.3.0, 2020-08-04)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.54s | 0.49s | 0.35s | 1.32s | 0.97s | 1.01s |
| 100M | 1.02s | 0.98s | 0.76s | 3.65s | 2.42s | 3.71s |
| 1B | 3.53s | 2.9s | 2.79s | 17.37s | 11.25s | 27.25s |
Query Performance (Spark 2.4.6, 2020-08-04)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.58s | 0.51s | 0.37s | 1.33s | 0.92s | 0.73s |
| 100M | 0.45s | 0.85s | 0.81s | 3.77s | 2.68s | 4.89s |
| 1B | 1.28s | 2.87s | 2.79s | 18.12s | 12.13s | 33.11s |
New Cluster Performance Test
spark-extra.conf:

```
spark.scheduler.mode FAIR
spark.executor.extraJavaOptions -Dtag=mobius.query.spi-wdb8 -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:PermSize=256M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:G1HeapRegionSize=1m -Xloggc:./gc.log -verbose:gc
spark.memory.fraction 0.8
spark.memory.storageFraction 0.2
spark.speculation.quantile 0.92
spark.speculation.multiplier 2
spark.speculation.interval 200ms
spark.executor.cores 36
spark.driver.maxResultSize 2048m
spark.sql.shuffle.partitions 180
spark.speculation true
spark.kryoserializer.buffer.max 512m
spark.shuffle.consolidateFiles true
spark.sql.autoBroadcastJoinThreshold -1
spark.sql.mergeSchema.parallelize 2
spark.locality.wait 100
spark.cleaner.ttl 3600
spark.default.parallelism 180
spark.sql.adaptive.enabled true
spark.sql.adaptive.shuffle.targetPostShuffleInputSize 1024880
spark.sql.adaptive.minNumPostShufflePartitions 4
spark.ui.retainedJobs 1000
spark.ui.retainedStages 2000
spark.sql.crossJoin.enabled true
```
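One way these extra properties might be applied is to pass the file to `spark-submit --properties-file spark-extra.conf`. As an alternative, a minimal PySpark sketch that parses the same whitespace-separated format into a `SparkConf` (the `load_properties` helper below is our own, not part of Spark):

```python
# Sketch only: load spark-extra.conf into a SparkConf when building a session.
from pyspark import SparkConf
from pyspark.sql import SparkSession

def load_properties(path):
    """Parse a whitespace-separated Spark properties file into (key, value) pairs."""
    pairs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition(" ")  # the value may itself contain spaces
            pairs.append((key, value.strip()))
    return pairs

conf = SparkConf().setAll(load_properties("spark-extra.conf"))
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```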
New machines, Spark 1.6.2: 5 nodes with 180 cores in total (36 cores, 18G memory per node); driver: 10 cores, 25G memory (2016-09-30)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.23s | 0.31s | 0.31s | 1.11s | 0.82s | 1.21s |
| 100M | 0.99s | 1.05s | 0.87s | 4.94s | 3.25s | 1.89s |
| 1B | 1.2s | 2.93s | 3.36s | 22.51s | 12.39s | 8.27s |
New machines, Spark 2.2.0: 5 nodes with 180 cores in total (36 cores, 18G memory per node); driver: 10 cores, 25G memory (2017-11-01)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.48s | 0.53s | 0.34s | 1.37s | 0.61s | 0.82s |
| 100M | 0.45s | 0.98s | 0.56s | 4.44s | 2.35s | 3.77s |
| 1B | 0.61s | 1.51s | 1.23s | 18.41s | 10.74s | 26.82s |
Online Cluster
Runtime Environment
Driver Mem: 25G, 40 Cores, Executor Mem: 4G, Node Count: 13
Data Operation Performance

| Rows | Table | File Size | Write Time | Merge | Create Table 1 | Create Table 2 |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | contact01 | 1.1G | 5min | 27s | 7.2s | 4.4s |
| 100M | contact | 11.1G | 83.3min | 4min | 63s | 40s |
| 1B | contact10 | 110.7G | 8.3h | 54min | 10min | 5min |
Query Performance (Spark 1.3.1, 13 nodes, 2015-05-18)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 1.6s | 1.7s | 1.4s | 5s | 3.8s | 2.1s |
| 100M | 2.6s | 6.7s | 6.8s | 50s | 29s | 12s |
| 1B | 32s | 52s | 51s | 7.4min | 4.5min | 99s |
Query Performance (Spark 1.3.1, 51 nodes, 2015-06-06)
Driver Mem: 25G, 150 Cores, Executor Mem: 6G, Node Count: 51

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.8s | 1.5s | 1.4s | 4.6s | 3.2s | 1.8s |
| 100M | 1.8s | 4s | 3.2s | 35s | 14s | 4.3s |
| 1B | 8s | 16s | 17s | 4.4min | 2.1min | 27s |
Query Performance (Spark 1.6.1, 59 nodes, 2016-05-13)
Driver Mem: 25G, 150 Cores, Executor Mem: 6G

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 0.62s | 0.75s | 0.78s | 2.5s | 1.88s | 0.92s |
| 100M | 0.66s | 1.52s | 1.32s | 6.2s | 4.85s | 3.57s |
| 1B | 2.5s | 3.64s | 3.26s | 17.75s | 9.67s | 19.15s |
create table contact50w as SELECT * FROM contact01 TABLESAMPLE (500000 ROWS)
Spark Concurrency Performance Test
Test SQL: select blood_group from contact group by blood_group
The test runs against the online cluster, with 13 Spark nodes.
Results are as follows (time in ms):

| Concurrent Queries | Min | Avg | Max |
| --- | --- | --- | --- |
| 1 | 7434 | 7499 | 8197 |
| 5 | 7557 | 10698 | 16514 |
| 10 | 18843 | 21328 | 37128 |
| 20 | 35113 | 39487 | 42407 |
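A minimal sketch of how such a concurrency test could be driven, assuming the statement is issued from N client threads against a single SparkSession running with the FAIR scheduler; the real test goes through Mobius, so this only illustrates the shape of the measurement:

```python
# Sketch only: fire the same GROUP BY from N threads and record per-query latency in ms.
import time
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
SQL = "select blood_group from contact group by blood_group"

def run_once(_):
    start = time.monotonic()
    spark.sql(SQL).collect()
    return (time.monotonic() - start) * 1000  # ms, to match the table above

def run_concurrent(n):
    with ThreadPoolExecutor(max_workers=n) as pool:
        latencies = list(pool.map(run_once, range(n)))
    return min(latencies), sum(latencies) / n, max(latencies)

for n in (1, 5, 10, 20):
    print(n, run_concurrent(n))
```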
Real-Time Join Query Performance Test
Test method: ab -n 20 -c 1 [url], i.e. 20 query requests at a concurrency of 1, with the times averaged.
Two tables are left-joined and SELECT a,SUM(b) FROM temp GROUP BY a LIMIT 10000 is executed, comparing query performance with and without materializing the joined table.
The row count refers to the size of the joined result.

| Rows | Not Materialized | Materialized |
| --- | --- | --- |
| 1000 | 450.216ms | 531.483ms |
| 200K | 1282.255ms | 674.580ms |
| 10M | 6869.201ms | 2634.941ms |
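A minimal sketch of the two variants being compared, assuming the joined result is exposed either as a temporary view (recomputed per query) or as a materialized table; the names `t1`, `t2`, and `temp_materialized` are illustrative, and the real test is served over HTTP and measured with ab:

```python
# Sketch only: the same aggregate against a non-materialized vs. a materialized join.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
JOIN_SQL = "SELECT t1.a, t2.b FROM t1 LEFT JOIN t2 ON t1.a = t2.a"

# Not materialized: the join is re-executed every time the view is queried.
spark.sql(JOIN_SQL).createOrReplaceTempView("temp")
spark.sql("SELECT a, SUM(b) FROM temp GROUP BY a LIMIT 10000").collect()

# Materialized: the join result is written out once, then queried as a plain table.
spark.sql("CREATE TABLE temp_materialized AS " + JOIN_SQL)
spark.sql("SELECT a, SUM(b) FROM temp_materialized GROUP BY a LIMIT 10000").collect()
```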
BDP-Benchmark
Results obtained with the bdp_benchmark.py script under the tools directory of tassadar, sending requests one after another to the mobius instance on bdp-192.
Spark 2.2.0

| Rows | LinkRelativeRatio | DateDistinct | Filter | Count | SumByDay | DayOnDayBasis |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 1.7s | 0.75s | 1.17s | 0.43s | 1.05s | 1.19s |
| 100M | 3.03s | 2.36s | 2.81s | 0.41s | 2.41s | 2.54s |
| 1B | 17.74s | 15.76s | 16.96s | 1.79s | 16.89s | 17.17s |
Spark 1.6.2

| Rows | LinkRelativeRatio | DateDistinct | Filter | Count | SumByDay | DayOnDayBasis |
| --- | --- | --- | --- | --- | --- | --- |
| 10M | 1.7s | 0.86s | 1.6s | 0.49s | 1.08s | 1.5s |
| 100M | 5.49s | 2.29s | 5.49s | 0.82s | 2.36s | 4.12s |
| 1B | 30.85s | 16.74s | 34.36s | 2.52s | 17.84s | 30.35s |