1. Introduction

Docker is built primarily on two Linux kernel features: namespaces and cgroups. Our investigation shows that, although cgroups are meant to be a lightweight resource-control mechanism, parts of their implementation become a concurrency bottleneck for containers in high-density scenarios.

Let us first start a single container and profile its resource consumption. The container image is hello-world, the analysis tool is perf, and the test machine runs CentOS with kernel 4.19.91-26.al7.x86_64 on 104 cores with 190 GB of memory.

$ perf stat docker run hello-world

Performance counter stats for 'docker run hello-world':

26.52 msec task-clock # 0.099 CPUs utilized
1,227 context-switches # 0.046 M/sec
5 cpu-migrations # 0.189 K/sec
772 page-faults # 0.029 M/sec
81,587,043 cycles # 3.076 GHz
101,426,469 instructions # 1.24 insn per cycle
21,366,561 branches # 805.572 M/sec
321,242 branch-misses # 1.50% of all branches

0.267796817 seconds time elapsed

0.014037000 seconds user
0.017277000 seconds sys

As the numbers show, resource consumption is low and the container starts and finishes quickly: about 0.27 s elapsed in total, of which 0.017 s was spent in kernel mode and 0.014 s in user mode.

2. Concurrency test

Next we use a shell script to launch Docker containers concurrently; the number of concurrent worker threads in the script is set to 100.
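The script itself is not included in the article; the following is a minimal sketch of what such a driver might look like (the image/count arguments match the invocations below, but the throttling and timing logic are assumptions):

#!/usr/bin/env bash
# docker_creater.sh -- hypothetical sketch; the real script is not shown in the article.
# Usage: ./docker_creater.sh <image> <count>
IMAGE=$1
COUNT=$2
start=$(date +%s%3N)
for i in $(seq 1 "$COUNT"); do
    docker run "$IMAGE" >/dev/null 2>&1 &    # launch each container in the background
    if (( i % 100 == 0 )); then wait; fi     # cap the number of in-flight launches at ~100
done
wait                                         # wait for the remaining launches to finish
echo "Create phrase finished, cost $(( $(date +%s%3N) - start )) ms"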

1. With a concurrency of 50:
$ perf stat ./docker_creater.sh hello-world 50
[#######################################] done All containers created
Create phrase finished, cost 6035 ms

Performance counter stats for './docker_creater.sh hello-world 50':

1,786.60 msec task-clock # 0.296 CPUs utilized
63,860 context-switches # 0.036 M/sec
2,297 cpu-migrations # 0.001 M/sec
64,895 page-faults # 0.036 M/sec
5,540,419,196 cycles # 3.101 GHz
5,970,268,108 instructions # 1.08 insn per cycle
1,289,677,965 branches # 721.860 M/sec
18,658,466 branch-misses # 1.45% of all branches

6.045650340 seconds time elapsed

0.973085000 seconds user
0.994541000 seconds sys
2. With a concurrency of 500:
$ perf stat ./docker_creater.sh hello-world 500
[#######################################] done All containers created
Create phrase finished, cost 65616 ms

Performance counter stats for './docker_creater.sh hello-world 500':

18,233.66 msec task-clock # 0.278 CPUs utilized
659,173 context-switches # 0.036 M/sec
28,407 cpu-migrations # 0.002 M/sec
659,799 page-faults # 0.036 M/sec
56,510,215,983 cycles # 3.099 GHz
60,901,015,964 instructions # 1.08 insn per cycle
13,148,024,010 branches # 721.085 M/sec
187,397,573 branch-misses # 1.43% of all branches

65.625556778 seconds time elapsed

9.658451000 seconds user
10.777222000 seconds sys
3. With a concurrency of 1000:
$ perf stat ./docker_creater.sh hello-world 1000
[#######################################] done All containers created
Create phrase finished, cost 139439 ms

Performance counter stats for './docker_creater.sh hello-world 1000':

36,734.54 msec task-clock # 0.263 CPUs utilized
1,321,846 context-switches # 0.036 M/sec
52,247 cpu-migrations # 0.001 M/sec
1,366,779 page-faults # 0.037 M/sec
113,861,851,101 cycles # 3.100 GHz
122,471,898,117 instructions # 1.08 insn per cycle
26,463,252,329 branches # 720.392 M/sec
375,802,825 branch-misses # 1.42% of all branches

139.449586644 seconds time elapsed

19.408059000 seconds user
23.166960000 seconds sys
4. With a concurrency of 2000:
$ perf stat ./docker_creater.sh hello-world 2000
[#######################################] done All containers created
Create phrase finished, cost 375861 ms

Performance counter stats for './docker_creater.sh hello-world 2000':

79,909.95 msec task-clock # 0.213 CPUs utilized
2,888,193 context-switches # 0.036 M/sec
124,893 cpu-migrations # 0.002 M/sec
2,866,336 page-faults # 0.036 M/sec
247,774,410,634 cycles # 3.101 GHz
265,033,546,482 instructions # 1.07 insn per cycle
57,442,931,091 branches # 718.846 M/sec
798,482,816 branch-misses # 1.39% of all branches

375.871530806 seconds time elapsed

42.632677000 seconds user
56.440334000 seconds sys

Some intermediate test runs are omitted here; the final result curves are as follows:

[Figure: total container-creation latency and total CPU time versus concurrency]
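For reference, the four runs shown above summarize as follows (numbers taken directly from the perf stat output; going from 50 to 2000 containers, a 40x increase, task-clock grows roughly 45x while elapsed time grows roughly 62x):

Concurrency   Elapsed (s)   task-clock (ms)   user (s)   sys (s)
50            6.05          1,786.60          0.97       0.99
500           65.63         18,233.66         9.66       10.78
1000          139.45        36,734.54         19.41      23.17
2000          375.87        79,909.95         42.63      56.44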

The figure shows that as concurrency rises, both total container-creation latency and total CPU time increase, but the total startup time follows a visibly steeper trend, approaching exponential growth at the high end, while CPU time grows close to linearly. We can therefore conclude that even when CPU utilization is low, container startup time increases sharply with the level of concurrency, which strongly suggests that most of the added latency comes from waiting on locks around critical sections rather than from actual CPU work.

3. Analyzing the time consumption

To see where the time goes, we mimic the cgroup-related work a container startup performs: a test program creates cgroups, attaches tasks to them, and deletes them, while perf monitors the system.

The test is configured with 50 concurrent threads, each serially creating 100 cgroups.
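The create_cgroup source is not included in the article; as a rough shell-level sketch of what each worker thread does (the real test is a multithreaded program, and the cgroup v1 hierarchy path used here is an assumption):

#!/usr/bin/env bash
# Hypothetical per-worker loop: create a cgroup, attach a task, detach it, delete the cgroup.
N=${1:-100}                        # number of cgroups this worker creates serially
ROOT=/sys/fs/cgroup/cpu            # assumed cgroup v1 hierarchy
for i in $(seq 1 "$N"); do
    CG=$ROOT/perf_test_$$_$i
    mkdir "$CG"                    # mkdir() -> do_mkdirat() ... cgroup_mkdir() in the kernel
    echo $$ > "$CG/cgroup.procs"   # attach this shell: write() -> kernfs_fop_write() -> cgroup_file_write()
    echo $$ > "$ROOT/cgroup.procs" # move the task back to the root so the cgroup becomes empty
    rmdir "$CG"                    # delete the now-empty cgroup
done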

$ perf record -a -g ./create_cgroup -t 50 -n 100
$ perf report
Samples: 676K of event 'cycles:ppp', Event count (approx.): 428751413843
  Children      Self  Command      Shared Object       Symbol
+   83.53%     0.00%  cgroup_test  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
+   83.53%     0.01%  cgroup_test  [kernel.kallsyms]   [k] do_syscall_64
+   57.66%    57.06%  cgroup_test  [kernel.kallsyms]   [k] osq_lock
+   47.17%     0.00%  cgroup_test  libpthread-2.17.so  [.] start_thread
+   47.17%     0.00%  cgroup_test  cgroup_test         [.] func
+   47.16%     0.00%  cgroup_test  cgroup_test         [.] single_test
+   47.15%     0.00%  cgroup_test  libc-2.17.so        [.] __GI___mkdir
+   47.14%     0.00%  cgroup_test  [kernel.kallsyms]   [k] do_mkdirat
+   44.01%     0.04%  cgroup_test  [kernel.kallsyms]   [k] __mutex_lock.isra.12
+   30.43%     0.00%  cgroup_test  [unknown]           [k] 0000000000000000
+   30.23%     0.00%  cgroup_test  [kernel.kallsyms]   [k] filename_create
+   30.14%     0.01%  cgroup_test  [kernel.kallsyms]   [k] rwsem_optimistic_spin
+   30.13%     0.01%  cgroup_test  [kernel.kallsyms]   [k] down_write
+   30.12%     0.00%  cgroup_test  [kernel.kallsyms]   [k] call_rwsem_down_write_failed
+   30.12%     0.00%  cgroup_test  [kernel.kallsyms]   [k] rwsem_down_write_failed
+   30.01%     0.00%  cgroup_test  libc-2.17.so        [.] __GI___libc_write
+   30.00%     0.00%  cgroup_test  [kernel.kallsyms]   [k] ksys_write
+   30.00%     0.00%  cgroup_test  [kernel.kallsyms]   [k] vfs_write
+   29.99%     0.00%  cgroup_test  [kernel.kallsyms]   [k] kernfs_fop_write
+   29.98%     0.00%  cgroup_test  [kernel.kallsyms]   [k] cgroup_file_write
+   27.55%     0.01%  cgroup_test  [kernel.kallsyms]   [k]


-   83.53%     0.00%  cgroup_test  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
   - 83.53% entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 47.14% do_mkdirat
            - 30.23% filename_create
               - 30.12% down_write
                  - call_rwsem_down_write_failed
                     - rwsem_down_write_failed
                        - 30.09% rwsem_optimistic_spin
                             22.95% osq_lock
                              7.14% rwsem_spin_on_owner
            - 16.85% vfs_mkdir
               - 16.85% kernfs_iop_mkdir
                  - 16.84% cgroup_mkdir
                     - 12.15% cgroup_kn_lock_live
                        - 12.15% __mutex_lock.isra.12
                             7.75% osq_lock
                             4.40% mutex_spin_on_owner
                     - 4.45% cgroup_apply_control_enable
                        + 2.21% online_css
                        + 0.84% cpu_cgroup_css_alloc

[Figure: flame graph of the create_cgroup test run]

As the flame graph above shows, the vast majority of the time is spent in osq_lock and mutex_spin_on_owner, and the callers are mainly lock operations on the cgroup filesystem: in the call graph above they appear as the rwsem taken on the mkdir path via filename_create/down_write, and the mutex taken inside cgroup_mkdir via cgroup_kn_lock_live/__mutex_lock.
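For reference, a flame graph like the one above can be generated from the recorded perf.data using Brendan Gregg's FlameGraph scripts (the ./FlameGraph checkout path below is an assumption):

$ perf script -i perf.data > out.perf
$ ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
$ ./FlameGraph/flamegraph.pl out.folded > cgroup_contention.svg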