perf 是Linux的一款性能分析工具,能够进行函数级和指令级的热点查找,可以用来分析程序中热点函数的CPU占用率,从而定位性能瓶颈。
Performance analysis tools for Linux.
系统性能优化通常可以分为两个阶段:性能分析和性能优化。
- 性能分析的目的是查找性能瓶颈、热点代码,分析引发性能问题的原因,包括评估程序对硬件资源的使用情况,例如各级cache的访问次数,各级cache的丢失次数、流水线停顿周期、前端总线访问次数等。评估程序对操作系统资源的使用情况,系统调用次数、上下文切换次数、任务迁移次数等。
- 基于性能分析,可以进行性能优化,包括:算法优化(空间复杂度和时间复杂度的权衡)和代码优化(提高执行速度、减少内存占用)。
1.perf的原理
Linux性能计数器是一个基于内核的子系统,它提供一个性能分析框架,比如硬件(CPU、PMU(Performance Monitoring Unit))功能和软件(软件计数器、tracepoint)功能,tracepoints是预埋在内核源码中的一些hook,它们能够在特定的代码被执行到时触发,这一特定能够被各类trace/debug工具所使用。perf将tracepoint产生的时间记录下来,生成报告,这些tracepint的对应的sysfs节点在/sys/kernel/debug/tracing/events目录下。
通过perf,应用程序可以利用PMU、tracepoint和内核中的计数器来进行性能统计。
Perf 可以对众多软硬件事件(程序进行到函数级别)采样,还能采集出跟踪点(trace points)的信息(比如系统调用、TCP/IP事件和文件系统操作,从而了解程序的性能瓶颈在哪里。其基本原理是:每隔一个固定时间,就是CPU上产生一个中断,看当前是哪个进程、哪个函数,然后给对应的进程和函数加一个统计值,这样就知道CPU有多少时间在某个进程或某个函数上了。perf的代码和Linux内核代码放在一起,是内核级的工具。perf是在Linux上做剖析分析的首选工具。
2.perf命令介绍
perf支持的命令如下
root: /5g_build/5g_Main/$ perf -h
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
archive Create archive with object files with build-ids found in perf.data file
bench General framework for benchmark suites
buildid-cache Manage build-id cache.
buildid-list List the buildids in a perf.data file
c2c Shared Data C2C/HITM Analyzer.
config Get and set variables in a configuration file.
data Data file related processing
diff Read perf.data files and display the differential profile
evlist List the event names in a perf.data file
ftrace simple wrapper for kernel's ftrace functionality
inject Filter to augment the events stream with additional information
kallsyms Searches running kernel for symbols
kmem Tool to trace/measure kernel memory properties
kvm Tool to trace/measure kvm guest os
list List all symbolic event types
lock Analyze lock events
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
test Runs sanity tests.
timechart Tool to visualize total system behavior during a workload
top System profiling tool.
probe Define new dynamic tracepoints
trace strace inspired tool
See 'perf help COMMAND' for more information on a specific command.
annotate:读取 perf.data(由 perf record记录)并显示带注释的代码,需要在编译应用程序时加入-g选项
archive:用perf.data文件中找到的build-ids的对象文件创建档案。
bench:对系统调度、内存访问、epoll、Futex等进行压力测试。
buildid-cache: 管理build-id缓存
buildid-list:列出perf.data文件中的buildids。
c2c:共享数据C2C/HITM分析仪。
config:读取或设置配置文件中的变量
data:数据文件相关处理
diff: 读取perf.data文件并显示差分曲线
ftrace:内核的ftrace功能的简单封装器
inject:用额外的信息来增加事件流的过滤器
kallsyms:搜索运行中的内核中的符号
kmem:追踪/测量内核内存属性的工具
kvm: 追踪/测量kvm客户操作系统的工具
list:列出所有象征性的事件类型
lock:分析锁事件
mem:分析内存访问
record:将所有的分析记录进perf.data
report:读取perf.data(由perf记录创建)并显示概况
sched:跟踪/测量调度器属性(延迟)的工具
script: 读取perf.data(由perf记录创建)并显示跟踪输出
stat:运行一个命令并收集性能计数器的统计数据
test:测试系统内核支持的功能
timechart:在工作负载期间可视化整个系统行为的工具
top:系统分析工具
probe:定义新的动态跟踪点
trace:strace启发的工具
perf list
列出所有能够触发perf采样点的事件:
List of pre-defined events (to be used in -e):
alignment-faults [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
msr/pperf/ [Kernel PMU event]
msr/smi/ [Kernel PMU event]
msr/tsc/ [Kernel PMU event]
power/energy-cores/ [Kernel PMU event]
power/energy-pkg/ [Kernel PMU event]
power/energy-ram/ [Kernel PMU event]
rNNN [Raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor]
(see 'man perf-list' on how to encode it)
mem:<addr>[/len][:access] [Hardware breakpoint]
block:block_bio_backmerge [Tracepoint event]
block:block_bio_bounce [Tracepoint event]
block:block_bio_complete [Tracepoint event]
block:block_bio_frontmerge [Tracepoint event]
block:block_bio_queue [Tracepoint event]
block:block_bio_remap [Tracepoint event]
block:block_dirty_buffer [Tracepoint event]
block:block_getrq [Tracepoint event]
block:block_plug [Tracepoint event]
block:block_rq_abort [Tracepoint event]
主要有三类:
- Hardware Event: 是由PMU硬件产生的事件,比如 cache 命中
- Software Event:是内核软件产生的事件,比如进程切换,tick 数等
- Tracepoint event:是内核中的静态 tracepoint 所触发的事件
perf stat
采集程序的运行时间和CPU开销,perf stat支持的主要参数:
-a, --all-cpus system-wide collection from all CPUs
-A, --no-aggr disable CPU count aggregation
-B, --big-num print large numbers with thousands' separators
-C, --cpu <cpu> list of cpus to monitor in system-wide
-D, --delay <n> ms to wait before starting measurement after program start (-1: start with events disabled)
-d, --detailed detailed run - start a lot of events
-e, --event <event> event selector. use 'perf list' to list available events
-G, --cgroup <name> monitor event in cgroup name only
-g, --group put the counters into a counter group
-I, --interval-print <n>
print counts at regular interval in ms (overhead is possible for values <= 100ms)
-i, --no-inherit child tasks do not inherit counters
-M, --metrics <metric/metric group list>
monitor specified metrics or metric groups (separated by ,)
-n, --null null run - dont start any counters
-o, --output <file> output file name
-p, --pid <pid> stat events on existing process id
-r, --repeat <n> repeat command and print average + stddev (max: 100, forever: 0)
-S, --sync call sync() before starting a run
-t, --tid <tid> stat events on existing thread id
-T, --transaction hardware transaction statistics
-v, --verbose be more verbose (show counter open errors, etc)
使用命令perf stat -p <pid>
可以采集某一个进程的运行信息,<pid>
可以通过top
工具获得:
top - 16:51:29 up 24 days, 1 min, 2 users, load average: 0.48, 0.68, 1.00
Tasks: 26 total, 2 running, 24 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.6 us, 1.2 sy, 0.0 ni, 85.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32763512 total, 14078456 used, 18685056 free, 842136 buffers
KiB Swap: 0 total, 0 used, 0 free. 10791840 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31404 root 20 0 19988 892 684 R 100.0 0.003 0:08.62 test
16506 root 20 0 636592 26840 16040 S 1.333 0.082 2:44.31 node
1103 root 20 0 1010144 42180 15692 S 0.667 0.129 1:37.42 node
1072 root 20 0 998896 104500 18832 S 0.333 0.319 1:19.19 node
1 root 20 0 26864 2044 1636 S 0.000 0.006 0:00.05 code-server-x86
1030 root 20 0 73888 3640 2872 S 0.000 0.011 0:00.00 sshd
1033 root 20 0 35836 2216 1732 S 0.000 0.007 0:00.00 rsyncd
1035 root 20 0 26864 932 528 S 0.000 0.003 0:00.00 code-server-x86
1036 root 20 0 205680 32388 6180 S 0.000 0.099 0:11.50 python3
1038 root 20 0 129800 31800 5612 S 0.000 0.097 0:11.43 python3
1042 root 20 0 723032 36572 16052 S 0.000 0.112 0:01.49 node
16465 root 20 0 1072336 163008 23064 S 0.000 0.498 0:48.81 node
16476 root 20 0 896504 98148 15460 S 0.000 0.300 0:13.43 node
16512 root 20 0 659364 84108 18608 S 0.000 0.257 0:03.60 node
16514 root 20 0 616684 23924 14844 S 0.000 0.073 0:01.01 node
16537 root 20 0 607840 29452 17456 S 0.000 0.090 0:00.97 node
16543 root 20 0 605776 25336 14788 S 0.000 0.077 0:00.91 node
16550 root 20 0 605728 27440 14792 S 0.000 0.084 0:00.91 node
16557 root 20 0 606312 25904 14848 S 0.000 0.079 0:00.94 node
27352 root 20 0 1595104 483140 7972 S 0.000 1.475 5:00.23 cpptools
29068 root 20 0 4902488 8660 3988 S 0.000 0.026 0:00.21 cpptools-srv
29529 root 20 0 30560 6060 2032 S 0.000 0.018 0:00.04 bash
31411 root 20 0 30456 5968 1940 S 0.000 0.018 0:00.03 bash
31455 root 20 0 28764 1740 1280 R 0.000 0.005 0:00.00 top
31475 root 20 0 26860 1972 1588 S 0.000 0.006 0:00.00
31479 root 20 0 21564 964 764 S 0.000 0.003 0:00.00 sleep
root: /5g_build/5g_Main/$ perf stat -p 31404
^C
Performance counter stats for process id '31404':
18958.321499 task-clock (msec) # 1.000 CPUs utilized
28 context-switches # 0.001 K/sec
6 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
18.959020937 seconds time elapsed
- Task-clock(msec):CPU利用率,该值高,说明程序的多数时间花费在 CPU 计算上而非 IO
- Context-switches:进程切换次数,记录了程序运行过程中发生了多少次进程切换,频繁的进程切换是应该避免的
- Cache-misses:程序运行过程中总体的 cache 利用情况,如果该值过高,说明程序的 cache 利用不好
- CPU-migrations:表示进程
31404
运行过程中发生了多少次 CPU 迁移,即被调度器从一个 CPU 转移到另外一个 CPU 上运行 - Cycles:处理器时钟,一条机器指令可能需要多个 cycles
- Instructions: 机器指令数目
- IPC:是 Instructions/Cycles 的比值,该值越大越好,说明程序充分利用了处理器的特性
- Cache-references: cache命中的次数
- Cache-misses: cache失效的次数
- Page-faults:缺页异常的次数。当应用程序请求的页面尚未建立、请求的页面不在内存中,或者请求的页面虽然在内存中,但物理地址和虚拟地址的映射关系尚未建立时,都会触发一次缺页异常。另外TLB不命中,页面访问权限不匹配等情况也会触发缺页异常。
- Branches:遇到的分支指令数
- Branch-misses:预测错误的分支指令数
perf record
剖析采样可以帮助我们采集到程序运行的特征,而且剖析精度非常高,可以定位到具体的代码行和指令块,perf record支持的参数如下:
-a, --all-cpus system-wide collection from all CPUs
-b, --branch-any sample any taken branches
-B, --no-buildid do not collect buildids in perf.data
-c, --count <n> event period to sample
-C, --cpu <cpu> list of cpus to monitor
-d, --data Record the sample addresses
-D, --delay <n> ms to wait before starting measurement after program start (-1: start with events disabled)
-e, --event <event> event selector. use 'perf list' to list available events
-F, --freq <freq or 'max'>
profile at this frequency
-g enables call-graph recording
-G, --cgroup <name> monitor event in cgroup name only
-I, --intr-regs[=<any register>]
sample selected machine registers on interrupt, use '-I?' to list register names
-i, --no-inherit child tasks do not inherit counters
-j, --branch-filter <branch filter mask>
branch stack filter modes
-k, --clockid <clockid>
clockid to use for events, see clock_gettime()
-m, --mmap-pages <pages[,pages]>
number of mmap data pages and AUX area tracing mmap pages
-N, --no-buildid-cache
do not update the buildid cache
-n, --no-samples don't sample
-o, --output <file> output file name
-P, --period Record the sample period
-p, --pid <pid> record events on existing process id
-q, --quiet don't print any message
-R, --raw-samples collect raw sample records from all opened counters
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
-S, --snapshot[=<opts>]
AUX area tracing Snapshot Mode
-s, --stat per thread counts
-t, --tid <tid> record events on existing thread id
-T, --timestamp Record the sample timestamps
-u, --uid <user> user to profile
-v, --verbose be more verbose (show counter open errors, etc)
通过record
命令perf
会把记录的数据存储在perf.data
中,然后用report
命令查看:
root: /5g_build/5g_Main/$ perf record -F 1000 -p 31404
以1kHz的采样率对进程31404进行采样
如果想关心perf的具体某个事件,希望按某个事件来进行统计则可以通过-e来指定某个事件
使用tracepoint
的基本需求是对内核的运行时行为的关心
perf record -e raw_syscalls:sys_enter ./test
列出系统调用次数,查看系统调用主要发生在哪里:
Samples: 144 of event 'raw_syscalls:sys_enter', Event count (approx.): 144
Overhead Trace output
5.56% NR 9 (0, 1000, 3, 22, ffffffff, 0)
2.08% NR 3 (3, 1a39e, 1, 2, 3, 0)
2.08% NR 9 (0, 1a39e, 1, 2, 3, 0)
1.39% NR 0 (3, 7f661ec89000, 1000, 22, ffffffff, 0)
0.69% NR 0 (3, 7f661ec89000, 1000, 0, 3d7467786b746674, 7471766b70716f3d)
0.69% NR 0 (3, 7f661ec89000, 1000, 6874206461657220, 7f661ec894a8, 7f661ea64698)
0.69% NR 0 (3, 7ffe58453830, 340, 10eddc0, 80000000, 7ffe58453817)
0.69% NR 0 (3, 7ffe58453840, 340, 10ec700, 80000000, 7ffe58453827)
0.69% NR 0 (3, 7ffe58453ae0, 340, 10ec4f0, 90000001, 7ffe58453ac7)
0.69% NR 0 (3, 7ffe58453b00, 340, 10ed8d0, 90000001, 7ffe58453ae7)
0.69% NR 0 (3, 7ffe58453cc0, 340, 7f661ec8a4b8, 0, 7ffe58453ca7)
0.69% NR 0 (3, 7ffe58453cf0, 340, 7f661ec6ee58, 0, 7ffe58453cd7)
0.69% NR 0 (3, 7ffe58453d20, 340, 7f661ec6e988, 0, 7ffe58453d07)
0.69% NR 0 (3, 7ffe58453e40, 340, 7f661ec6e4b8, 0, 7ffe58453e27)
0.69% NR 0 (3, 7ffe58453f70, 340, 7f661ec8dc90, 4000000, 7ffe58453f57)
0.69% NR 0 (5, 7f661ec89000, 1000, 22, ffffffff, 0)
0.69% NR 0 (5, 7f661ec89000, 1000, 656d616e79622e77, 7f661ec896dd, 6f64616873206568)
0.69% NR 10 (600000, 1000, 1, 7f661ec8d148, 0, 0)
0.69% NR 10 (7f661d868000, 1ff000, 0, 802, 3, 0)
0.69% NR 10 (7f661da67000, 1000, 1, 10eddf0, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 10 (7f661da79000, 200000, 0, 802, 3, 0)
0.69% NR 10 (7f661dc79000, 1000, 1, 10ed8f0, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 10 (7f661dc91000, 1ff000, 0, 802, 3, 0)
0.69% NR 10 (7f661de90000, 1000, 1, 10ed2c0, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 10 (7f661de9b000, 1ff000, 0, 802, 3, 0)
0.69% NR 10 (7f661e09a000, 1000, 1, 10ece10, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 10 (7f661e0a3000, 1ff000, 0, 802, 3, 0)
0.69% NR 10 (7f661e2a2000, 1000, 1, 7f661ec8a4d0, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 10 (7f661e2bc000, 1ff000, 0, 802, 3, 0)
0.69% NR 10 (7f661e4bb000, 1000, 1, 7f661ec8a000, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 10 (7f661e4c3000, 200000, 0, 802, 3, 0)
0.69% NR 10 (7f661e6c3000, 1000, 1, 7f661ec6e9a0, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 10 (7f661e861000, 1ff000, 0, 802, 3, 0)
0.69% NR 10 (7f661ea60000, 4000, 1, 7f661ec6e4d0, 0, 7f661ec8d6e0)
0.69% NR 10 (7f661eb6e000, ff000, 0, 802, 3, 0)
0.69% NR 10 (7f661ec8b000, 1000, 1, 7f661ec8c998, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 102 (7f661eb6c842, 7f661ec6d821, 1, ff87a407, f8f0000000000000, 10f62b0)
0.69% NR 11 (7f661ec6f000, 1a39e, 2a6c, 7f661ec8c998, 7f661ec6e4d0, 7f661ec6e4d0)
0.69% NR 11 (7f661ec6f000, 1a39e, 7ffe58454057, 10ece10, 0, 7f661ec6e4d0)
0.69% NR 11 (7f661ec6f000, 1a39e, 7ffe58454077, 10ed8f0, 0, 7f661ec6e4d0)
0.69% NR 11 (7f661ec89000, 1000, 0, 6874206461657220, 7f661ec89400, 10ec6d0)
0.69% NR 11 (7f661ec89000, 1000, 0, fd09cf21, 10f6160, 7f661eb66700)
0.69% NR 11 (7f661ec89000, 1000, 0, fd09cf21, 7f661ec89600, 10eeab0)
0.69% NR 11 (7f661ec8a000, b, 1ce6e, 3a, 0, 70000021)
0.69% NR 12 (0, 0, 0, 0, 37f, 64)
使用-g选项可以查看到具体某个函数所花费的时间以及函数的调用路径:
Samples: 54K of event 'cpu-clock', Event count (approx.): 13565000000
Children Self Command Shared Object Symbol
+ 99.99% 0.00% test [.] __libc_start_main
+ 99.99% 0.00% test test [.] main
+ 99.99% 99.99% test test [.] print
0.00% 0.00% test [kernel.kallsyms] [k] __do_softirq
0.00% 0.00% test [kernel.kallsyms] [k] ret_from_intr
0.00% 0.00% test [kernel.kallsyms] [k] system_call_fastpath
0.00% 0.00% test [kernel.kallsyms] [k] __irqentry_text_start
0.00% 0.00% test [kernel.kallsyms] [k] irq_exit
0.00% 0.00% test [kernel.kallsyms] [k] do_softirq
0.00% 0.00% test [kernel.kallsyms] [k] call_softirq
0.00% 0.00% test [kernel.kallsyms] [k] scsi_softirq_done
0.00% 0.00% test [kernel.kallsyms] [k] pagevec_lru_move_fn
0.00% 0.00% test [.] _dl_sysdep_start
0.00% 0.00% test [.] munmap
0.00% 0.00% test [kernel.kallsyms] [k] __call_rcu
0.00% 0.00% test [kernel.kallsyms] [k] _raw_qspin_lock
0.00% 0.00% test [kernel.kallsyms] [k] get_empty_filp
0.00% 0.00% test [.] _dl_map_object
0.00% 0.00% test [.] _IO_file_close
0.00% 0.00% test [unknown] [k] 0000000000000000
0.00% 0.00% test [kernel.kallsyms] [k] int_signal
0.00% 0.00% test [kernel.kallsyms] [k] sys_munmap
0.00% 0.00% test [.] _dl_setup_hash
0.00% 0.00% test [.] _IO_file_open
0.00% 0.00% test [kernel.kallsyms] [k] async_page_fault
0.00% 0.00% test [kernel.kallsyms] [k] do_notify_resume
0.00% 0.00% test [kernel.kallsyms] [k] vm_munmap
0.00% 0.00% test [kernel.kallsyms] [k] do_async_page_fault
0.00% 0.00% test [kernel.kallsyms] [k] do_munmap
0.00% 0.00% test [kernel.kallsyms] [k] sys_open
0.00% 0.00% test [kernel.kallsyms] [k] task_work_run
0.00% 0.00% test [kernel.kallsyms] [k] ____fput
0.00% 0.00% test [kernel.kallsyms] [k] do_sys_open
0.00% 0.00% test [kernel.kallsyms] [k] trace_do_page_fault
0.00% 0.00% test [kernel.kallsyms] [k] unmap_region
0.00% 0.00% test [kernel.kallsyms] [k] __do_page_fault
0.00% 0.00% test [kernel.kallsyms] [k] __fput
0.00% 0.00% test [kernel.kallsyms] [k] do_filp_open
0.00% 0.00% test [kernel.kallsyms] [k] lru_add_drain
0.00% 0.00% test [kernel.kallsyms] [k] blk_done_softirq
0.00% 0.00% test [kernel.kallsyms] [k] call_rcu_sched
0.00% 0.00% test [kernel.kallsyms] [k] handle_mm_fault
0.00% 0.00% test [kernel.kallsyms] [k] lru_add_drain_cpu
0.00% 0.00% test [kernel.kallsyms] [k] path_openat
perf report
Usage: perf report [<options>]
-b, --branch-stack use branch records for per branch histogram filling
-c, --comms <comm[,comm...]>
only consider symbols in these comms
-C, --cpu <cpu> list of cpus to profile
-d, --dsos <dso[,dso...]>
only consider symbols in these dsos
-D, --dump-raw-trace dump raw trace in ASCII
-F, --fields <key[,keys...]>
output field(s): overhead, period, sample plus all of sort keys
-f, --force don't complain, do it
-g, --call-graph <print_type,threshold[,print_limit],order,sort_key[,branch],value>
Display call graph (stack chain/backtrace):
print_type: call graph printing style (graph|flat|fractal|folded|none)
threshold: minimum call graph inclusion threshold (<percent>)
print_limit: maximum number of call graph entry (<number>)
order: call graph order (caller|callee)
sort_key: call graph sort key (function|address)
branch: include last branch info to call graph (branch)
value: call graph value (percent|period|count)
Default: graph,0.5,caller,function,percent
-G, --inverted alias for inverted call graph
-i, --input <file> input file name
-I, --show-info Display extended information about perf.data file
-k, --vmlinux <file> vmlinux pathname
-M, --disassembler-style <disassembler style>
Specify disassembler style (e.g. -M intel for intel syntax)
-m, --modules load module symbols - WARNING: use only with -k and LIVE kernel
-n, --show-nr-samples
Show a column with the number of samples
-p, --parent <regex> regex filter to identify parent, see: '--sort parent'
-q, --quiet Do not show any message
-s, --sort <key[,key2...]>
sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... Please refer the man page for the complete list.
-S, --symbols <symbol[,symbol...]>
only consider these symbols
-t, --field-separator <separator>
separator for columns, no spaces will be added between columns '.' is reserved.
-T, --threads Show per-thread event counters
-U, --hide-unresolved
Only display entries resolved to a symbol
-v, --verbose be more verbose (show symbol address, etc)
-w, --column-widths <width[,width...]>
don't try to adjust column width, use these fixed values
-x, --exclude-other Only display entries with parent-match
--asm-raw Display raw encoding of assembly instructions (default)
--branch-history add last branch records to call history
--children Accumulate callchains of children and show total overhead as well
--demangle Disable symbol demangling
--demangle-kernel
Enable kernel symbol demangling
--full-source-path
Show full source file name path for source lines
--group Show event group information together
--gtk Use the GTK2 interface
--header Show data header.
--header-only Show only data header.
--hierarchy Show entries in a hierarchy
--ignore-callees <regex>
ignore callees of these functions in call graphs
--inline Show inline function
--itrace[=<opts>]
Instruction Tracing options
--kallsyms <file>
kallsyms pathname
--max-stack <n> Set the maximum stack depth when parsing the callchain, anything beyond the specified depth will be ignored. Default: kernel.perf_event_max_stack or 127
--mem-mode mem access profile
--objdump <path> objdump binary to use for disassembly and annotations
--percent-limit <percent>
Don't show entries under that percent
--percentage <relative|absolute>
how to display percentage of filtered entries
--pid <pid[,pid...]>
only consider symbols in these pids
--pretty <key> pretty printing style key: normal raw
--raw-trace Show raw trace event output (do not use print fmt or plugins)
--show-cpu-utilization
Show sample percentage for different cpu modes
--show-ref-call-graph
Show callgraph from reference event
--show-total-period
Show a column with the sum of periods
--socket-filter <n>
only show processor socket that match with this filter
--source Interleave source code with assembly code (default)
--stdio Use the stdio interface
--stdio-color <mode>
'always' (default), 'never' or 'auto' only applicable to --stdio mode
--symbol-filter <filter>
only show symbols that (partially) match with this filter
--symfs <directory>
Look for files with symbols relative to this directory
--tid <tid[,tid...]>
only consider symbols in these tids
--time <str> Time span of interest (start,stop)
--tui Use the TUI interface
使用命令perf report -i perf.data > perf.txt
把结果写入文本文件:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 21K of event 'cpu-clock'
# Event count (approx.): 21067000000
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..........................................
#
99.87% test test [.] print
0.04% test [kernel.kallsyms] [k] __do_softirq
0.02% test [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.01% test [kernel.kallsyms] [k] find_busiest_group
0.01% test [kernel.kallsyms] [k] run_timer_softirq
0.00% test [kernel.kallsyms] [k] _raw_qspin_lock
0.00% test [kernel.kallsyms] [k] check_for_new_grace_period.isra.26
0.00% test [kernel.kallsyms] [k] iowrite16
0.00% test [kernel.kallsyms] [k] kmem_cache_free
0.00% test [kernel.kallsyms] [k] mod_timer_pending
0.00% test [kernel.kallsyms] [k] nf_ct_delete_from_lists
0.00% test [kernel.kallsyms] [k] packet_rcv
0.00% test [kernel.kallsyms] [k] rcu_process_callbacks
0.00% test [kernel.kallsyms] [k] rcu_process_gp_end
0.00% test [kernel.kallsyms] [k] rebalance_domains
0.00% test [kernel.kallsyms] [k] source_load
#
# (Tip: Save output of perf stat using: perf stat record <target workload>)
#
- Overhead:指出了该Symbol采样在总采样中所占的百分比。在当前场景下,表示了该Symbol消耗的CPU时间占总CPU时间的百分比
- Command:进程名
- Shared Object:模块名, 比如具体哪个共享库,哪个可执行程序
- Symbol:二进制模块中的符号名,如果是高级语言,比如C语言编写的程序,等价于函数名
在perf report
命令下移动光标到某一函数上敲击Enter
键,还可以更详细地查看分析结果:
Annotate print --- 分析print函数中指令或者代码的性能
Zoom into test thread --- 聚焦到线程 test
Zoom into test DSO --- 聚焦到动态共享对象test
Browse map details --- 查看map
Run scripts for samples of thread [test]--- 针对test线程的采样运行脚本
Run scripts for samples of symbol [test] --- 针对函数的采样运行脚本
Run scripts for all samples --- 针对所有采样运行脚步
Switch to another data file in PWD --- 切换到当前目录中另一个数据文件
Exit
annotate
Usage: perf annotate [<options>]
-C, --cpu <cpu> list of cpus to profile
-d, --dsos <dso[,dso...]>
only consider symbols in these dsos
-D, --dump-raw-trace dump raw trace in ASCII
-f, --force don't complain, do it
-i, --input <file> input file name
-k, --vmlinux <file> vmlinux pathname
-l, --print-line print matching source lines (may be slow)
-M, --disassembler-style <disassembler style>
Specify disassembler style (e.g. -M intel for intel syntax)
-m, --modules load module symbols - WARNING: use only with -k and LIVE kernel
-n, --show-nr-samples
Show a column with the number of samples
-P, --full-paths Don't shorten the displayed pathnames
-q, --quiet do now show any message
-s, --symbol <symbol>
symbol to annotate
-v, --verbose be more verbose (show symbol address, etc)
--asm-raw Display raw encoding of assembly instructions (default)
--group Show event group information together
--gtk Use the GTK interface
--objdump <path> objdump binary to use for disassembly and annotations
--show-total-period
Show a column with the sum of periods
--skip-missing Skip symbols that cannot be annotated
--source Interleave source code with assembly code (default)
--stdio Use the stdio interface
--stdio-color <mode>
'always' (default), 'never' or 'auto' only applicable to --stdio mode
--symfs <directory>
Look for files with symbols relative to this directory
--tui Use the TUI interface
可以直接使用annotate
来单独分析函数信息:
perf annotate -l -s <func>
perf top
用于实时显示当前系统的性能统计信息。该命令主要用来观察整个系统当前的状态,比如可以通过查看该命令的输出来查看当前系统最耗时的内核函数或某个用户进程,支持参数如下:
Usage: perf top [<options>]
-a, --all-cpus system-wide collection from all CPUs
-b, --branch-any sample any taken branches
-c, --count <n> event period to sample
-C, --cpu <cpu> list of cpus to monitor
-d, --delay <n> number of seconds to delay between refreshes
-D, --dump-symtab dump the symbol table used for profiling
-E, --entries <n> display this many functions
-e, --event <event> event selector. use 'perf list' to list available events
-f, --count-filter <n>
only display functions with more events than this
-F, --freq <n> profile at this frequency
-g enables call-graph recording and display
-i, --no-inherit child tasks do not inherit counters
-j, --branch-filter <branch filter mask>
branch stack filter modes
-K, --hide_kernel_symbols
hide kernel symbols
-k, --vmlinux <file> vmlinux pathname
-M, --disassembler-style <disassembler style>
Specify disassembler style (e.g. -M intel for intel syntax)
hide kernel symbols
-k, --vmlinux <file> vmlinux pathname
-M, --disassembler-style <disassembler style>
Specify disassembler style (e.g. -M intel for intel syntax)
-m, --mmap-pages <pages>
number of mmap data pages
-n, --show-nr-samples
Show a column with the number of samples
-p, --pid <pid> profile events on existing process id
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
-s, --sort <key[,key2...]>
sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... Please refer the man page for the complete list.
-t, --tid <tid> profile events on existing thread id
-U, --hide_user_symbols
hide user symbols
-u, --uid <user> user to profile
-v, --verbose be more verbose (show counter open errors, etc)
-w, --column-widths <width[,width...]>
don't try to adjust column width, use these fixed values
-z, --zero zero history across updates
--asm-raw Display raw encoding of assembly instructions (default)
--call-graph <record_mode[,record_size],print_type,threshold[,print_limit],order,sort_key[,branch]>
setup and enables call-graph (stack chain/backtrace):
record_mode: call graph recording mode (fp|dwarf|lbr)
record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)
default: 8192 (bytes)
record_mode: call graph recording mode (fp|dwarf|lbr)
record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)
default: 8192 (bytes)
record_mode: call graph recording mode (fp|dwarf|lbr)
record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)
default: 8192 (bytes)
record_mode: call graph recording mode (fp|dwarf|lbr)
record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)
default: 8192 (bytes)
-z, --zero zero history across updates
record_mode: call graph recording mode (fp|dwarf|lbr)
record_size: if record_mode is 'dwarf', max size of stack recording (<bytes>)
default: 8192 (bytes)
print_type: call graph printing style (graph|flat|fractal|folded|none)
threshold: minimum call graph inclusion threshold (<percent>)
print_limit: maximum number of call graph entry (<number>)
order: call graph order (caller|callee)
sort_key: call graph sort key (function|address)
branch: include last branch info to call graph (branch)
value: call graph value (percent|period|count)
Default: fp,graph,0.5,caller,function
--children Accumulate callchains of children and show total overhead as well
--percent-limit <percent>
Don't show entries under that percent
--percentage <relative|absolute>
How to display percentage of filtered entries
--proc-map-timeout <n>
per thread proc mmap processing timeout in ms
--raw-trace Show raw trace event output (do not use print fmt or plugins)
--show-total-period
Show a column with the sum of periods
--source Interleave source code with assembly code (default)
--stdio Use the stdio interface
--sym-annotate <symbol name>
symbol to annotate
--symbols <symbol[,symbol...]>
only consider these symbols
--tui Use the TUI interface
常用选项:
-e <event>:指明要分析的性能事件。
-p <pid>:Profile events on existing Process ID (comma sperated list). 仅分析目标进程及其创建的线程。
-k <path>:Path to vmlinux. Required for annotation functionality. 带符号表的内核映像所在的路径。
-K:不显示属于内核或模块的符号。
-U:不显示属于用户态程序的符号。
-d <n>:界面的刷新周期,默认为2s,因为perf top默认每2s从mmap的内存区域读取一次性能数据。
-g:得到函数的调用关系图
Samples: 290K of event 'cpu-clock', Event count (approx.): 21181763519
Overhead Shared Object Symbol
74.18% test [.] print
3.76% [kernel] [k] _raw_spin_unlock_irqrestore
1.47% [kernel] [k] __do_softirq
0.70% [kernel] [k] run_timer_softirq
0.55% [kernel] [k] finish_task_switch
0.52% [kernel] [k] __do_page_fault
0.34% [kernel] [k] rcu_process_callbacks
0.31% [kernel] [k] get_vmalloc_info
0.27% [kernel] [k] find_busiest_group
0.17% [kernel] [k] rcu_gp_kthread
0.15% [kernel] [k] free_hot_cold_page
0.15% [kernel] [k] copy_page_rep
0.14% [unknown] [.] 0x00007f0f77aaab69
0.13% [kernel] [k] kmem_cache_free
0.13% [kernel] [k] clear_page_c_e
0.13% [kernel] [k] system_call_after_swapgs
0.12% [kernel] [k] rcu_process_gp_end
0.12% [unknown] [.] 0x00007f0f77aaab6c
0.12% [kernel] [k] _raw_qspin_lock
0.11% [kernel] [k] tick_nohz_idle_enter
0.11% [kernel] [k] copy_user_enhanced_fast_string
0.11% [unknown] [.] 0x00007f0f77aaab7a
0.11% [kernel] [k] unmap_page_range
0.11% [kernel] [k] __slab_free
0.10% [kernel] [k] get_page_from_freelist
0.10% [kernel] [k] load_balance
0.10% [kernel] [k] rebalance_domains
0.10% [.] __pthread_rwlock_unlock
0.10% [kernel] [k] check_for_new_grace_period.isra.26
0.09% rg [.] __pthread_mutex_unlock
0.09% [unknown] [.] 0x00007f0f77aaad26
0.09% [unknown] [.] 0x00007f0f77aaab7d
0.09% [unknown] [.] 0x00007f0f77aaad29
0.09% [kernel] [k] file_free_rcu
0.09% [kernel] [k] free_page_and_swap_cache
0.09% [unknown] [.] 0x00007f0f77aaab76
0.09% [unknown] [.] 0x00007f0f77aaab60
0.09% [kernel] [k] rcu_idle_exit
0.08% [unknown] [.] 0x00007f0f77aaabb6
0.08% [kernel] [k] __find_get_page
0.08% [unknown] [.] 0x00007f0f77aaabcd
- 第一列:符号引发的性能事件的比例,指占用的cpu周期比例
- 第二列:符号所在的DSO(Dynamic Shared Object),可以是应用程序、内核、动态链接库、模块
- 第三列:DSO的类型,[.]表示此符号属于用户态的ELF文件,包括可执行文件与动态链接库;[k]表述此符号属于内核或模块
- 第四列:符号名,有些符号不能解析为函数名,只能用地址表示
perf top命令和linux下的top命令有点相似,实时打印出系统中被采样事件的状态和统计数据。perf top主要用于实时剖析各个函数在某个性能 事件(event)上的热度,默认的event是cycles(cpu周期数),这样可以检测系统中所有应用层和内核层函数的热度。
perf top支持两种输出界面,tui和tty,默认是tui,因为tui需要更多的环境和库支持,所以经常出现乱码问题,所以本文都是基于tty界面分析(–stdio)。
直接执行perf top
监控的是整个系统中所有进程的状态,多数情况我们只关心某个进程,或者想定位某个线程的性能问题,perf top都是支持的(-p / -t)。
需要进入函数内部一探究竟,有时对于像上面的DH_SSM_BLKBUF_ALLOC
这样的函数的调用堆栈,以定位到是哪里在频繁调用。这时候可以执行:
perf top -t 31404 --stdio -g -K
-g
参数就是显示函数的调用堆栈,-k
是为了只输出应用层函数
perf top --call-graph [fractal]
路径概率为相对值,加起来为100%,调用顺序为从下往上
perf top --call-graph graph
路径概率为绝对值,加起来为该函数的热度
关于perf top界面常用命令如下:
h:显示帮助,即可显示详细的帮助信息。
UP/DOWN/PGUP/PGDN/SPACE:上下和翻页。
a:annotate current symbol,注解当前符号。能够给出汇编语言的注解,给出各条指令的采样率。
d:过滤掉所有不属于此DSO的符号。非常方便查看同一类别的符号。
P:将当前信息保存到perf.hist.N中
perf probe
能够动态地在想查看的地方插入动态监测点
perf prob schedule:12 cpu
在内核函数 schedule() 的第 12 行处加入了一个动态 probe 点,和 tracepoint 的功能一样,内核一旦运行到该 probe 点时,便会通知 perf。可以理解为动态增加了一个新的 tracepoint
perf sched
perf sched
提供了许多工具来分析内核CPU调度器的行为,可以用它来识别和量化调度器延迟的问题
- perf sched专门用于跟踪/测量调度器,包括延时等
- perf sched record <command>:记录测试过程中的调度事件
- perf sched latency:报告线程调度延时和其他调度相关属性
- perf sched script:查看执行过程中详细的trace信息
- perf sched replay:回放perf sched record录制的执行过程
- perf sched map:用字符表示打印上下文切换
执行sudo perf sched record ls后,通过不同方式查看结果
perf sched record sleep 10
perf sched latency --sort max
-----------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |
-----------------------------------------------------------------------------------------------------------------
:14:14 | 0.323 ms | 10 | avg: 1.417 ms | max: 5.016 ms | max at: 2133786.303716 s
:34:34 | 0.224 ms | 10 | avg: 1.043 ms | max: 5.013 ms | max at: 2133789.922667 s
:9:9 | 11.056 ms | 1653 | avg: 0.012 ms | max: 4.005 ms | max at: 2133795.230575 s
:20834:20834 | 1.977 ms | 10 | avg: 0.737 ms | max: 3.854 ms | max at: 2133792.826450 s
:44:44 | 0.182 ms | 4 | avg: 1.231 ms | max: 3.436 ms | max at: 2133788.327110 s
:4603:4603 | 23.657 ms | 49 | avg: 0.073 ms | max: 3.362 ms | max at: 2133795.242342 s
:24:24 | 0.375 ms | 15 | avg: 0.604 ms | max: 3.015 ms | max at: 2133795.072595 s
:19:19 | 0.330 ms | 13 | avg: 0.517 ms | max: 2.981 ms | max at: 2133792.114617 s
:28878:28878 | 0.112 ms | 10 | avg: 0.859 ms | max: 2.894 ms | max at: 2133793.591518 s
:18203:18203 | 0.307 ms | 25 | avg: 0.264 ms | max: 2.846 ms | max at: 2133792.579451 s
grep:(29) | 0.000 ms | 1650 | avg: 0.015 ms | max: 2.683 ms | max at: 2133792.059059 s
:4600:4600 | 24.257 ms | 47 | avg: 0.060 ms | max: 2.642 ms | max at: 2133795.230570 s
:11084:11084 | 19.280 ms | 1575 | avg: 0.008 ms | max: 2.636 ms | max at: 2133792.059051 s
:4310:4310 | 1.744 ms | 3 | avg: 1.049 ms | max: 2.562 ms | max at: 2133787.860902 s
:4607:4607 | 13.711 ms | 26 | avg: 0.282 ms | max: 2.376 ms | max at: 2133795.244725 s
:11112:11112 | 0.304 ms | 1 | avg: 2.362 ms | max: 2.362 ms | max at: 2133792.058750 s
:4596:4596 | 12.092 ms | 248 | avg: 0.039 ms | max: 2.171 ms | max at: 2133795.238976 s
:4267:4267 | 1.886 ms | 3 | avg: 0.726 ms | max: 2.163 ms | max at: 2133786.389122 s
:20841:20841 | 11.626 ms | 15 | avg: 0.260 ms | max: 2.082 ms | max at: 2133792.815333 s
:20834:20834 | 1.977 ms | 10 | avg: 0.737 ms | max: 3.854 ms | max at: 2133792.826450 s
:44:44 | 0.182 ms | 4 | avg: 1.231 ms | max: 3.436 ms | max at: 2133788.327110 s
:4603:4603 | 23.657 ms | 49 | avg: 0.073 ms | max: 3.362 ms | max at: 2133795.242342 s
:24:24 | 0.375 ms | 15 | avg: 0.604 ms | max: 3.015 ms | max at: 2133795.072595 s
:19:19 | 0.330 ms | 13 | avg: 0.517 ms | max: 2.981 ms | max at: 2133792.114617 s
:28878:28878 | 0.112 ms | 10 | avg: 0.859 ms | max: 2.894 ms | max at: 2133793.591518 s
:18203:18203 | 0.307 ms | 25 | avg: 0.264 ms | max: 2.846 ms | max at: 2133792.579451 s
:4235:4235 | 1.522 ms | 3 | avg: 0.553 ms | max: 1.618 ms | max at: 2133786.185985 s
:4493:4493 | 1.559 ms | 3 | avg: 0.543 ms | max: 1.610 ms | max at: 2133791.087969 s
:20838:20838 | 0.694 ms | 90 | avg: 0.028 ms | max: 1.554 ms | max at: 2133792.814760 s
:4606:4606 | 20.843 ms | 38 | avg: 0.131 ms | max: 1.532 ms | max at: 2133795.221587 s
:21475:21475 | 1.136 ms | 33 | avg: 0.051 ms | max: 1.532 ms | max at: 2133792.814738 s
:511:511 | 10.726 ms | 40 | avg: 0.262 ms | max: 1.521 ms | max at: 2133795.228573 s
ipset:(28) | 0.000 ms | 473 | avg: 0.035 ms | max: 1.482 ms | max at: 2133790.437174 s
:7351:7351 | 37.613 ms | 43 | avg: 0.089 ms | max: 1.468 ms | max at: 2133790.594637 s
:7274:7274 | 43.539 ms | 50 | avg: 0.035 ms | max: 1.468 ms | max at: 2133790.437176 s
:23220:23220 | 0.205 ms | 14 | avg: 0.148 ms | max: 1.448 ms | max at: 2133790.570121 s
:28112:28112 | 18.778 ms | 289 | avg: 0.015 ms | max: 1.417 ms | max at: 2133792.059010 s
- Task:进程的名字和 pid
- Runtime:实际运行时间
- Switches:进程切换的次数
- Average delay:平均的调度延迟
- Maximum delay:最大延迟
perf sched map
perf sched record sleep 10
perf sched map
显示了所有的cpu和上下文事件
*. . . . C0 . 2133785.243718 secs
. *G0 . . C0 . 2133785.244558 secs G0 => :16740:16740
*H0 . G0 . . C0 . 2133785.244561 secs H0 => :24283:24283
H0 . G0 . *I0 C0 . 2133785.244572 secs I0 => :16047:16047
H0 . *. . I0 C0 . 2133785.244582 secs
*. . . . I0 C0 . 2133785.244585 secs
. . . . I0 C0 *J0 2133785.244594 secs J0 => :24282:24282
. . . . *. C0 J0 2133785.244594 secs
. . . . . C0 *. 2133785.244615 secs
. . . . . *. . 2133785.244748 secs
. *A0 . . . . . 2133785.246709 secs
. *. . . . . . 2133785.246713 secs
. *A0 . . . . . 2133785.249709 secs
. *. . . . . . 2133785.249715 secs
. *A0 . . . . . 2133785.252711 secs
. *. . . . . . 2133785.252715 secs
. . . . *K0 . . 2133785.252788 secs K0 => :31257:31257
. . . . *. . . 2133785.252800 secs
. . . . . *L0 . 2133785.257075 secs L0 => :28112:28112
*M0 . . . . L0 . 2133785.257107 secs M0 => :11084:11084
M0 . . . . *. . 2133785.257121 secs
M0 *N0 . . . . . 2133785.257121 secs N0 => :11137:11137
*. N0 . . . . . 2133785.257122 secs
. N0 . . . . . *O0 2133785.257132 secs O0 => :11121:11121
. N0 . . . . . *. 2133785.257145 secs
. N0 *P0 . . . . . 2133785.257158 secs P0 => :11081:11081
. *. P0 . . . . . 2133785.257161 secs
. . *. . . . . . 2133785.257167 secs
*M0 . . . . . . . 2133785.257199 secs
*. . . . . . . . 2133785.257203 secs
. . . . *Q0 . . . 2133785.270713 secs Q0 => :4953:4953
. . . . *. . . . 2133785.270729 secs
*R0 . . . . . . . 2133785.278385 secs R0 => :19879:19879
R0 *S0 . . . . . . 2133785.278427 secs S0 => :10538:10538
R0 S0 *T0 . . . . . 2133785.278430 secs T0 => :10550:10550
*. S0 T0 . . . . . 2133785.278443 secs
. *. T0 . . . . . 2133785.278444 secs
. . *. . . . . . 2133785.278444 secs
. . . . . . *U0 . 2133785.278445 secs U0 => :13461:13461
. . . . . . *. . 2133785.278460 secs
. *S0 . . . . . . 2133785.278526 secs
. *. . . . . . . 2133785.278533 secs
. . . . *V0 . . . 2133785.279276 secs V0 => :11903:11903
*M0 . . . V0 . . . 2133785.279299 secs
*. . . . V0 . . . 2133785.279308 secs
. . . . *. . . . 2133785.279317 secs
每个CPU一列,星号表示调度事件正在发生的CPU,点号表示CPU正在IDLE
Map 的好处在于提供了一个的总的视图,将成百上千的调度事件进行总结,显示了系统任务在 CPU 之间的分布,假如有不好的调度迁移,比如一个任务没有被及时迁移到 idle 的 CPU 却被迁移到其他忙碌的 CPU,类似这种调度器的问题可以从 map 的报告中一眼看出来
perf sched script
perf sched script
能看到更详细的sched信息,包括sched_wakeup/sched_switch等等。每一列的含义依次是:进程名/pid/CPU ID/时间戳
perf sched record sleep 10
perf sched script
显示调度相关的事件
swapper 0 [006] 2133785.242711: sched:sched_switch: swapper/6:0 [120] R ==> rcu_sched:9 [120]
swapper 0 [006] 2133785.242716: sched:sched_switch: rcu_sched:9 [120] S ==> swapper/6:0 [120]
perf 14836 [001] 2133785.243035: sched:sched_wakeup: node:461 [120] success=1 CPU:002
swapper 0 [002] 2133785.243039: sched:sched_switch: swapper/2:0 [120] R ==> node:461 [120]
perf 14836 [001] 2133785.243044: sched:sched_wakeup: perf:4209 [120] success=1 CPU:003
swapper 0 [003] 2133785.243052: sched:sched_switch: swapper/3:0 [120] R ==> perf:4209 [120]
node 16488 [002] 2133785.243056: sched:sched_wakeup: node:460 [120] success=1 CPU:004
swapper 0 [004] 2133785.243060: sched:sched_switch: swapper/4:0 [120] R ==> node:460 [120]
node 16488 [002] 2133785.243062: sched:sched_stat_runtime: comm=node pid=461 runtime=31449 [ns] vruntime=1973914749726 [ns]
node 16488 [002] 2133785.243064: sched:sched_switch: node:461 [120] S ==> swapper/2:0 [120]
node 16487 [004] 2133785.243064: sched:sched_stat_runtime: comm=node pid=460 runtime=12587 [ns] vruntime=441872725215 [ns]
node 16487 [004] 2133785.243066: sched:sched_switch: node:460 [120] S ==> swapper/4:0 [120]
perf 14836 [001] 2133785.243069: sched:sched_wakeup: node:461 [120] success=1 CPU:002
swapper 0 [002] 2133785.243072: sched:sched_switch: swapper/2:0 [120] R ==> node:461 [120]
node 16488 [002] 2133785.243087: sched:sched_wakeup: node:460 [120] success=1 CPU:004
swapper 0 [004] 2133785.243090: sched:sched_switch: swapper/4:0 [120] R ==> node:460 [120]
node 16487 [004] 2133785.243092: sched:sched_stat_runtime: comm=node pid=460 runtime=8170 [ns] vruntime=441872733385 [ns]
node 16487 [004] 2133785.243093: sched:sched_switch: node:460 [120] S ==> swapper/4:0 [120]
node 16488 [002] 2133785.243098: sched:sched_wakeup: node:460 [120] success=1 CPU:004
可以查看schedule的trace
perf sched timehist
该命令可以获得task的wait time, 特别地, 还能拿到sch delay。timehist统计的sch delay是通过sched_switch和sched_wakeup计算出来的, 而不是stat_wait。
root: /5g_build/5g_Main/$ perf sched record sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.498 MB perf.data (9250 samples) ]
root: /5g_build/5g_Main/$ perf sched timehist
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
2155009.792729 [0005] <idle> 0.000 0.000 0.000
2155009.792733 [0005] swapper 0.000 0.000 0.003
2155009.793148 [0004] <idle> 0.000 0.000 0.000
2155009.793162 [0005] <idle> 0.003 0.000 0.429
2155009.793169 [0004] node[16488/16476] 0.000 0.000 0.020
2155009.793170 [0006] <idle> 0.000 0.000 0.000
2155009.793184 [0006] node[16487/16476] 0.000 0.000 0.013
2155009.793191 [0004] <idle> 0.020 0.000 0.022
2155009.793207 [0004] node[16488/16476] 0.022 0.000 0.016
2155009.793208 [0006] <idle> 0.013 0.000 0.023
2155009.793215 [0006] node[16487/16476] 0.023 0.000 0.007
2155009.793216 [0004] <idle> 0.016 0.000 0.008
2155009.793232 [0004] node[16488/16476] 0.008 0.000 0.015
2155009.793232 [0006] <idle> 0.007 0.000 0.017
2155009.793239 [0006] node[16487/16476] 0.017 0.000 0.006
2155009.793243 [0004] <idle> 0.015 0.000 0.011
2155009.793258 [0004] node[16488/16476] 0.011 0.000 0.015
2155009.793259 [0006] <idle> 0.006 0.000 0.020
2155009.793265 [0006] node[16487/16476] 0.020 0.000 0.006
2155009.793267 [0004] <idle> 0.015 0.000 0.008
2155009.793283 [0004] node[16488/16476] 0.008 0.000 0.015
2155009.793283 [0006] <idle> 0.006 0.000 0.018
2155009.793290 [0006] node[16487/16476] 0.018 0.000 0.006
2155009.793302 [0004] <idle> 0.015 0.000 0.019
2155009.793315 [0005] perf[32192] 0.000 0.000 0.152
2155009.793318 [0006] <idle> 0.006 0.000 0.028
2155009.793325 [0006] node[16487/16476] 0.028 0.000 0.006
2155009.793327 [0005] swapper 0.582 0.000 0.011
2155009.793334 [0007] <idle> 0.000 0.000 0.000
2155009.793336 [0006] <idle> 0.006 0.000 0.011
2155009.793342 [0006] node[16487/16476] 0.011 0.000 0.006
2155009.793351 [0006] <idle> 0.006 0.000 0.008
2155009.793358 [0006] node[16487/16476] 0.008 0.000 0.006
2155009.793366 [0006] <idle> 0.006 0.000 0.008
2155009.793373 [0006] node[16487/16476] 0.008 0.000 0.006
perf lock·
锁是内核同步的方法,一旦加了锁,其他准备加锁的内核执行路径就必须等待,降低了并行。因此对于锁进行专门分析应该是调优的一项重要工作
- “Name”: 锁的名字,比如 md->map_lock,即定义在 dm.c 结构 mapped_device 中的读写锁
- “acquired”: 该锁被直接获得的次数,即没有其他内核路径拥有该锁的情况下得到该锁的次数
- “contended”冲突的次数,即在准备获得该锁的时候已经被其他人所拥有的情况的出现次数
- “total wait”:为了获得该锁,总共的等待时间
- “max wait”:为了获得该锁,最大的等待时间
- “min wait”:为了获得该锁,最小的等待时间
perf kmem
Perf Kmem 专门收集内核 slab 分配器的相关事件。比如内存分配,释放等。可以用来研究程序在哪里分配了大量内存,或者在什么地方产生碎片之类的和内存管理相关的问题
-
perf kmem record
: 抓取命令的内核slab分配器事件 -
perf kmem stat
: 生成内核slab分配器统计信息
Usage: perf kmem [<options>] {record|stat}
-f, --force don't complain, do it
-i, --input <file> input file name
-l, --line <num> show n lines
-s, --sort <key[,key2...]>
sort by keys: ptr, callsite, bytes, hit, pingpong, frag, page, order, migtype, gfp
-v, --verbose be more verbose (show symbol address, etc)
--alloc show per-allocation statistics
--caller show per-callsite statistics
--live Show live page stat
--page Analyze page allocator
--raw-ip show raw ip instead of symbol
--slab Analyze slab allocator
--time <str> Time span of interest (start,stop)
首先用record
命令记录kmem events
root: /5g_build/5g_Main/$ perf kmem record --force sleep 1
[ perf record: Woken up 0 times to write data ]
Warning:
186 out of order events recorded.
[ perf record: Captured and wrote 144.950 MB perf.data (1574498 samples) ]
再用perf kmem stat --caller
命令查看: (--caller
显示每个调用点统计信息,--alloc
显示每次内存分配事件)
Warning:
186 out of order events recorded.
---------------------------------------------------------------------------------------------------------
Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
---------------------------------------------------------------------------------------------------------
cgroup_pidlist_open+8b | 8/8 | 4/4 | 1 | 0 | 50.000%
audit_log_d_path+46 | 16384/8192 | 8214/4107 | 2 | 0 | 49.866%
sk_prot_alloc+c3 | 5310464/2048 | 2758952/1064 | 2593 | 5 | 48.047%
pskb_expand_head+6a | 5230592/1019 | 2948288/574 | 5132 | 15 | 43.634%
show_uevent+7b | 53248/4096 | 30056/2312 | 13 | 0 | 43.555%
alloc_pipe_info+a7 | 475136/1024 | 296960/640 | 464 | 221 | 37.500%
proc_reg_open+32 | 1984/64 | 1240/40 | 31 | 0 | 37.500%
alloc_cpumask_var_node+1f | 4096/1024 | 2560/640 | 4 | 0 | 37.500%
get_mountpoint+2f | 128/64 | 80/40 | 2 | 0 | 37.500%
key_alloc+12d | 8/8 | 5/5 | 1 | 0 | 37.500%
assoc_array_insert+56 | 512/512 | 328/328 | 1 | 0 | 35.938%
cgroup_alloc_name.isra.22+1f | 96/48 | 67/33 | 2 | 0 | 30.208%
seq_open+106 | 194880/192 | 138040/136 | 1015 | 0 | 29.167%
hash_ipport_create+cd | 1344/192 | 952/136 | 7 | 0 | 29.167%
sys_name_to_handle_at+df | 192/192 | 136/136 | 1 | 0 | 29.167%
__alloc_skb+8d | 76957184/2568 | 54961280/1834 | 29962 | 245 | 28.582%
ip_vs_add_dest+232 | 7168/512 | 5264/376 | 14 | 0 | 26.562%
alloc_pipe_info+46 | 89088/192 | 66816/144 | 464 | 221 | 25.000%
alloc_fdtable+57 | 14528/64 | 10896/48 | 227 | 181 | 25.000%
allocate_cg_links+55 | 5824/64 | 4368/48 | 91 | 0 | 25.000%
proc_self_follow_link+66 | 384/16 | 288/12 | 24 | 0 | 25.000%
kernfs_fop_open+1c8 | 1152/64 | 864/48 | 18 | 0 | 25.000%
bitmap_port_create+a9 | 768/192 | 576/144 | 4 | 0 | 25.000%
kstrdup_const+23 | 32/8 | 24/6 | 4 | 0 | 25.000%
ext4_readdir+716 | 192/64 | 144/48 | 3 | 0 | 25.000%
___sys_sendmsg+21e | 384/192 | 288/144 | 2 | 0 | 25.000%
hash_ipportip_create+cd | 384/192 | 288/144 | 2 | 0 | 25.000%
scm_fp_dup+28 | 64/32 | 48/24 | 2 | 2 | 25.000%
strndup_user+4b | 16/8 | 12/6 | 2 | 0 | 25.000%
_autofs_dev_ioctl+c0 | 96/48 | 73/36 | 2 | 0 | 23.958%
load_elf_interp.constprop.10+84 | 120320/512 | 92120/392 | 235 | 0 | 23.438%
sget_userns+c2 | 2048/2048 | 1600/1600 | 1 | 0 | 21.875%
hash_ipportnet_create+d2 | 512/512 | 400/400 | 1 | 0 | 21.875%
groups_alloc+34 | 192/192 | 152/152