Note: in Java, when we talk about NIO/IO we usually mean network I/O, whereas this article focuses on file I/O; the two are not fundamentally different. For a discussion of Java NIO, see "java nio深入理解之MMAP与ByteBuffer、DirectBuffer".
In practice, the vast majority of business development does not need to pay much attention to I/O mechanics. But once heavy file processing or I/O-intensive work is involved, a deep understanding of the I/O stack really pays off. Conversely, without enough data volume and real scenarios, it is hard to truly appreciate the reasoning behind these designs. So this article starts from practice.
The figure below clearly illustrates the control flow and data flow when an application issues an I/O operation:
Introduction to direct I/O
By default, Linux uses memory not needed by processes for the page cache, because memory is several orders of magnitude faster than reading and writing the disk directly, even on SSDs. For example:
06:50:01  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
07:00:01  1002144 44274036 97.79 486416 17717204 28436468 51.22
07:10:01  1008924 44267256 97.77 486420 17717864 28436816 51.22
07:20:01  1007176 44269004 97.78 486420 17718524 28437600 51.22
07:30:01  1000732 44275448 97.79 486424 17719140 28438832 51.23
Average:  1022556 44253624 97.74 486373 17696996 28433150 51.22

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)  01/12/20  _x86_64_  (16 CPU)

07:35:16  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
07:35:19  19432976 25843204 57.08 7968 215188 28435792 51.22
07:35:22  19432744 25843436 57.08 7984 215380 28436616 51.22
07:35:25  19431932 25844248 57.08 8024 215348 28435768 51.22
Although the page cache greatly improves performance, it is not without drawbacks, especially for data that is read or written only once, or for applications such as Oracle/MySQL that cache data themselves. If the OS caches everything, the application itself may run short of memory (particularly when the application cannot lock memory at startup, or in multi-application environments where that is not an option).
Let's first look at the performance difference between direct I/O and buffered I/O, using dd for the tests.
# initial page cache usage
[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)  01/12/20  _x86_64_  (16 CPU)

07:35:16  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
07:35:19  19432976 25843204 57.08 7968 215188 28435792 51.22
07:35:22  19432744 25843436 57.08 7984 215380 28436616 51.22
Buffered mode, write test.
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=./a.dat bs=4k count=1M
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 4.45434 s, 964 MB/s

####### usage after the write completes: roughly 4.2 GB cached, essentially matching the dd result, i.e. almost the entire file ended up in the OS page cache
07:36:07  19427920 25848260 57.09 9332 216000 28435792 51.22
07:36:10  19249908 26026272 57.48 9612 392076 28436136 51.22
07:36:13  16245884 29030296 64.12 9776 3312448 28436136 51.22
07:36:16  15115508 30160672 66.61 9948 4410488 28436828 51.22
07:36:19  15115916 30160264 66.61 10000 4410560 28435804 51.22
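The same behavior is easy to reproduce from Java: an ordinary FileOutputStream write lands in the page cache first and is flushed back lazily. Below is a minimal sketch, assuming Linux (it parses /proc/meminfo); the file name and 1 GB size are arbitrary choices for the experiment:

```java
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PageCacheGrowth {
    // Parse the "Cached:" line of /proc/meminfo (value is reported in kB).
    static long cachedKb() throws Exception {
        for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
            if (line.startsWith("Cached:")) {
                return Long.parseLong(line.replaceAll("\\D+", ""));
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        long before = cachedKb();
        byte[] block = new byte[4 * 1024];
        try (FileOutputStream out = new FileOutputStream("a.dat")) {
            for (int i = 0; i < 256 * 1024; i++) {   // 256K * 4 KB = 1 GB
                out.write(block);                     // buffered: goes to the page cache
            }
        }
        long after = cachedKb();
        System.out.printf("page cache grew by ~%d MB%n", (after - before) / 1024);
    }
}
```

Just like the dd run above, the reported growth should be close to the size of the file written.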
Buffered mode, read test.
Direct I/O mode, write test.
[root@hs-test-10-20-30-16 ~]# echo 1 > /proc/sys/vm/drop_caches
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 ~]# rm -rf a.dat
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=./a.dat bs=4k count=1M oflag=direct   # non-SSD disk, far too slow...
^C33848+0 records in
33848+0 records out
138641408 bytes (139 MB) copied, 141.044 s, 983 kB/s

######### the page cache barely grows, but this sample is too small; verified again on SSD below...
[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)  01/12/20  _x86_64_  (16 CPU)

07:37:59  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
07:38:02  19445272 25830908 57.05 3496 209396 28437716 51.22
07:38:05  19445692 25830488 57.05 3528 209400 28437716 51.22
07:38:08  19446224 25829956 57.05 3588 209496 28435980 51.22
07:38:11  19445984 25830196 57.05 3692 209436 28436016 51.22
07:38:14  19445280 25830900 57.05 3764 209624 28437556 51.22
07:38:17  19445384 25830796 57.05 3964 209648 28436008 51.22
07:38:20  19445856 25830324 57.05 3988 209680 28436008 51.22
07:38:23  19446144 25830036 57.05 3996 209852 28436008 51.22
07:38:26  19420724 25855456 57.11 4052 211500 28567488 51.46
07:38:29  19442600 25833580 57.06 4172 213200 28435996 51.22
07:38:32  19442276 25833904 57.06 4332 213292 28436632 51.22
07:38:35  19442784 25833396 57.06 4356 213304 28436632 51.22
07:38:38  19442564 25833616 57.06 4420 213336 28436020 51.22
07:38:41  19439588 25836592 57.06 5268 213620 28436576 51.22
07:38:44  19439272 25836908 57.07 5428 213720 28439096 51.23
07:38:47  19439028 25837152 57.07 5644 213792 28438580 51.23
07:38:50  19439276 25836904 57.07 5700 213820 28437556 51.22
07:38:53  19439324 25836856 57.07 5708 214004 28437560 51.22
07:38:56  19439328 25836852 57.06 5748 214036 28437552 51.22
07:38:59  19439336 25836844 57.06 5788 214036 28437552 51.22
07:39:02  19436836 25839344 57.07 5968 214064 28439620 51.23
07:39:05  19437088 25839092 57.07 5992 214072 28439620 51.23
07:39:08  19438556 25837624 57.07 6060 214104 28437588 51.22
07:39:11  19439016 25837164 57.07 6164 214044 28437588 51.22
07:39:14  19439084 25837096 57.07 6228 214200 28439108 51.23
07:39:17  19439704 25836476 57.06 6420 214220 28437584 51.22
07:39:20  19439792 25836388 57.06 6428 214256 28437584 51.22
07:39:23  19439912 25836268 57.06 6436 214424 28437564 51.22
07:39:26  19440320 25835860 57.06 6492 214440 28437556 51.22
07:39:29  19439436 25836744 57.06 6724 214408 28437536 51.22
07:39:32  19439136 25837044 57.07 6868 214484 28438172 51.23
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=/storage/backupfile/a.dat bs=4k count=1M oflag=direct   # test on SSD
^C58586+0 records in
58586+0 records out
239968256 bytes (240 MB) copied, 41.796 s, 5.7 MB/s

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)  01/12/20  _x86_64_  (16 CPU)

07:40:49  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
07:40:52  19434196 25841984 57.08 9524 216276 28438172 51.23
07:40:55  19435088 25841092 57.07 9548 216260 28437328 51.22
07:40:58  19434588 25841592 57.08 9644 216340 28437668 51.22
07:41:01  19433780 25842400 57.08 9784 216356 28438304 51.23
07:41:04  19430940 25845240 57.08 9840 216392 28439812 51.23
07:41:07  19431788 25844392 57.08 9904 216416 28439192 51.23
07:41:10  19432324 25843856 57.08 9932 216452 28437684 51.22
07:41:13  19432448 25843732 57.08 10060 216440 28437672 51.22
07:41:16  19432292 25843888 57.08 10232 216504 28438692 51.23
07:41:19  19431964 25844216 57.08 10288 216524 28437668 51.22
07:41:22  19431932 25844248 57.08 10316 216696 28438504 51.23
07:41:25  19432580 25843600 57.08 10324 216708 28437668 51.22
07:41:28  19430640 25845540 57.08 10396 216740 28437636 51.22
07:41:31  19430316 25845864 57.08 10572 216752 28438272 51.23
07:41:34  19430592 25845588 57.08 10604 216772 28438272 51.23
07:41:37  19430408 25845772 57.08 10668 216796 28437628 51.22
07:41:40  19430940 25845240 57.08 10804 216816 28437320 51.22
07:41:43  19431108 25845072 57.08 10972 216824 28437336 51.22
07:41:46  19430932 25845248 57.08 11148 216844 28438348 51.23
07:41:49  19430864 25845316 57.08 11316 216908 28437328 51.22
As the results show, without the page cache, direct I/O with small (4K/8K) blocks, i.e. effectively random I/O from the device's point of view, performs very poorly. Large blocks (sequential I/O) are much faster. For example:
[root@hs-test-10-20-30-16 ~]# echo 1 > /proc/sys/vm/drop_caches
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=./a.dat bs=1M count=4K oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 39.981 s, 107 MB/s
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=/storage/backupfile/a.dat bs=1M count=4K oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 20.3327 s, 211 MB/s

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)  01/12/20  _x86_64_  (16 CPU)

08:22:45  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
08:22:48  19435768 25840412 57.07 4732 213608 28438956 51.23
08:22:51  19435788 25840392 57.07 4772 213816 28439804 51.23
08:22:54  19435844 25840336 57.07 4788 213840 28438968 51.23
08:22:57  19437232 25838948 57.07 4864 213860 28437596 51.22
Kafka's well-known claim that sequential disk I/O can rival or beat memory rests on exactly this principle.
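The block-size effect can also be observed from Java. The following is a minimal sketch, not true direct I/O but synchronous writes (StandardOpenOption.DSYNC), which likewise removes the benefit of write-back caching, so small blocks pay the per-I/O cost on every write; the file name and sizes are arbitrary:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class BlockSizeEffect {
    // Write totalBytes synchronously using the given block size and return MB/s,
    // so that 4 KB and 1 MB blocks can be compared directly.
    static double writeSync(String file, int blockSize, long totalBytes) throws Exception {
        ByteBuffer buf = ByteBuffer.allocateDirect(blockSize);
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(Paths.get(file),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.DSYNC)) {
            for (long written = 0; written < totalBytes; written += blockSize) {
                buf.clear();
                // Each write must reach the device before returning (similar in
                // spirit to dd oflag=direct/oflag=sync, though it still passes
                // through the page cache).
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return totalBytes / seconds / (1024 * 1024);
    }

    public static void main(String[] args) throws Exception {
        long total = 64L * 1024 * 1024;   // 64 MB is enough to see the gap; shrink it on slow disks
        System.out.printf("4KB blocks: %.1f MB/s%n", writeSync("a.dat", 4 * 1024, total));
        System.out.printf("1MB blocks: %.1f MB/s%n", writeSync("a.dat", 1024 * 1024, total));
    }
}
```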
Linux processes such as cp, scp and dd also rely on the OS caching behavior for most of their I/O, i.e. read-ahead (the number of blocks read ahead each time is controlled by a kernel parameter). For example:
As can be seen, the memory waste is considerable. One option is to run echo 1 > /proc/sys/vm/drop_caches to clear the cache, but this easily throws away content that genuinely needs to stay cached.
Many applications instead provide options and let the user decide. The table below lists how some major applications and commands manage file I/O.
| Application | Default mode (OS-managed or self-managed) | Relevant control parameters |
| --- | --- | --- |
| oracle | managed via the SGA | filesystemio_options |
| mysql | managed via the buffer pool | innodb_flush_method (O_DIRECT), innodb_flush_log_at_trx_commit |
| mongodb | relies on the OS | |
| c | open the file with the O_DIRECT flag and write directly to the device | |
| java | self-managed (except plain file reads/writes); no direct parameter to control it | |
| kafka | written in Scala and runs on the JVM, so essentially self-managed; file I/O relies on the OS | |
| redis | self-managed | maxmemory, plus eviction policies for when it is exceeded |
| zookeeper | similar to kafka | |
Files handled by commands such as cp, scp and tar are usually large and rarely accessed again afterwards, so how do we get direct I/O for them?
Let's try copying with dd:
[root@hs-test-10-20-30-16 ~]# dd if=glibc-2.14.tar.gz of=glibc-2.14.tar.gz.new iflag=direct oflag=direct
^C9270+0 records in
9270+0 records out
4746240 bytes (4.7 MB) copied, 38.6592 s, 123 kB/s
That is far too slow (limited by the fact that dd reads only 512 bytes per transfer by default, each requiring a seek). Change it to 4 MB per transfer:
[root@hs-test-10-20-30-16 ~]# rm -rf glibc-2.14
[root@hs-test-10-20-30-16 ~]# echo 1 > /proc/sys/vm/drop_caches
[root@hs-test-10-20-30-16 ~]# dd if=glibc-2.14.tar.gz of=glibc-2.14.tar.gz.new ibs=4M obs=4M iflag=direct oflag=direct
4+1 records in
4+1 records out
20897040 bytes (21 MB) copied, 0.366115 s, 57.1 MB/s
You have new mail in /var/spool/mail/root

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)  01/12/20  _x86_64_  (16 CPU)

09:14:12  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
09:14:15  19435044 25841136 57.07 1764 204420 28444528 51.24
09:14:18  19435516 25840664 57.07 2052 204376 28444144 51.24
09:14:21  19434392 25841788 57.08 2216 204640 28444184 51.24
09:14:24  19434300 25841880 57.08 2268 204768 28443632 51.23
09:14:27  19435184 25840996 57.07 2520 204892 28443596 51.23
09:14:30  19429620 25846560 57.09 2944 209396 28443532 51.23
09:14:33  19429732 25846448 57.09 3476 209476 28443532 51.23
09:14:36  19430160 25846020 57.09 3672 209724 28443532 51.23

[root@hs-test-10-20-30-16 ~]# tar xzvf glibc-2.14.tar.gz.new | more
glibc-2.14/
glibc-2.14/.gitattributes
glibc-2.14/.gitignore
glibc-2.14/BUGS
glibc-2.14/CANCEL-FCT-WAIVE
glibc-2.14/CANCEL-FILE-WAIVE
glibc-2.14/CONFORMANCE
glibc-2.14/COPYING
So dd can take the place of cp; its only shortcoming is that it cannot copy files in batches.
Besides dd, there is a third-party tool called nocache that makes commands such as cp/scp release the page cache they populated as soon as they finish. For example:
[root@hs-test-10-20-30-16 ~]# cd nocache-1.1
[root@hs-test-10-20-30-16 nocache-1.1]# ll
total 68
-rw-r--r-- 1 root root  1328 Dec 22  2018 COPYING
-rw-r--r-- 1 root root  1453 Dec 22  2018 Makefile
-rw-r--r-- 1 root root  6220 Dec 22  2018 README
lrwxrwxrwx 1 root root     6 Jan 12 11:42 README.md -> README
-rw-r--r-- 1 root root  1146 Dec 22  2018 cachedel.c
-rw-r--r-- 1 root root  2665 Dec 22  2018 cachestats.c
-rw-r--r-- 1 root root   889 Dec 22  2018 fcntl_helpers.c
-rw-r--r-- 1 root root   289 Dec 22  2018 fcntl_helpers.h
drwxr-xr-x 2 root root  4096 Dec 22  2018 man
-rw-r--r-- 1 root root 13962 Dec 22  2018 nocache.c
-rw-r--r-- 1 root root   391 Dec 22  2018 nocache.in
-rw-r--r-- 1 root root  3663 Dec 22  2018 pageinfo.c
-rw-r--r-- 1 root root   323 Dec 22  2018 pageinfo.h
drwxr-xr-x 2 root root  4096 Dec 22  2018 t
[root@hs-test-10-20-30-16 nocache-1.1]# make
cc -Wall -o cachedel cachedel.c
cc -Wall -o cachestats cachestats.c
cc -Wall -fPIC -c -o nocache.o nocache.c
cc -Wall -fPIC -c -o fcntl_helpers.o fcntl_helpers.c
cc -Wall -fPIC -c -o pageinfo.o pageinfo.c
cc -Wall -pthread -shared -Wl,-soname,nocache.so -o nocache.so nocache.o fcntl_helpers.o pageinfo.o -ldl
sed 's!##libdir##!$(dirname "$0")!' <nocache.in >nocache
chmod a+x nocache
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 nocache-1.1]# ll
total 156
-rw-r--r-- 1 root root  1328 Dec 22  2018 COPYING
-rw-r--r-- 1 root root  1453 Dec 22  2018 Makefile
-rw-r--r-- 1 root root  6220 Dec 22  2018 README
lrwxrwxrwx 1 root root     6 Jan 12 11:42 README.md -> README
-rwxr-xr-x 1 root root  8192 Jan 12 11:43 cachedel
-rw-r--r-- 1 root root  1146 Dec 22  2018 cachedel.c
-rwxr-xr-x 1 root root  9948 Jan 12 11:43 cachestats
-rw-r--r-- 1 root root  2665 Dec 22  2018 cachestats.c
-rw-r--r-- 1 root root   889 Dec 22  2018 fcntl_helpers.c
-rw-r--r-- 1 root root   289 Dec 22  2018 fcntl_helpers.h
-rw-r--r-- 1 root root  2352 Jan 12 11:43 fcntl_helpers.o
drwxr-xr-x 2 root root  4096 Dec 22  2018 man
-rwxr-xr-x 1 root root   396 Jan 12 11:43 nocache
-rw-r--r-- 1 root root 13962 Dec 22  2018 nocache.c
-rw-r--r-- 1 root root   391 Dec 22  2018 nocache.in
-rw-r--r-- 1 root root 23064 Jan 12 11:43 nocache.o
-rwxr-xr-x 1 root root 26208 Jan 12 11:43 nocache.so
-rw-r--r-- 1 root root  3663 Dec 22  2018 pageinfo.c
-rw-r--r-- 1 root root   323 Dec 22  2018 pageinfo.h
-rw-r--r-- 1 root root  4472 Jan 12 11:43 pageinfo.o
drwxr-xr-x 2 root root  4096 Dec 22  2018 t
[root@hs-test-10-20-30-16 nocache-1.1]# ./nocache scp ~/a.dat 10.20.30.17:/tmp/
The authenticity of host '10.20.30.17 (10.20.30.17)' can't be established.
RSA key fingerprint is 7a:dc:af:bf:4d:e1:62:cd:b4:53:df:0f:6c:a0:55:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.20.30.17' (RSA) to the list of known hosts.
root@10.20.30.17's password:
a.dat                         100% 4096MB 110.7MB/s   00:37

11:48:04  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
11:48:07  19219744 26056436 57.55 10876 358108 28445004 51.24
11:48:10  19207224 26068956 57.58 10896 359204 28455348 51.26
11:48:13  19207668 26068512 57.58 11080 359236 28455352 51.26
11:48:16  18989400 26286780 58.06 11204 573024 28459352 51.26
11:48:19  18658576 26617604 58.79 11228 903856 28458964 51.26
11:48:22  18329272 26946908 59.52 11288 1231132 28459276 51.26
11:48:25  17987808 27288372 60.27 11312 1573436 28458672 51.26
11:48:28  17643640 27632540 61.03 11328 1917184 28458672 51.26
11:48:31  17304284 27971896 61.78 11464 2255576 28458584 51.26
11:48:34  16966124 28310056 62.53 11544 2593788 28458584 51.26
11:48:37  16623688 28652492 63.28 11552 2935548 28458584 51.26
11:48:40  16287292 28988888 64.03 11568 3273164 28458576 51.26
11:48:43  15952856 29323324 64.77 11840 3606952 28458576 51.26
11:48:46  15621704 29654476 65.50 11960 3936816 28460108 51.26
11:48:49  15309600 29966580 66.19 12016 4247848 28459196 51.26
11:48:52  16223364 29052816 64.17 12064 3333380 28458560 51.26
11:48:55  19219104 26057076 57.55 12092 359880 28442844 51.23
11:48:58  19217836 26058344 57.55 12220 359816 28442864 51.23
As you can see, nocache does not bypass the page cache; it releases the cache populated by the command after it finishes (it intercepts libc calls via LD_PRELOAD and issues posix_fadvise(POSIX_FADV_DONTNEED) when files are closed), which is why the cached figure fluctuates so much.
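The same per-file cache drop can be done from Java. Below is a hedged sketch, assuming JNA 5.x is on the classpath; POSIX_FADV_DONTNEED = 4 is the Linux value, and the reflective read of FileDescriptor.fd is a HotSpot-specific hack (on JDK 9+ it needs --add-opens java.base/java.io=ALL-UNNAMED):

```java
import com.sun.jna.Library;
import com.sun.jna.Native;

import java.io.FileDescriptor;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;

public class DropFileCache {
    // Thin JNA binding to the libc call that nocache relies on.
    public interface CLib extends Library {
        CLib C = Native.load("c", CLib.class);
        int posix_fadvise(int fd, long offset, long len, int advice);
    }

    private static final int POSIX_FADV_DONTNEED = 4; // Linux value

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw")) {
            // Extract the raw fd from the FileDescriptor (non-portable hack).
            FileDescriptor fdObj = raf.getFD();
            Field fdField = FileDescriptor.class.getDeclaredField("fd");
            fdField.setAccessible(true);
            int fd = fdField.getInt(fdObj);

            // ... read or write the file as usual here ...

            // Dirty pages must be written back first, or the kernel keeps them.
            fdObj.sync();
            // Tell the kernel we will not need these pages again (offset 0, len 0 = whole file).
            int rc = CLib.C.posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
            System.out.println("posix_fadvise returned " + rc);
        }
    }
}
```

This drops only the pages of the file in question, unlike echo 1 > /proc/sys/vm/drop_caches, which flushes everything.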
Besides the command-line tools, you can also use the cgroup memory controller (memcg) to cap a user's memory usage. However, since the page cache is globally shared (a page may have been brought in by another group), memcg can only guarantee that rss + pagecache stays within the quota; it cannot limit each group's page cache separately, it simply drops page cache when memory runs short (group-level statistics are available, see https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt). So for precise isolation, these commands need to be placed in a dedicated cgroup.
[root@hs-test-10-20-30-16 ~]# yum install libcgroup
[root@hs-test-10-20-30-16 ~]# service cgconfig start
Starting cgconfig service:                                 [  OK  ]
[root@hs-test-10-20-30-16 ~]# cgcreate -a uft_trade_mysql -gmemory:memcg

#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/<controller>

mount {
    cpuset  = /cgroup/cpuset;
    cpu     = /cgroup/cpu;
    cpuacct = /cgroup/cpuacct;
    memory  = /cgroup/memory;
    devices = /cgroup/devices;
    freezer = /cgroup/freezer;
    net_cls = /cgroup/net_cls;
    blkio   = /cgroup/blkio;
}

[root@hs-test-10-20-30-16 fs]# cat /cgroup/memory/
cgroup.event_control  memory.limit_in_bytes            memory.memsw.usage_in_bytes      memory.swappiness      tasks
cgroup.procs          memory.max_usage_in_bytes        memory.move_charge_at_immigrate  memory.usage_in_bytes
memcg/                memory.memsw.failcnt             memory.oom_control               memory.use_hierarchy
memory.failcnt        memory.memsw.limit_in_bytes      memory.soft_limit_in_bytes       notify_on_release
memory.force_empty    memory.memsw.max_usage_in_bytes  memory.stat                      release_agent
[root@hs-test-10-20-30-16 fs]# cd /cgroup/memory/memcg/
[root@hs-test-10-20-30-16 memcg]# cat memory.limit_in_bytes
9223372036854775807
[root@hs-test-10-20-30-16 memcg]# echo 1048576 > memory.limit_in_bytes
[root@hs-test-10-20-30-16 memcg]# echo $$ > tasks    # add the current shell to the restricted group
[root@hs-test-10-20-30-16 tmp]# cp a.dat.new ~/a.dat    # with a 1 MB limit, memory runs out and the copy is OOM-killed
Killed
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 11534334 pages RAM
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 215289 pages reserved
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 960676 pages shared
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 2365211 pages non-shared
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: [15199] 0 15199 2894 459 2 0 0 bash
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: [22192] 0 22192 28414 175 6 0 0 cp
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: Memory cgroup out of memory: Kill process 15199 (bash) score 1 or sacrifice child
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: Killed process 22192, UID 0, (cp) total-vm:113656kB, anon-rss:44kB, file-rss:656kB
[root@hs-test-10-20-30-16 ~]# cd /cgroup/memory/memcg/
[root@hs-test-10-20-30-16 memcg]# echo 104857600 > memory.limit_in_bytes
[root@hs-test-10-20-30-16 memcg]# cat tasks
15199
22651
[root@hs-test-10-20-30-16 memcg]# echo $$ > tasks
[root@hs-test-10-20-30-16 memcg]# cp /tmp/a.dat.new ~/a.dat    # finishes in about 66 seconds; about 56 seconds without the limit
cp: overwrite `/root/a.dat'? y
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 memcg]# echo 10485760 > memory.limit_in_bytes
[root@hs-test-10-20-30-16 memcg]# cp /tmp/a.dat.new ~/a.dat    # finishes in about 84 seconds
cp: overwrite `/root/a.dat'? y
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 memcg]# cd ../
08:27:50 PM  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
08:29:14 PM  18944964 26331216 58.16 86284 593212 28445048 51.24
08:29:17 PM  18945072 26331108 58.16 86412 593208 28443652 51.23
08:29:20 PM  18944588 26331592 58.16 86444 593284 28444056 51.24
08:29:23 PM  18866972 26409208 58.33 86484 670180 28443420 51.23
08:29:26 PM  18840408 26435772 58.39 86484 694804 28443504 51.23
08:29:29 PM  18837164 26439016 58.39 86496 694528 28443488 51.23
08:29:32 PM  18837288 26438892 58.39 86792 694452 28444000 51.24
08:29:35 PM  18837964 26438216 58.39 86804 694560 28444000 51.24
08:29:38 PM  18836368 26439812 58.40 86820 694600 28444804 51.24
08:29:41 PM  18836168 26440012 58.40 86864 694736 28443968 51.24
08:29:44 PM  18834656 26441524 58.40 87024 694700 28455748 51.26
08:29:47 PM  18834260 26441920 58.40 87176 695292 28454864 51.26
08:29:50 PM  18833284 26442896 58.40 87196 695076 28454864 51.26

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)  01/12/2020  _x86_64_  (16 CPU)

08:31:15 PM  kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
08:31:18 PM  18920476 26355704 58.21 89628 604828 28445160 51.24
08:31:21 PM  18920328 26355852 58.21 89656 604948 28445160 51.24
08:31:24 PM  18918732 26357448 58.21 89688 605080 28444520 51.24
From the above, a cgroup, like dd, can keep the page cache from growing without bound (the cache is only used transiently). But the cgroup must be given some minimum headroom beyond its rss and anonymous memory, otherwise OOM kills happen easily.
Direct I/O in nginx
增加"directio 512k"指令后,nginx对于静态资源就会走直接I/O模式了,也只有增加了该指令才是真正的直接I/O。sendfile(它调用的是系统调用sendfile,但是sendfile默认情况下并不走NOCACHE_IO模式,所以仍然会走页面缓存,参见https://man.openbsd.org/FreeBSD-11.1/sendfile.2#SF_NOCACHE) on。所有要开启直接I/O,需要增加下列指令:
aio threads;
directio 512k;
output_buffers 1 8m;   # number and size of the buffers used for reading a response from disk; can affect performance by up to 10%
The performance difference is as follows:
With output_buffers set to 256k:
Whether the aio threads directive is enabled makes little difference to performance.
With the directio directive turned off:
Even though sendfile on; is enabled, the data still passes through the page cache.
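For reference, sendfile is also what Java's FileChannel.transferTo maps to on Linux when the target is a socket, and it behaves the same way: no copy through user space, but the data still flows through the page cache. A minimal sketch, with a hypothetical host/port:

```java
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SendfileDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; replace with a real host and port.
        try (SocketChannel sock = SocketChannel.open(new InetSocketAddress("10.20.30.17", 9000));
             FileChannel file = FileChannel.open(Paths.get("a.dat"), StandardOpenOption.READ)) {
            long pos = 0, size = file.size();
            // transferTo lets the kernel move the data (sendfile on Linux),
            // avoiding copies into user space -- but, just like nginx's
            // `sendfile on`, the file data still goes through the page cache.
            while (pos < size) {
                pos += file.transferTo(pos, size - pos, sock);
            }
        }
    }
}
```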
For file reads and writes, curl's --no-buffer option does not seem to take effect, so to handle files with a low memory footprint you have to implement it yourself with Java NIO or C direct I/O.
Implementing direct I/O in Java
Although Java provides a large number of I/O APIs, most of the higher-level ones go through the file system cache. Working at the lower level requires familiarity with NIO, whose core classes are DirectBuffer and ByteBuffer; see java nio深入理解之MMAP与ByteBuffer、DirectBuffer. There is a wrapper library on GitHub, https://github.com/lexburner/kdio/blob/master/README.md (note: it is built on log4j, so you need to download the source and switch it to slf4j yourself), which builds on this plus JNA, because plain Java (JDK 8) does not expose the O_DIRECT flag; the downside is that writing large files either occupies the page cache or suffers in performance, since only part of the file can be mapped at a time (see java nio深入理解之通道Channel; O_SYNC can be triggered manually). Some file I/O optimization practices can be found at https://www.tuicool.com/articles/eEnIny6.
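As a complement to the JNA-based approach above: since JDK 10 the JDK itself exposes O_DIRECT through com.sun.nio.file.ExtendedOpenOption.DIRECT (the statement that Java does not support O_DIRECT applies to JDK 8 and earlier). A minimal sketch assuming JDK 10+; with O_DIRECT the buffer address, file offset and transfer size must all be aligned to the block size:

```java
import com.sun.nio.file.ExtendedOpenOption;

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DirectWrite {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("a.dat");
        // O_DIRECT requires block-aligned offsets, sizes and buffer addresses.
        int block = (int) Files.getFileStore(path.toAbsolutePath().getParent()).getBlockSize();

        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                ExtendedOpenOption.DIRECT)) {
            // Over-allocate, then take a block-aligned slice of the direct buffer.
            ByteBuffer buf = ByteBuffer.allocateDirect(block * 2).alignedSlice(block);
            buf.limit(block);               // write exactly one block
            while (buf.hasRemaining()) {
                buf.put((byte) 0);
            }
            buf.flip();
            ch.write(buf);                  // bypasses the page cache
        }
    }
}
```

Running it and watching sar -r should show the kbcached column stay flat, just like the dd oflag=direct tests earlier in the article.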
https://www.ibm.com/developerworks/cn/linux/l-cn-directio/ (Linux下直接I/O的设计与实现, "The design and implementation of direct I/O on Linux")