Note: in Java, when we talk about NIO/IO we usually mean network I/O. This article, however, focuses on file I/O; at bottom the two are not fundamentally different. For a discussion of Java NIO, see "java nio深入理解之MMAP与ByteBuffer、DirectBuffer".

  In day-to-day business development, most of us rarely need to care much about the I/O machinery. But once large amounts of file processing or I/O-heavy work are involved, a deep understanding of how I/O works starts to pay off. Conversely, without enough volume and real scenarios, it is hard to truly appreciate the reasoning behind the design. So this article works from practice outward.

  The figure below clearly illustrates the control flow and data flow when an application initiates an I/O operation:

[Figure: control flow and data flow of an application-initiated I/O operation]


Introduction to direct I/O

  By default, Linux uses memory that processes are not using for the page cache, because compared with reading and writing the disk directly, memory is several orders of magnitude faster, even with SSDs. For example:

06:50:01    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
07:00:01      1002144  44274036     97.79    486416  17717204  28436468     51.22
07:10:01      1008924  44267256     97.77    486420  17717864  28436816     51.22
07:20:01      1007176  44269004     97.78    486420  17718524  28437600     51.22
07:30:01      1000732  44275448     97.79    486424  17719140  28438832     51.23
Average:      1022556  44253624     97.74    486373  17696996  28433150     51.22

  Although the page cache can greatly improve performance, it is not without drawbacks, especially for applications that read or write data only once, or that cache data themselves, such as Oracle/MySQL. If the OS caches everything as well, the application can easily run short of memory (in particular when the application cannot lock memory at startup, or when that is not feasible in a multi-application environment).

  Let's first look at the performance difference between direct I/O and buffered (cached) I/O, using dd as the test tool.

# initial page-cache usage
[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16) 	01/12/20 	_x86_64_	(16 CPU)

07:35:16    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
07:35:19     19432976  25843204     57.08      7968    215188  28435792     51.22
07:35:22     19432744  25843436     57.08      7984    215380  28436616     51.22

  Buffered (cached) mode, write test.

[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=./a.dat bs=4k count=1M
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 4.45434 s, 964 MB/s
####### usage after the write completes: cached grows by about 4.2 GB, essentially matching the dd result, meaning almost the whole file was cached by the OS
07:36:07     19427920  25848260     57.09      9332    216000  28435792     51.22
07:36:10     19249908  26026272     57.48      9612    392076  28436136     51.22
07:36:13     16245884  29030296     64.12      9776   3312448  28436136     51.22
07:36:16     15115508  30160672     66.61      9948   4410488  28436828     51.22
07:36:19     15115916  30160264     66.61     10000   4410560  28435804     51.22

  Buffered (cached) mode, read test.

  Direct I/O mode, write test.

[root@hs-test-10-20-30-16 ~]#  echo 1 > /proc/sys/vm/drop_caches
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 ~]# rm -rf a.dat 
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=./a.dat bs=4k count=1M oflag=direct    # non-SSD disk: way too slow...
^C33848+0 records in
33848+0 records out
138641408 bytes (139 MB) copied, 141.044 s, 983 kB/s
######### the page cache barely grows, but too little data was written to tell; the test is repeated on an SSD below...
[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16) 	01/12/20 	_x86_64_	(16 CPU)

07:37:59    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
07:38:02     19445272  25830908     57.05      3496    209396  28437716     51.22
07:38:05     19445692  25830488     57.05      3528    209400  28437716     51.22
07:38:08     19446224  25829956     57.05      3588    209496  28435980     51.22
07:38:11     19445984  25830196     57.05      3692    209436  28436016     51.22
07:38:14     19445280  25830900     57.05      3764    209624  28437556     51.22
07:38:17     19445384  25830796     57.05      3964    209648  28436008     51.22
07:38:20     19445856  25830324     57.05      3988    209680  28436008     51.22
07:38:23     19446144  25830036     57.05      3996    209852  28436008     51.22
07:38:26     19420724  25855456     57.11      4052    211500  28567488     51.46
07:38:29     19442600  25833580     57.06      4172    213200  28435996     51.22
07:38:32     19442276  25833904     57.06      4332    213292  28436632     51.22
07:38:35     19442784  25833396     57.06      4356    213304  28436632     51.22
07:38:38     19442564  25833616     57.06      4420    213336  28436020     51.22
07:38:41     19439588  25836592     57.06      5268    213620  28436576     51.22
07:38:44     19439272  25836908     57.07      5428    213720  28439096     51.23
07:38:47     19439028  25837152     57.07      5644    213792  28438580     51.23
07:38:50     19439276  25836904     57.07      5700    213820  28437556     51.22
07:38:53     19439324  25836856     57.07      5708    214004  28437560     51.22
07:38:56     19439328  25836852     57.06      5748    214036  28437552     51.22
07:38:59     19439336  25836844     57.06      5788    214036  28437552     51.22
07:39:02     19436836  25839344     57.07      5968    214064  28439620     51.23
07:39:05     19437088  25839092     57.07      5992    214072  28439620     51.23
07:39:08     19438556  25837624     57.07      6060    214104  28437588     51.22
07:39:11     19439016  25837164     57.07      6164    214044  28437588     51.22
07:39:14     19439084  25837096     57.07      6228    214200  28439108     51.23
07:39:17     19439704  25836476     57.06      6420    214220  28437584     51.22
07:39:20     19439792  25836388     57.06      6428    214256  28437584     51.22
07:39:23     19439912  25836268     57.06      6436    214424  28437564     51.22
07:39:26     19440320  25835860     57.06      6492    214440  28437556     51.22
07:39:29     19439436  25836744     57.06      6724    214408  28437536     51.22
07:39:32     19439136  25837044     57.07      6868    214484  28438172     51.23
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=/storage/backupfile/a.dat bs=4k count=1M oflag=direct    # same test on an SSD
^C58586+0 records in
58586+0 records out
239968256 bytes (240 MB) copied, 41.796 s, 5.7 MB/s

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16) 	01/12/20 	_x86_64_	(16 CPU)

07:40:49    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
07:40:52     19434196  25841984     57.08      9524    216276  28438172     51.23
07:40:55     19435088  25841092     57.07      9548    216260  28437328     51.22
07:40:58     19434588  25841592     57.08      9644    216340  28437668     51.22
07:41:01     19433780  25842400     57.08      9784    216356  28438304     51.23
07:41:04     19430940  25845240     57.08      9840    216392  28439812     51.23
07:41:07     19431788  25844392     57.08      9904    216416  28439192     51.23
07:41:10     19432324  25843856     57.08      9932    216452  28437684     51.22
07:41:13     19432448  25843732     57.08     10060    216440  28437672     51.22
07:41:16     19432292  25843888     57.08     10232    216504  28438692     51.23
07:41:19     19431964  25844216     57.08     10288    216524  28437668     51.22
07:41:22     19431932  25844248     57.08     10316    216696  28438504     51.23
07:41:25     19432580  25843600     57.08     10324    216708  28437668     51.22
07:41:28     19430640  25845540     57.08     10396    216740  28437636     51.22
07:41:31     19430316  25845864     57.08     10572    216752  28438272     51.23
07:41:34     19430592  25845588     57.08     10604    216772  28438272     51.23
07:41:37     19430408  25845772     57.08     10668    216796  28437628     51.22
07:41:40     19430940  25845240     57.08     10804    216816  28437320     51.22
07:41:43     19431108  25845072     57.08     10972    216824  28437336     51.22
07:41:46     19430932  25845248     57.08     11148    216844  28438348     51.23
07:41:49     19430864  25845316     57.08     11316    216908  28437328     51.22

  The numbers above show that, without the cache, direct I/O with small blocks (4 KB/8 KB) performs very poorly. With large blocks (sequential I/O) it is much faster. For example:

[root@hs-test-10-20-30-16 ~]#  echo 1 > /proc/sys/vm/drop_caches
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=./a.dat bs=1M count=4K oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 39.981 s, 107 MB/s
[root@hs-test-10-20-30-16 ~]# dd if=/dev/zero of=/storage/backupfile/a.dat bs=1M count=4K oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 20.3327 s, 211 MB/s

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16) 	01/12/20 	_x86_64_	(16 CPU)

08:22:45    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
08:22:48     19435768  25840412     57.07      4732    213608  28438956     51.23
08:22:51     19435788  25840392     57.07      4772    213816  28439804     51.23
08:22:54     19435844  25840336     57.07      4788    213840  28438968     51.23
08:22:57     19437232  25838948     57.07      4864    213860  28437596     51.22

  Kafka's well-known pitch that sequential disk I/O can beat memory rests on exactly this principle.
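  At the system-call level, dd's oflag=direct simply opens the output file with the O_DIRECT flag. The user buffer, file offset and transfer size then all have to be aligned to the device's logical block size, and every request pays the full device round trip, which is why the larger block size above is so much faster. Below is a minimal C sketch of that pattern; the file name, the 4 KiB alignment and the block counts are illustrative assumptions, not the exact parameters of the tests above.

/* Minimal sketch of what dd's oflag=direct does: open with O_DIRECT and
 * write from a block-aligned buffer. Dropping bs to 4096 reproduces the
 * slow small-block case from the dd test. */
#define _GNU_SOURCE          /* needed for O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096;            /* assume a 4 KiB logical block size */
    const size_t bs    = 1024 * 1024;     /* 1 MiB per write */
    const size_t count = 64;              /* 64 MiB total */

    int fd = open("a.dat", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, align, bs) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, bs);                   /* like reading from /dev/zero */

    for (size_t i = 0; i < count; i++) {
        /* buffer address, file offset and length must all be block-aligned,
         * otherwise the kernel rejects the write with EINVAL */
        if (write(fd, buf, bs) != (ssize_t)bs) { perror("write"); return 1; }
    }
    close(fd);
    free(buf);
    return 0;
}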

  Processes such as cp, scp and dd on Linux also rely on the OS caching behaviour for most of their I/O, i.e. read-ahead (the number of blocks read ahead each time is controlled by a tunable parameter). As shown below:

[Figure: page-cache growth while copying files with cp/scp]

  As you can see, the memory waste is considerable. One remedy is to run echo 1 > /proc/sys/vm/drop_caches to clear the cache, but that easily evicts content that genuinely benefits from being cached.

  Many applications therefore expose options so the user can decide. The table below lists how some major applications and commands manage file I/O.

Application   Default mode (relies on the OS or manages its own cache)                                    Control parameter
oracle        caching managed through the SGA                                                             filesystemio_options
mysql         caching managed through the InnoDB buffer pool                                              innodb_flush_method (e.g. O_DIRECT)
mongodb       relies on the OS
c             open the file with the O_DIRECT flag to write straight to the device
java          manages memory itself (file reads/writes excepted); no direct control parameter
kafka         written in Scala and runs on the JVM, so it essentially manages its own heap; file I/O relies on the OS
redis         self-managed                                                                                maxmemory, plus eviction policies once it is exceeded
zookeeper     similar to kafka

   The files handled by commands like cp, scp and tar are usually large and rarely accessed afterwards, so how do we get direct I/O for them?

  Let's try copying with dd instead:

[root@hs-test-10-20-30-16 ~]# dd if=glibc-2.14.tar.gz of=glibc-2.14.tar.gz.new iflag=direct oflag=direct   
^C9270+0 records in
9270+0 records out
4746240 bytes (4.7 MB) copied, 38.6592 s, 123 kB/s

  That is far too slow (constrained by having to position for every request and, by default, reading only 512 bytes at a time). Change the block size to 4 MB:

[root@hs-test-10-20-30-16 ~]# rm -rf glibc-2.14
[root@hs-test-10-20-30-16 ~]#  echo 1 > /proc/sys/vm/drop_caches
[root@hs-test-10-20-30-16 ~]# dd if=glibc-2.14.tar.gz of=glibc-2.14.tar.gz.new ibs=4M obs=4M iflag=direct oflag=direct
4+1 records in
4+1 records out
20897040 bytes (21 MB) copied, 0.366115 s, 57.1 MB/s
You have new mail in /var/spool/mail/root

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16) 	01/12/20 	_x86_64_	(16 CPU)

09:14:12    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
09:14:15     19435044  25841136     57.07      1764    204420  28444528     51.24
09:14:18     19435516  25840664     57.07      2052    204376  28444144     51.24
09:14:21     19434392  25841788     57.08      2216    204640  28444184     51.24
09:14:24     19434300  25841880     57.08      2268    204768  28443632     51.23
09:14:27     19435184  25840996     57.07      2520    204892  28443596     51.23
09:14:30     19429620  25846560     57.09      2944    209396  28443532     51.23
09:14:33     19429732  25846448     57.09      3476    209476  28443532     51.23
09:14:36     19430160  25846020     57.09      3672    209724  28443532     51.23


[root@hs-test-10-20-30-16 ~]# tar xzvf glibc-2.14.tar.gz.new | more
glibc-2.14/
glibc-2.14/.gitattributes
glibc-2.14/.gitignore
glibc-2.14/BUGS
glibc-2.14/CANCEL-FCT-WAIVE
glibc-2.14/CANCEL-FILE-WAIVE
glibc-2.14/CONFORMANCE
glibc-2.14/COPYING

  So dd can substitute for cp in this case; its only shortcoming is that it cannot copy files in batches.

     Besides dd, a developer has written a nocache command that makes cp/scp and similar commands release the page cache right after they finish. For example:

[root@hs-test-10-20-30-16 ~]# cd nocache-1.1
[root@hs-test-10-20-30-16 nocache-1.1]# ll
total 68
-rw-r--r-- 1 root root  1328 Dec 22  2018 COPYING
-rw-r--r-- 1 root root  1453 Dec 22  2018 Makefile
-rw-r--r-- 1 root root  6220 Dec 22  2018 README
lrwxrwxrwx 1 root root     6 Jan 12 11:42 README.md -> README
-rw-r--r-- 1 root root  1146 Dec 22  2018 cachedel.c
-rw-r--r-- 1 root root  2665 Dec 22  2018 cachestats.c
-rw-r--r-- 1 root root   889 Dec 22  2018 fcntl_helpers.c
-rw-r--r-- 1 root root   289 Dec 22  2018 fcntl_helpers.h
drwxr-xr-x 2 root root  4096 Dec 22  2018 man
-rw-r--r-- 1 root root 13962 Dec 22  2018 nocache.c
-rw-r--r-- 1 root root   391 Dec 22  2018 nocache.in
-rw-r--r-- 1 root root  3663 Dec 22  2018 pageinfo.c
-rw-r--r-- 1 root root   323 Dec 22  2018 pageinfo.h
drwxr-xr-x 2 root root  4096 Dec 22  2018 t
[root@hs-test-10-20-30-16 nocache-1.1]# make
cc -Wall   -o cachedel cachedel.c
cc -Wall   -o cachestats cachestats.c
cc -Wall   -fPIC -c -o nocache.o nocache.c
cc -Wall   -fPIC -c -o fcntl_helpers.o fcntl_helpers.c
cc -Wall   -fPIC -c -o pageinfo.o pageinfo.c
cc -Wall   -pthread -shared -Wl,-soname,nocache.so -o nocache.so nocache.o fcntl_helpers.o pageinfo.o -ldl
sed 's!##libdir##!$(dirname "$0")!' <nocache.in >nocache
chmod a+x nocache
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 nocache-1.1]# ll
total 156
-rw-r--r-- 1 root root  1328 Dec 22  2018 COPYING
-rw-r--r-- 1 root root  1453 Dec 22  2018 Makefile
-rw-r--r-- 1 root root  6220 Dec 22  2018 README
lrwxrwxrwx 1 root root     6 Jan 12 11:42 README.md -> README
-rwxr-xr-x 1 root root  8192 Jan 12 11:43 cachedel
-rw-r--r-- 1 root root  1146 Dec 22  2018 cachedel.c
-rwxr-xr-x 1 root root  9948 Jan 12 11:43 cachestats
-rw-r--r-- 1 root root  2665 Dec 22  2018 cachestats.c
-rw-r--r-- 1 root root   889 Dec 22  2018 fcntl_helpers.c
-rw-r--r-- 1 root root   289 Dec 22  2018 fcntl_helpers.h
-rw-r--r-- 1 root root  2352 Jan 12 11:43 fcntl_helpers.o
drwxr-xr-x 2 root root  4096 Dec 22  2018 man
-rwxr-xr-x 1 root root   396 Jan 12 11:43 nocache
-rw-r--r-- 1 root root 13962 Dec 22  2018 nocache.c
-rw-r--r-- 1 root root   391 Dec 22  2018 nocache.in
-rw-r--r-- 1 root root 23064 Jan 12 11:43 nocache.o
-rwxr-xr-x 1 root root 26208 Jan 12 11:43 nocache.so
-rw-r--r-- 1 root root  3663 Dec 22  2018 pageinfo.c
-rw-r--r-- 1 root root   323 Dec 22  2018 pageinfo.h
-rw-r--r-- 1 root root  4472 Jan 12 11:43 pageinfo.o
drwxr-xr-x 2 root root  4096 Dec 22  2018 t
[root@hs-test-10-20-30-16 nocache-1.1]# ./nocache scp ~/a.dat 10.20.30.17:/tmp/
The authenticity of host '10.20.30.17 (10.20.30.17)' can't be established.
RSA key fingerprint is 7a:dc:af:bf:4d:e1:62:cd:b4:53:df:0f:6c:a0:55:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.20.30.17' (RSA) to the list of known hosts.
root@10.20.30.17's password: 
a.dat                                                                                                                                100% 4096MB 110.7MB/s   00:37    

11:48:04    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
11:48:07     19219744  26056436     57.55     10876    358108  28445004     51.24
11:48:10     19207224  26068956     57.58     10896    359204  28455348     51.26
11:48:13     19207668  26068512     57.58     11080    359236  28455352     51.26
11:48:16     18989400  26286780     58.06     11204    573024  28459352     51.26
11:48:19     18658576  26617604     58.79     11228    903856  28458964     51.26
11:48:22     18329272  26946908     59.52     11288   1231132  28459276     51.26
11:48:25     17987808  27288372     60.27     11312   1573436  28458672     51.26
11:48:28     17643640  27632540     61.03     11328   1917184  28458672     51.26
11:48:31     17304284  27971896     61.78     11464   2255576  28458584     51.26
11:48:34     16966124  28310056     62.53     11544   2593788  28458584     51.26
11:48:37     16623688  28652492     63.28     11552   2935548  28458584     51.26
11:48:40     16287292  28988888     64.03     11568   3273164  28458576     51.26
11:48:43     15952856  29323324     64.77     11840   3606952  28458576     51.26
11:48:46     15621704  29654476     65.50     11960   3936816  28460108     51.26
11:48:49     15309600  29966580     66.19     12016   4247848  28459196     51.26
11:48:52     16223364  29052816     64.17     12064   3333380  28458560     51.26
11:48:55     19219104  26057076     57.55     12092    359880  28442844     51.23
11:48:58     19217836  26058344     57.55     12220    359816  28442864     51.23

  As you can see, nocache does not bypass the page cache; instead, it releases the cache that the wrapped command populated once the command completes, which is why the cached figure swings up and then back down.
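  Internally, nocache is built as a preloaded shared library (the nocache.so produced by the make run above) and, roughly speaking, asks the kernel to discard the pages a file occupied once the command is done with it, via posix_fadvise(POSIX_FADV_DONTNEED); the cachedel helper applies the same advice to a single file. A minimal sketch of that idea (the default file name is illustrative):

/* Sketch of the cache-dropping idea behind nocache/cachedel:
 * flush dirty pages, then advise the kernel that the file's
 * cached pages are no longer needed. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "a.dat";   /* illustrative default */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* flush dirty pages first: DONTNEED only discards clean pages */
    fdatasync(fd);

    /* offset = 0, len = 0 means "the whole file" */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    return 0;
}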

  Besides the command-line approaches, you can also use a cgroup (memcg) to cap a user's memory usage. However, because the page cache is globally shared (a page may have been brought in by another group), memcg can only guarantee that rss + pagecache stays within the group's quota; it cannot cap each group's page-cache usage separately. When the group runs short of memory, its page-cache pages are simply dropped. Group-level statistics are available (https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt), so for precise isolation you have to put these commands into a dedicated cgroup.

[root@hs-test-10-20-30-16 ~]# yum install libcgroup
[root@hs-test-10-20-30-16 ~]# service cgconfig start
Starting cgconfig service:                                 [  OK  ]
[root@hs-test-10-20-30-16 ~]# cgcreate -a uft_trade_mysql -gmemory:memcg
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/<controller>

mount {
    cpuset    = /cgroup/cpuset;
    cpu    = /cgroup/cpu;
    cpuacct    = /cgroup/cpuacct;
    memory    = /cgroup/memory;
    devices    = /cgroup/devices;
    freezer    = /cgroup/freezer;
    net_cls    = /cgroup/net_cls;
    blkio    = /cgroup/blkio;
}

[root@hs-test-10-20-30-16 fs]# cat /cgroup/memory/    
cgroup.event_control             memory.limit_in_bytes            memory.memsw.usage_in_bytes      memory.swappiness                tasks
cgroup.procs                     memory.max_usage_in_bytes        memory.move_charge_at_immigrate  memory.usage_in_bytes            
memcg/                           memory.memsw.failcnt             memory.oom_control               memory.use_hierarchy             
memory.failcnt                   memory.memsw.limit_in_bytes      memory.soft_limit_in_bytes       notify_on_release                
memory.force_empty               memory.memsw.max_usage_in_bytes  memory.stat                      release_agent                    
[root@hs-test-10-20-30-16 fs]# cd /cgroup/memory/memcg/
[root@hs-test-10-20-30-16 memcg]# cat memory.limit_in_bytes 
9223372036854775807
[root@hs-test-10-20-30-16 memcg]# echo 1048576 > memory.limit_in_bytes 
[root@hs-test-10-20-30-16 memcg]# echo $$ > tasks  # add the current shell to the limited group
[root@hs-test-10-20-30-16 tmp]# cp a.dat.new ~/a.dat  # with a 1 MB limit the copy runs out of memory and is OOM-killed
Killed

Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 11534334 pages RAM
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 215289 pages reserved
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 960676 pages shared
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: 2365211 pages non-shared
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: [15199]     0 15199     2894      459   2       0             0 bash
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: [22192]     0 22192    28414      175   6       0             0 cp
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: Memory cgroup out of memory: Kill process 15199 (bash) score 1 or sacrifice child
Jan 12 20:26:23 hs-test-10-20-30-16 kernel: Killed process 22192, UID 0, (cp) total-vm:113656kB, anon-rss:44kB, file-rss:656kB

[root@hs-test-10-20-30-16 ~]# cd /cgroup/memory/memcg/
[root@hs-test-10-20-30-16 memcg]# echo 104857600 > memory.limit_in_bytes 
[root@hs-test-10-20-30-16 memcg]# cat tasks 
15199
22651
[root@hs-test-10-20-30-16 memcg]# echo $$ > tasks 
[root@hs-test-10-20-30-16 memcg]# cp /tmp/a.dat.new ~/a.dat   # takes about 66 s; about 56 s without the limit
cp: overwrite `/root/a.dat'? y
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 memcg]# echo 10485760 > memory.limit_in_bytes 
[root@hs-test-10-20-30-16 memcg]# cp /tmp/a.dat.new ~/a.dat    # takes about 84 s
cp: overwrite `/root/a.dat'? y
You have new mail in /var/spool/mail/root
[root@hs-test-10-20-30-16 memcg]# cd ../
08:27:50 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
08:29:14 PM  18944964  26331216     58.16     86284    593212  28445048     51.24
08:29:17 PM  18945072  26331108     58.16     86412    593208  28443652     51.23
08:29:20 PM  18944588  26331592     58.16     86444    593284  28444056     51.24
08:29:23 PM  18866972  26409208     58.33     86484    670180  28443420     51.23
08:29:26 PM  18840408  26435772     58.39     86484    694804  28443504     51.23
08:29:29 PM  18837164  26439016     58.39     86496    694528  28443488     51.23
08:29:32 PM  18837288  26438892     58.39     86792    694452  28444000     51.24
08:29:35 PM  18837964  26438216     58.39     86804    694560  28444000     51.24
08:29:38 PM  18836368  26439812     58.40     86820    694600  28444804     51.24
08:29:41 PM  18836168  26440012     58.40     86864    694736  28443968     51.24
08:29:44 PM  18834656  26441524     58.40     87024    694700  28455748     51.26
08:29:47 PM  18834260  26441920     58.40     87176    695292  28454864     51.26
08:29:50 PM  18833284  26442896     58.40     87196    695076  28454864     51.26

[root@hs-test-10-20-30-16 ~]# sar -r 3 10000
Linux 2.6.32-431.el6.x86_64 (hs-test-10-20-30-16)     01/12/2020     _x86_64_    (16 CPU)

08:31:15 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
08:31:18 PM  18920476  26355704     58.21     89628    604828  28445160     51.24
08:31:21 PM  18920328  26355852     58.21     89656    604948  28445160     51.24
08:31:24 PM  18918732  26357448     58.21     89688    605080  28444520     51.24

  From the above, a cgroup, like dd, can keep a copy from consuming extra page cache (even when the cache is only used as a staging area), but the cgroup must be left a certain minimum of headroom beyond its rss/anonymous memory, otherwise OOM kills happen easily.

Direct I/O in nginx

   After adding the "directio 512k" directive, nginx switches to direct I/O for static resources larger than that threshold; only with this directive do you get true direct I/O. "sendfile on" uses the sendfile system call, but by default sendfile does not use the SF_NOCACHE mode, so it still goes through the page cache (see https://man.openbsd.org/FreeBSD-11.1/sendfile.2#SF_NOCACHE). So to enable direct I/O, add the following directives:

    aio threads;
    directio 512k;
    output_buffers 1 8m;   # number and size of the buffers used to read the response from disk; can affect performance by up to 10%

   The performance difference looks like this:

[Figures: benchmark results with the directio configuration above]

   With output_buffers set to 256k:

[Figure: benchmark results with output_buffers 256k]

   Whether the aio threads directive is enabled or not makes little difference to performance.

  With the directio directive turned off:

[Figure: benchmark results with directio disabled]

 

   Even though sendfile on is enabled, the data still passes through the page cache.

[Figure: page-cache usage with sendfile on but directio off]

   For file transfers, curl's --no-buffer option does not seem to help here (it only affects curl's own output buffering), so to handle files with a low memory footprint you have to implement it yourself with Java NIO or C direct I/O, for example as sketched below.
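  In C, besides opening files with O_DIRECT, another common low-footprint pattern is a "drop-behind" copy: read and write through the normal buffered path, but periodically flush and drop the ranges that have already been processed, so the page cache never holds more than a small window of the file. The sketch below illustrates this idea; file names, buffer size and window size are illustrative assumptions.

/* "Drop-behind" buffered copy: keep page-cache usage bounded by
 * flushing and dropping already-copied ranges as we go. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t bs     = 1 << 20;          /* 1 MiB read/write buffer */
    const off_t  window = 8 << 20;          /* drop cache every 8 MiB copied */

    int in  = open("a.dat", O_RDONLY);
    int out = open("a.dat.copy", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    char *buf = malloc(bs);
    if (!buf) { perror("malloc"); return 1; }

    off_t done = 0, last_drop = 0;
    ssize_t n;

    while ((n = read(in, buf, bs)) > 0) {
        if (write(out, buf, n) != n) { perror("write"); return 1; }
        done += n;

        if (done - last_drop >= window) {
            /* flush the written pages, then tell the kernel the source and
             * destination ranges we just processed will not be needed again */
            fdatasync(out);
            posix_fadvise(in,  last_drop, done - last_drop, POSIX_FADV_DONTNEED);
            posix_fadvise(out, last_drop, done - last_drop, POSIX_FADV_DONTNEED);
            last_drop = done;
        }
    }

    fdatasync(out);
    posix_fadvise(in,  0, 0, POSIX_FADV_DONTNEED);
    posix_fadvise(out, 0, 0, POSIX_FADV_DONTNEED);
    close(in);
    close(out);
    free(buf);
    return 0;
}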

Implementing direct I/O in Java

  Although Java provides plenty of I/O interfaces, most of the high-level ones go through the file-system cache. Working below that level requires a good grasp of NIO, whose core pieces are DirectBuffer and ByteBuffer; see "java nio深入理解之MMAP与ByteBuffer、DirectBuffer". A wrapper library on GitHub, https://github.com/lexburner/kdio/blob/master/README.md (note: it logs via log4j, so you need to take the source and port it to slf4j yourself), builds on this plus JNA, because plain Java does not natively expose the O_DIRECT flag. The drawback is that writing large files either fills the page cache or takes a performance hit, since only part of the file can be mapped at a time; see "java nio深入理解之通道Channel". O_SYNC, by contrast, can be requested explicitly. Some file-I/O optimisation practices are collected at https://www.tuicool.com/articles/eEnIny6.

Reference: https://www.ibm.com/developerworks/cn/linux/l-cn-directio/ (Linux下直接I/O的设计与实现)
