大致看了下mysql 的double write buffer概念 :

So why doublewrite is needed ?It is needed to archive data safety in case of partial page writes. Innodb does not log full pages to the log files, but uses what is called “physiological” logging which means log records contain page number for the operation as well as operation data (ie update the row) and log sequence information. Such logging structure is geat as it require less data to be written to the log, however it requires pages to be internally consistent. It does not matter which page version it is – it could be “current” version in which case Innodb will skip page upate operation or “former” in which case Innodb will perform update. If page is inconsistent recovery can’t proceed.

Now lets talk a bit about partial page writes – what are they and why are they happening. Partial page writes is when page write request submited to OS completes only partially. For example out of 16K Innodb page only first 4KB are updated and other parts remain in their former state. Most typically partial page writes happen when power failure happens. It also can happen on OS crash – there is a chance operation system will split your 16K write into several writes and failure happens just between their execution. Reasons for splitting could be file fragmentation – most file systems use 4K block sizes by default so 16K could use more than one fragment. Also if software RAID is used page may come on the stripe border requiring multiple IO requests. Same happens with Hardware RAID on power failure if it does not have battery backed up cache. If there is single write issued to the disk itself it should be in theory completed even if power goes down as there should be enough power accomulated inside the drive to complete it. I honestly do not know if this is always the case – it is hard to check as it is not the only reason for partial page writes. I just know they tend to happen and before Innodb doublewirite was implemented I had couple of data corruptions due to it.

So how does double write works ? You can think about it as about one more short term log file allocated inside Innodb tablespace – it contains space for 100 pages. When Innodb flushes pages from Innodb buffer pool it does so by multiple pages. So several pages will be written to double write buffer (sequentially), fsync() called to ensure they make it to the disk, then pages written to their real location and fsync() called the second time. Now on recovery Innodb checks doublewrite buffer contents and pages in their original location. If page is inconsistent in double write buffer it is simply discarded, if it is inconsistent in the tablespace it is recovered from double write buffer.

我是这样认为的 其实oracle中也有写 partial page(block)的可能,只不过 o的redo和controlfile 检查点比较严谨, 不在需要这样一个区域了。

对于 partial page(block) 的前滚,o也有这样的情况,例如 一个已经写好的块, 增量检查点 未必更新了控制文件, o认为 这个块不是最新的, 那么o也会同样的再写一次 ,即便这个块已经是最新的。 这种机制在oracle中称为resilver write 镀银写。

Most data blocks are changed via redo and written by DBWR. These writes are coordinated by cache locks that insure there is only one current dirty buffer for any given block. If DBWR dies for any reason, its instance will also die. It will be necessary to do some form of recovery applying redo to reconstruct the blocks that were in the cache at the time of the failure. This recovery will have to either read or write any block that is both modified by redo, and might have been in the middle of a write when DBWR died. Thus, these blocks are a superset of the blocks that Oracle must resilver if it is responsible for resilvering. Hence, during recovery, Oracle will rewrite every block it examines in files it is responsible for resilvering. Since media recovery may be used to recover changes lost when an instance dies, Oracle must also resilver when doing media recovery.

Data blocks used for sorting are modified without generating any redo. They are still written by DBWR. These blocks are never read by any process other than the one doing the sort, and if its DBWR dies then it too will die. Thus it is not important that these blocks become resilvered.

mysql里叫double write, o 里叫 resilver write。 目的应该是一样的,机制不同,mysql 里知名度很高 可能是因为其可以关闭。 o里面就没人研究能不能关这个了。

补充:  double write的核心意义在于减少partial page writes的可能性,特别是对于instance crash导致的partially written;其原理是依赖于两次先后的fsync()来保证实际的数据落盘。

对于Oracle而言,确实是可能存在partially written block的现象的,这种现象o中体现为fractured block块断裂 : A fractured block is one that was partially written to disk, leaving a previous version on part of the block.

对于mysql中的double write,其也只能减少partial page writes的可能性,是并不能杜绝的,因为其还是依赖于fsync()。假设我们有一个调皮的存储子系统,对于第一次的fsync()尽职尽责地完成了,而对于第二次的fsync()只完成了一半就成功返回了。此时如果没有发生instance crash,那么也就没有对应的crash recovery发生。MySQL应当也不会去检查double write buffer和对应的page。那么还是会造成partially written。或者在MySQL Recovery情况下检测double write buffer和page内容是读取了文件系统或存储系统内被更新的缓存,那么还是会骗过这次检测。

double write更多的是为了保护crash recovery情况下可能发生的partially written;并不能完全杜绝存储子系统去“欺骗”它。

不管是Oracle还是MySQL,基本都基于现代存储子系统是基本可靠的这一点出来来设计。

对于Oracle而言,如果在crash recovery阶段发现了fractured block块断裂,那么确实它也是无法基于redo去恢复它的。用户也需要基于backup去恢复对应的数据块。

因为oracle的redo也是仅仅记录redo vector的,而不记录block的full image,仅仅在做hot backup是记录block full image。

oracle中引入了参数_db_writer_verify_writes :Enable lost write detection mechanism

Parameter Type:string

Allowable Values: TRUE, LIOCHK, SCNCHK and FALSE.

Typical Values: User would use a value of either TRUE or FALSE.

Default Value: FALSE.

Dangerous Values: NA.

Description: This parameter will performs lost write detection.

Dynamic: No.

Syntax: _db_writer_verify_writes =

Algorithm: This parameter defaults to FALSE at startup. If set to LIOCHK we read the bunch of blocks, written by this DBWR process recently, into a PGA array and check the complete contents of the block with a memcmp. If set to SCNCHK we log the SCN of a block change into the simulated buffer header and check it back on read if the simulated buffer is still in the cache. If set to TRUE both SCNCHK and LIOCHK are performed. The performance overhead fromboth these changes is minimal.

简而言之,这个参数做的就是对DBWR写出的block,马上读取到PGA中做一个验证。这样能发现如下的情况:

fractured block 即写了一部分,partially written

Lost Write,写丢失

在Oracle中fractured block可以被算作physical corruption物理损坏,而Lost Write写丢失一般被认为是逻辑损坏。 写丢失是指这个该被写出更新的数据块一个字节都没更新,IO子系统欺骗上层DB引擎说已经写出了。

非instance crash情况下的partially written部分写块和lost write完全没写,其实是不多的;Mysql的double write机制保证了在crash recovery场景中存储子系统基本没问题的情况下,partially written和lost write不发生。

同时由于MySQL所处的存储子系统环境的复杂度往往没有oracle高,大部分采用本地磁盘或者没有复杂镜像基础的存储技术。而oracle中我们在采用大量存储镜像基础的用户哪里,是能明确遇到一些由于存储镜像技术问题造成的写丢失问题的:即Oracle写入到主镜像,存储成功返回;在后续的操作中Oracle读取到这些数据块,却没有被更新,因为读取到了没更新的存储镜像。这些故障的大量表象是ORA-08103,ORA-01410和一些ORA-00600报错。对于普通的存放ROW的block而言,lost write发生后甚至没有任何报错,也几乎无法让人察觉到,因为这个块本身内容是完全合法的。

在Oracle 11g中引入了db_lost_write_protect 参数来进一步防御写丢失,但该特性需要一般依赖于物理备库的存在。