(Translation) dd, bs= and why you should use conv=fsync

Notes adapted from this article: https://abbbi.github.io/dd/

Quick summary

If one uses dd with a bigger block size (>= 4096), be sure to use either the oflag=direct or conv=fsync option to have proper error reporting while writing data to a device. I would prefer conv=fsync, dd will then fsync() the file handle once and report the error, without having the performance impact which oflag=direct has.

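When using dd, use conv=fsync whenever possible so that I/O errors surface in dd itself instead of going unnoticed.

The author uses dd to test a scenario in which the disk has bad blocks.

Setup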

 truncate -s 1G /tmp/baddisk
 losetup /dev/loop2 /tmp/baddisk
 dmsetup create baddisk << EOF 
    0 6050 linear /dev/loop2 0
    6050 155 error
    6205 2090947 linear /dev/loop2 6205 
 EOF

After the setup, the disk's properties can be inspected:

fdisk -l

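Disk /dev/loop2: 1073 MB, 1073741824 bytes, 2097152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Each sector is 512 bytes (0.5 KiB).

The error region starts at sector 6050, which is about 3 MB into the device (6050 * 512 bytes = 3097600 bytes). Writing 4 MB with dd therefore has to reach the bad region, so we would expect an error.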
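The arithmetic above can be double-checked with a few lines of shell. This is only a minimal sketch: the variable names are illustrative, and the numbers are taken from the dmsetup table and the fdisk output shown earlier.

```shell
#!/bin/sh
# Values taken from the dmsetup table and the fdisk output above.
sector_size=512        # bytes per sector
bad_start=6050         # first sector of the "error" target
bad_len=155            # number of sectors in the "error" target
bs=4096                # dd block size used in the test
count=1000             # dd block count used in the test

bad_start_bytes=$((bad_start * sector_size))            # byte offset where the bad region begins
bad_end_bytes=$(((bad_start + bad_len) * sector_size))  # byte offset just past the bad region
total_bytes=$((bs * count))                             # bytes dd is asked to write

# 0-based index of the first and last 4096-byte block overlapping the bad region.
first_bad_block=$((bad_start_bytes / bs))
last_bad_block=$(((bad_end_bytes - 1) / bs))

echo "bad region: bytes $bad_start_bytes..$((bad_end_bytes - 1))"
echo "dd writes:  $total_bytes bytes"
echo "blocks touching the bad region: $first_bad_block..$last_bad_block"
```

Because the 4096000 bytes written exceed the 3097600-byte offset of the bad region, the write necessarily covers it.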

In practice, however, dd reports no error at all (4096 * 1000 bytes is roughly 4 MB):

dd if=/dev/zero of=/dev/mapper/baddisk bs=4096 count=1000
  4096000 bytes (4.1 MB, 3.9 MiB) copied, 0.0107267 s, 382 MB/s

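If bs is not specified, dd falls back to its default block size of 512 bytes, keeps writing, and does report the error: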

dd if=/dev/zero of=/dev/mapper/baddisk
 dd: writing to '/dev/mapper/baddisk': Input/output error
 3096576 bytes (3.1 MB, 3.0 MiB) copied, 0.0238947 s, 130 MB/s

dmesg shows the error as well:

dmesg
[8807366.717526] Buffer I/O error on device dm-0, logical block 766
[8807366.718560] lost page write due to I/O error on dm-0

Why doesn't dd report an error?

Tracing with strace

This is what I captured:

open("/dev/mapper/baddisk", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1)                              = 1
close(3)  = 0
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096

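The open, read, and write calls all succeed without any error. If flags such as O_DIRECT or O_SYNC are forced on, the error is reported.

The underlying cause is buffered I/O in the Linux kernel: with buffered I/O, a write does not reach the device immediately.

Moreover, a hardware error that hits buffered I/O cannot be seen by the write call right away. It only becomes visible when the data is written back, which is why this problem exists.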

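Specifying either of the two flags solves the problem. oflag=direct is somewhat slower because it opens the output with O_DIRECT and bypasses the page cache (O_SYNC corresponds to the separate oflag=sync), so conv=fsync is the better choice.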
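As a sketch of the recommended invocation, the same write can be repeated with conv=fsync and its exit status checked. To keep the example runnable without root, it writes to a temporary regular file (the `target` variable is an assumption made for illustration); against /dev/mapper/baddisk the same command would be expected to fail with an Input/output error.

```shell
#!/bin/sh
# Hypothetical target: a scratch file, so the example needs no root privileges.
# Point this at /dev/mapper/baddisk to reproduce the failure case.
target=$(mktemp)

# conv=fsync makes dd call fsync() on the output once before it exits,
# so a writeback error becomes a non-zero exit status.
if dd if=/dev/zero of="$target" bs=4096 count=1000 conv=fsync 2>/dev/null; then
    result="ok"
else
    result="failed"
fi
echo "dd with conv=fsync: $result"

written=$(wc -c < "$target")   # size of the file dd produced, in bytes
rm -f "$target"
```

oflag=direct can be substituted for conv=fsync in the same command, at the cost of throughput.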

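How buffered I/O works

This problem, where a buffered write hits a hardware-level error that only surfaces at writeback, and the question of how to detect the error earlier, has been discussed in many places.

For example, this Stack Overflow question

The answerer, Craig Ringer, is also a PostgreSQL developer who ran into this problem. The solution is to check fsync for errors.

Paraphrasing his answer: if you think that calling fsync in a loop until it succeeds makes everything safe, you are wrong.

In other words, if fsync returns an error, something is wrong with the hardware and the program should abort.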

which is then detected by wait_on_page_writeback_range(...) as called by do_sync_mapping_range(...) as called by sys_sync_file_range(...) as called by sys_sync_file_range2(...) to implement the C library call fsync().

But only once!

This comment on sys_sync_file_range
 * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
 * I/O errors or ENOSPC conditions and will return those to the caller, after
 * clearing the EIO and ENOSPC flags in the address_space.
suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state so a subsequent fsync() will report success even though the pages never got written.

Sure enough wait_on_page_writeback_range(...) clears the error bits when it tests them:
        /* Check for outstanding write errors */
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                ret = -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                ret = -EIO;
So if the application expects it can re-try fsync() until it succeeds and trust that the data is on-disk, it is terribly wrong.

In addition, newer kernels (after 4.9) return EIO directly. In short, always check whether fsync returns EIO.
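Applied to a script that wraps dd, this advice means treating a failed flush as fatal instead of looping until it succeeds. The following is a minimal sketch: `write_once` is a hypothetical helper name, the scratch paths are illustrative, and an unopenable destination merely stands in for a failing device.

```shell
#!/bin/sh
# write_once SRC DST: copy SRC to DST with a single fsync at the end.
# Returns dd's exit status and deliberately makes no second attempt, because a
# repeated fsync may report success even though the data never reached the disk.
write_once() {
    dd if="$1" of="$2" bs=4096 conv=fsync 2>/dev/null
}

src=$(mktemp)
dst=$(mktemp)
printf 'payload\n' > "$src"

if write_once "$src" "$dst"; then
    good_status=0
else
    good_status=$?
fi

# A destination that cannot be opened stands in for a failing device here.
if write_once "$src" "$dst.missing-dir/out"; then
    bad_status=0
else
    bad_status=$?
    echo "write failed (status $bad_status): aborting instead of retrying" >&2
fi

copied=$(cat "$dst")
rm -f "$src" "$dst"
```

The design choice is that the caller sees exactly one status per write and decides to abort on any non-zero value; there is no retry loop to mask a cleared error flag.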

The author's verification code is here: https://github.com/ringerc/scrapcode/blob/fd71dffea787847d303e22db95e8b6ca23d06a6d/testcases/fsync-error-clear/standalone/fsync-error-clear.c

The author's discussion thread on the PostgreSQL mailing list is here: https://www.postgresql.org/message-id/flat/CAMsr%2BYE5Gs9iPqw2mQ6OHt1aC5Qk5EuBFCyG%2BvzHun1EqMxyQg%40mail.gmail.com#CAMsr+YE5Gs9iPqw2mQ6OHt1aC5Qk5EuBFCyG+vzHun1EqMxyQg@mail.gmail.com

LWN also has two articles:

https://lwn.net/Articles/724307/

https://lwn.net/Articles/752063/

They also describe the kernel changes needed for better I/O error handling, which are not covered here.

ref

https://hustcat.github.io/blkcg-buffered-io/ This blog is also good, though it does not seem to do any SEO.