valgrind 3.10 日志是这样的

valgrind: m_syswrap/syswrap-linux.c:5255 (vgSysWrap_linux_sys_fcntl_before): Assertion 'Unimplemented functionality' failed.
valgrind: valgrind

host stacktrace:
==26531==    at 0x3805DC16: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26531==    by 0x3805DD24: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26531==    by 0x3805DEA6: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26531==    by 0x380D53A0: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26531==    by 0x380B0834: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26531==    by 0x380AD242: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26531==    by 0x380AE6F6: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26531==    by 0x380BDD7C: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
==26531==    at 0x60903A4: fcntl (in /usr/lib64/libpthread-2.17.so)
==26531==    by 0x53571F6: rocksdb::PosixWritableFile::SetWriteLifeTimeHint(rocksdb::Env::WriteLifeTimeHint) (io_posix.cc:897)
...

找到rocksdb代码,接口是这样的

env/io_posix.cc

void PosixWritableFile::SetWriteLifeTimeHint(Env::WriteLifeTimeHint hint) {
#ifdef OS_LINUX
// Suppress Valgrind "Unimplemented functionality" error.
#ifndef ROCKSDB_VALGRIND_RUN
  if (hint == write_hint_) {
    return;
  }
  if (fcntl(fd_, F_SET_RW_HINT, &hint) == 0) {
    write_hint_ = hint;
  }
#else
  (void)hint;
#endif // ROCKSDB_VALGRIND_RUN
#else
  (void)hint;
#endif // OS_LINUX
}

…结果显然了,不支持fcntl F_SET_RW_HINT选项。

注意:

  • 如果需要跑valgrind,编译的rocksdb需要定义ROCKSDB_VALGRIND_RUN
  • 如果有必要,最好也定义PORTABLE,默认是march=native可能会遇到指令集不支持

特意搜了一下,这个参数是这样定义的

env/io_posix.cc

#if defined(OS_LINUX) && !defined(F_SET_RW_HINT)
#define F_LINUX_SPECIFIC_BASE 1024
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif

在linux中是这样的

https://github.com/torvalds/linux/blob/dd5001e21a991b731d659857cd07acc7a13e6789/include/uapi/linux/fcntl.h#L53

/*
 * Set/Get write life time hints. {GET,SET}_RW_HINT operate on the
 * underlying inode, while {GET,SET}_FILE_RW_HINT operate only on
 * the specific file.
 */
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
#define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)

另外这里也移植了一个https://github.com/riscv/riscv-gnu-toolchain/blob/master/linux-headers/include/linux/fcntl.h

write life time hints

搜到了这个实现的patch https://patchwork.kernel.org/patch/9794403/

和这个介绍,linux 4.13引入的https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-Write-Hints

简单说,这是为NVMe加上的功能,暗示写入数据是ssd的话,调度就会把数据尽可能靠近,这样方便后续回收

patchset讨论里还特意说到了rocksdb。。。https://lwn.net/Articles/726477/

降低写放大十分可观

A new iteration of this patchset, previously known as write streams. As before, this patchset aims at enabling applications split up writes into separate streams, based on the perceived life time of the data written. This is useful for a variety of reasons:

  • For NVMe, this feature is ratified and released with the NVMe 1.3 spec. Devices implementing Directives can expose multiple streams. Separating data written into streams based on life time can drastically reduce the write amplification. This helps device endurance, and increases performance. Testing just performed internally at Facebook with these patches showed up to a 25% reduction in NAND writes in a RocksDB setup.

  • Software caching solutions can make more intelligent decisions on how and where to place data.

这背后又有一个NVMe特性,Stream ID,https://lwn.net/Articles/717755/

一个用法,见pg的patch以及讨论 https://www.postgresql.org/message-id/CA%2Bq6zcX_iz9ekV7MyO6xGH1LHHhiutmHY34n1VHNN3dLf_4C4Q%40mail.gmail.com

这里还没有更深入讨论。只是罗列了资料。下班了,就到这里

ref

contacts