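OCC, 2PL, and other concepts

11 Apr 2019

why

Catching up on fundamentals.

Database concurrency control is implemented in two main ways: with locks or with timestamp ordering.

Lock-based: Two-Phase Locking (2PL)

2PL

Growing Phase: acquire locks and modify data inside the transaction without causing conflicts
Shrinking Phase: release locks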
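The two-phase rule can be sketched in a few lines. This is only an illustration: `TwoPhaseTxn` and its members are hypothetical names, and it tracks a single transaction's own locks without modelling a shared lock table or blocking.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Tracks the locks of one transaction and enforces the two-phase rule:
// once any lock has been released, no new lock may be acquired.
class TwoPhaseTxn {
    std::set<std::string> held_;
    bool shrinking_;
public:
    TwoPhaseTxn() : shrinking_(false) {}

    // Growing phase: allowed only while nothing has been released yet.
    void lock(const std::string& item) {
        if (shrinking_) throw std::logic_error("2PL violation: lock after unlock");
        held_.insert(item);
    }

    // Shrinking phase: the first unlock ends the growing phase for good.
    void unlock(const std::string& item) {
        assert(holds(item));
        shrinking_ = true;
        held_.erase(item);
    }

    bool holds(const std::string& item) const { return held_.count(item) != 0; }
};
```

A transaction that calls `lock("a")`, `lock("b")`, `unlock("a")` and then `lock("c")` gets an exception on the last call, which is exactly the ordering 2PL forbids.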

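Drawback: deadlocks that come from the workload's own access order cannot be avoided.

Timestamp-based: this approach can implement Optimistic Concurrency Control (OCC) and MVCC

Timestamp ordering: assume by default that transactions do not conflict; if validation finds that the data has moved ahead of what the transaction saw, the transaction fails.

Read Phase (execute phase is a better name): reads are recorded in the Read Set; writes go to a temporary copy and are placed in the Write Set
Validation Phase: rescan the Read Set and Write Set and verify the isolation level; commit if it holds, otherwise abort
Write Phase (commit phase is a better name): commit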
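The three phases above can be sketched as follows. This is a minimal single-threaded model under my own assumptions: `Record`, `Store`, and `OccTxn` are hypothetical names, validation is done by comparing per-key version counters, and a real engine would run validation and the write phase atomically.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

struct Record { std::string value; std::uint64_t version; };
using Store = std::map<std::string, Record>;  // a missing key counts as version 0

struct OccTxn {
    std::map<std::string, std::uint64_t> read_set;   // key -> version first observed
    std::map<std::string, std::string>   write_set;  // key -> buffered new value

    // Read (execute) phase: remember which version was seen.
    std::string read(const Store& db, const std::string& key) {
        auto w = write_set.find(key);
        if (w != write_set.end()) return w->second;   // read our own write
        auto it = db.find(key);
        read_set.emplace(key, it == db.end() ? 0 : it->second.version);
        return it == db.end() ? std::string() : it->second.value;
    }

    // Writes only touch the private copy.
    void write(const std::string& key, const std::string& value) { write_set[key] = value; }

    // Validation phase: every key read must still be at the version seen.
    bool validate(const Store& db) const {
        for (const auto& kv : read_set) {
            auto it = db.find(kv.first);
            std::uint64_t now = it == db.end() ? 0 : it->second.version;
            if (now != kv.second) return false;
        }
        return true;
    }

    // Write (commit) phase: install buffered writes, or abort.
    bool commit(Store& db) {
        if (!validate(db)) return false;
        for (const auto& kv : write_set) {
            Record& r = db[kv.first];
            r.value = kv.second;
            ++r.version;
        }
        return true;
    }
};
```

If two transactions both read `x` and then write it, the first `commit` succeeds and bumps the version, so the second fails validation and has to retry.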

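MVCC has several implementations. They usually maintain visibility with multiple snapshots plus timestamps. Two implementations:

MV-2PL: MySQL
MV-TO: PostgreSQL (SSI)

Main operations

Update: create a new version
Delete: update the End timestamp
Read: decide visibility from the Begin and End timestamps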
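The three operations map onto a version chain quite directly. The sketch below is an assumption-laden illustration: `Version`, `VersionChain`, and `kInfinity` are made-up names, timestamps are plain integers, and it ignores uncommitted versions and garbage collection.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

constexpr std::uint64_t kInfinity = std::numeric_limits<std::uint64_t>::max();

// One version of a row, valid for timestamps in [begin_ts, end_ts).
struct Version { std::string value; std::uint64_t begin_ts; std::uint64_t end_ts; };

struct VersionChain {
    std::vector<Version> versions;  // oldest first

    // Update: end the current live version and append a new one.
    void update(const std::string& value, std::uint64_t ts) {
        remove(ts);
        versions.push_back(Version{value, ts, kInfinity});
    }

    // Delete: only the End timestamp of the live version changes.
    void remove(std::uint64_t ts) {
        if (!versions.empty() && versions.back().end_ts == kInfinity)
            versions.back().end_ts = ts;
    }

    // Read: a version is visible iff begin_ts <= read_ts < end_ts.
    const Version* read(std::uint64_t read_ts) const {
        for (auto it = versions.rbegin(); it != versions.rend(); ++it)
            if (it->begin_ts <= read_ts && read_ts < it->end_ts) return &*it;
        return nullptr;  // nothing visible at this timestamp
    }
};
```

Readers never block writers here: a reader with an older timestamp keeps seeing the older version even after a newer one is appended.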

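Snapshots keep reads and writes from blocking each other, but serializability still requires constraining the order of reads and writes

Isolation levels and their effects

Anomalies: dirty read, non-repeatable read, phantom read, lost update

Isolation levels

Isolation level	Anomalies still possible
Serializable	none
Repeatable Read (RR)	phantom read
Read Committed (RC)	non-repeatable read, phantom read
Read Uncommitted (RU)	dirty read, non-repeatable read, phantom read

Snapshot isolation (SI) and serializability

Snapshot Isolation

Under Snapshot Isolation none of the three read anomalies (dirty read, non-repeatable read, phantom read) can occur. Reads are never blocked, which makes Snapshot Isolation a very good choice for read-heavy applications. In many application scenarios, concurrent transactions under Snapshot Isolation also do not cause data anomalies. For these reasons mainstream databases implement Snapshot Isolation, for example Oracle, SQL Server, PostgreSQL, TiDB, and CockroachDB

Although Snapshot Isolation works well in most scenarios, it still does not reach the serializable isolation level, because write skew can occur. Write skew is essentially a read-write conflict between concurrent transactions (a read-write conflict does not always lead to write skew, but write skew always involves a read-write conflict), whereas Snapshot Isolation checks only write-write conflicts at commit time.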
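To make write skew concrete, here is a toy model of the well-known "two doctors on call" example. Everything in it is a simplification of my own: `SiTxn`, `commit`, and `go_off_call` are hypothetical names, a snapshot is a full copy of the table, and the set `written_since_start` stands in for the timestamp comparison a real engine would do.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Table = std::map<std::string, int>;  // doctor -> 1 if on call, 0 otherwise

struct SiTxn {
    Table snapshot;  // private copy taken when the transaction starts
    Table writes;    // buffered writes, installed at commit
    explicit SiTxn(const Table& db) : snapshot(db) {}
    int read(const std::string& key) const { return snapshot.at(key); }
    void write(const std::string& key, int value) { writes[key] = value; }
};

// Commit with a write-write check only (first committer wins).
bool commit(SiTxn& txn, Table& db, std::set<std::string>& written_since_start) {
    for (const auto& kv : txn.writes)
        if (written_since_start.count(kv.first)) return false;  // ww conflict: abort
    for (const auto& kv : txn.writes) {
        db[kv.first] = kv.second;
        written_since_start.insert(kv.first);
    }
    return true;
}

// Invariant the application wants: at least one doctor stays on call.
// Each transaction checks it against its own snapshot before writing.
void go_off_call(SiTxn& txn, const std::string& me, const std::string& other) {
    if (txn.read(me) + txn.read(other) >= 2) txn.write(me, 0);
}
```

When two transactions start from the same snapshot and each takes a different doctor off call, their write sets are disjoint, so both commits pass the write-write check and the invariant is broken. The conflict is between one transaction's read and the other's write, which this check never looks at.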

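To avoid write skew, the application has to adapt case by case, for example by using SELECT … FOR UPDATE or by introducing a write-write conflict at the application layer. This effectively pushes part of the database's transactional work onto the application.

Serializable Snapshot Isolation

Later, serializability built on top of Snapshot Isolation was proposed: Serializable Snapshot Isolation, or SSI for short (PostgreSQL and CockroachDB already support SSI). To analyze whether a transaction schedule under Snapshot Isolation is serializable, a paper proposed a method called the Dependency Serialization Graph (DSG) (see the papers mentioned below; I did not track down the original source). Analyzing the rw, wr, and ww dependencies between transactions yields a directed graph. If the graph has no cycle, the schedule is serializable. The algorithm is elegant in theory but has a critical drawback: its complexity is high, which makes it hard to use in production.

Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions proves that under Snapshot Isolation any cycle in the DSG must contain two rw-dependency edges. Making snapshot isolation serializable goes further and proves that these two rw-dependency edges are "consecutive" (one incoming, one outgoing). Later, Serializable Isolation for snapshot database added rw-dependency detection on top of Berkeley DB's Snapshot Isolation: when two "consecutive" rw-dependencies are found, one of the transactions is aborted, which rules out non-serializable executions. The algorithm does produce false positives: every non-serializable schedule contains two "consecutive" rw-dependency edges, but two "consecutive" rw-dependency edges do not necessarily make a schedule non-serializable.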
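The "two consecutive rw-dependency edges" check can be approximated with two flags per transaction. The sketch below is my reading of that idea, not the Berkeley DB or PostgreSQL code: `SsiTxn`, `is_pivot`, and `add_rw_dependency` are hypothetical names, and it leaves out how the dependencies are discovered in the first place.

```cpp
#include <cassert>

// Conflict flags kept per transaction.
struct SsiTxn {
    bool in_conflict;   // some transaction has an rw-dependency into this one
    bool out_conflict;  // this transaction has an rw-dependency out to another
};

// A transaction with one incoming and one outgoing rw edge sits in the
// middle of two "consecutive" rw-dependencies.
bool is_pivot(const SsiTxn& t) { return t.in_conflict && t.out_conflict; }

// Record reader --rw--> writer: reader saw a version that writer replaced.
// Returns true when either side has become a pivot, i.e. one transaction
// should be aborted to stay on the safe side.
bool add_rw_dependency(SsiTxn& reader, SsiTxn& writer) {
    reader.out_conflict = true;
    writer.in_conflict = true;
    return is_pivot(reader) || is_pivot(writer);
}
```

The flags do not check that the edges actually close a cycle, which is where the false positives mentioned above come from: T1 -> T2 -> T3 triggers an abort even if no dependency ever leads back to T1.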

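Serializable Snapshot Isolation in PostgreSQL describes how the algorithm above is implemented in PostgreSQL. Both SSI implementations mentioned above, Berkeley DB and PostgreSQL, are single-node stores. A Critique of Snapshot Isolation describes how to implement SSI on a distributed storage system. The basic idea is to check all rw-dependencies through a centralized control node; see the paper if you are interested.

Read-write conflicts, write-write conflicts, write skew, the black and white balls example, serializability, and SSI

See reference links¹⁹

OCC Silo: see reference link 12. It covers many details: epoch, memory fence, masstree, OCC, and so on

Or open an issue on the blog; I will get an email notification.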

reference

1. Isolation levels, SI, SSI https://www.jianshu.com/p/c348f68fecde
2. The various kinds of reads in MySQL https://www.jianshu.com/p/69fd2ca17cfd
3. Difference between OCC and MVCC https://www.zhihu.com/question/60278698
4. The past and present of concurrency control? http://www.vldb.org/pvldb/vol8/p209-yu.pdf
5. Database management systems: an introduction to concurrency control https://zhuanlan.zhihu.com/p/20550159
6. MyRocks transaction implementation http://mysql.taobao.org/monthly/2016/11/02/
7. MyRocks DDL https://www.cnblogs.com/cchust/p/6716823.html
8. Analysis of RocksDB TransactionDB https://yq.aliyun.com/articles/257424
9. How CockroachDB uses RocksDB (second half) https://blog.csdn.net/qq_38125183/article/details/81591285
10. How CockroachDB uses RocksDB http://www.cockroachchina.cn/?p=1242
11. RocksDB locking mechanism; the article above also mentions the related deadlock detection https://www.cnblogs.com/cchust/p/7107392.html
12. occ-silo, on high-performance OCC https://www.tuicool.com/articles/VZVFnaR
13. Distributed transactions, a good article http://www.zenlife.tk/distributed-transaction.md
14. Transaction isolation revisited https://cloud.tencent.com/developer/news/233615
15. Transaction isolation levels, from SI to RC http://www.zenlife.tk/si-rc.md
16. MVCC transaction mechanism: SI http://www.nosqlnotes.com/technotes/mvcc-snapshot-isolation/
17. MVCC transaction mechanism: logical clocks http://www.nosqlnotes.com/technotes/mvcc-logicalclock/
18. MVCC: hybrid logical clocks http://www.nosqlnotes.com/technotes/mvcc-hybridclock/
19. CockroachDB MVCC http://www.nosqlnotes.com/technotes/cockroach-mvcc/

fancy pointers for fun and profit

06 Apr 2019

The talk is about fancy pointers, which the speaker calls synthetic pointers

reference

https://github.com/joboccara/NamedType


allocator Is to Allocation what vector Is to Vexation

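06 Apr 2019

The talk covers the design history of allocators and is given by AA (Andrei Alexandrescu). The title is pointedly sarcastic. In short, std::allocator was a design mistake (back then there was still resistance to bringing virtual into the standard library because it was not seen as zero cost), which is why C++17 added std::pmr

Starting from malloc

void* malloc(size_t size);
void free(void* p);

Calling malloc requires the size and free does not, but internally there is a trick for remembering the size -> an allocator must know the size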

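Improvement 0.1

struct blk {void* ptr; size_t length;};
struct blk malloc(size_t size);
void free(struct blk block);

New approach: the operator new API comes in many variants and can take a size

Problems

Cannot be combined with malloc
Requires specifying a type
Has odd syntax (placement new, presumably?)
No communication with constructors
Array new adds further divergence (this could also count as odd syntax)

Proposal N3536 (Problem section) also notes that delete without a size can cause performance problems for some allocators (without the size, the allocator either has to store it or, if it stores by blocks, has to search the blocks), and it proposes a fix

Then std::allocator was introduced. The main reason it is not a good allocator is its design

The type parameter T causes trouble
- Disagreement about how to read the standard
- allocator turned into a factory
- Underneath it is still void*
- An allocator should work in units of blocks
- rebind<U>::other is evil through and through
Stateless
- It is effectively a global singleton (monostate)
Hard problem: composition
- Allocators are usually built by combining blocks of various sizes with lists, trees, and freelists. How to compose them, debug them, and observe their state are all problems

Redesign

Efficiency
- Give the caller the size information
- scoped allocation patterns
- Thread-Local allocation patterns
Features
- Better configuration (debug/stat)
- Specialization and adaptation
- no legacy, no nonsense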

template <class Primary, class Fallback>
class FallbackAllocator
    : private Primary
    , private Fallback {
public:
    blk allocate(size_t);
    void deallocate(blk);
    bool owns(blk);
};

Primary and Fallback are both allocators, with Fallback as the safety net. That raises the question of telling their blocks apart, so each should implement an owns function for the composing allocator to call. At least one of them has to define it (deallocate only calls P::owns); the rest relies on MDFINAE: Method Definitions Failure Is Not an Error

template <class P, class F>
blk FallbackAllocator<P, F>::allocate(size_t n) {
    blk r = P::allocate(n);
    if (!r.ptr) r = F::allocate(n);
    return r;
}
template <class P, class F>
void FallbackAllocator<P, F>::deallocate(blk b) {
    if (P::owns(b)) P::deallocate(b);
    else F::deallocate(b);
}
template <class P, class F>
bool FallbackAllocator<P, F>::owns(blk b) {
    return P::owns(b) || F::owns(b);
}

Writing a StackAllocator step by step

template <size_t s> class StackAllocator {
    char d_[s];
    char* p_;
public:
    StackAllocator() : p_(d_) {}
    blk allocate(size_t n) {
        auto n1 = roundToAligned(n);
        if (n1 > size_t((d_ + s) - p_)) {
            return { nullptr, 0 };
        }
        blk result = { p_ , n };
        p_ += n1;
        return result;
    }

    void deallocate(blk b) {
        // Only the most recently allocated block can be given back
        if ((char*)b.ptr + roundToAligned(b.length) == p_ ) {
            p_ = (char*)b.ptr;
        }
    }
    bool owns(blk b) {
        return b.ptr >= d_ && b.ptr < d_ + s;
    }
    // NEW: deallocate everything in O(1)
    void deallocateAll() {
        p_ = d_ ;
    }
...
};
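To see the pieces work together, here is a self-contained version that compiles on its own. It is my own condensation of the slide code rather than the talk's exact source: `Mallocator` is a hypothetical malloc-backed allocator, and `roundToAligned` is assumed to round up to `alignof(std::max_align_t)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>

struct blk { void* ptr; std::size_t length; };

// Hypothetical heap allocator; note that it has no owns().
struct Mallocator {
    blk allocate(std::size_t n) { return { std::malloc(n), n }; }
    void deallocate(blk b) { std::free(b.ptr); }
};

template <std::size_t S>
class StackAllocator {
    alignas(std::max_align_t) char d_[S];
    char* p_;
    static std::size_t roundToAligned(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }
public:
    StackAllocator() : p_(d_) {}
    blk allocate(std::size_t n) {
        const std::size_t n1 = roundToAligned(n);
        if (n1 > static_cast<std::size_t>((d_ + S) - p_)) return { nullptr, 0 };
        blk result = { p_, n };
        p_ += n1;
        return result;
    }
    void deallocate(blk b) {
        char* p = static_cast<char*>(b.ptr);
        if (p + roundToAligned(b.length) == p_) p_ = p;  // only the top block is reclaimed
    }
    bool owns(blk b) const {
        std::less<const char*> before;  // well-defined order even for unrelated pointers
        const char* p = static_cast<const char*>(b.ptr);
        return !before(p, d_) && before(p, d_ + S);
    }
};

template <class Primary, class Fallback>
class FallbackAllocator : private Primary, private Fallback {
public:
    blk allocate(std::size_t n) {
        blk r = Primary::allocate(n);
        if (!r.ptr) r = Fallback::allocate(n);
        return r;
    }
    void deallocate(blk b) {
        if (Primary::owns(b)) Primary::deallocate(b);
        else Fallback::deallocate(b);
    }
    // Member functions of a class template are instantiated only when used,
    // so this compiles with Mallocator as long as nobody calls it.
    bool owns(blk b) { return Primary::owns(b) || Fallback::owns(b); }
};
```

With `FallbackAllocator<StackAllocator<64>, Mallocator>`, a first 48-byte request is served from the stack buffer and a second one no longer fits, so it falls through to malloc. `deallocate` routes each block back to the right allocator through `Primary::owns`.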

Writing a freelist step by step

template <class A, size_t s> class Freelist {
    A parent_ ;
    struct Node { Node * next; };
    Node* root_ = nullptr;
public:
    blk allocate(size_t n) {
        if (n == s && root_ ) {
            blk b = { root_ , n };
            root_ = root_->next;
            return b;
        }
        return parent_.allocate(n);
    }
    bool owns(blk b) {
        return b.length == s || parent_.owns(b);
    }
    void deallocate(blk b) {
        if (b.length != s) return parent_.deallocate(b);
        auto p = (Node * )b.ptr;
        p->next = root_ ;
        root_ = p;
    }
...
};

It can be improved further, for example with a min/max range, allocating in batches, and so on

Adding debug information

template <class A, class Prefix, class Suffix = void>
class AffixAllocator;

Add suitable prefix and suffix parameters; this is effectively a template decorator

Similarly

template <class A, ulong flags>
class AllocatorWithStats;

Collects calls to each primitive, error information, memory usage, and call-site details (time, line number, file, and so on)

Bitmapped block

Effectively every block is static

template <class A, size_t blockSize>
class BitmappedBlock;

Block size is fixed in advance
Simpler than malloc
Not multithreading friendly

CascadingAllocator

template <class Creator>
class CascadingAllocator;
...
auto a = cascadingAllocator([]{
    return Heap<...>();
});

A collection of allocators that grows slowly
Coarse granularity
Linear search

Segregator

Segregation; it feels like a combination of several freelists

template <size_t threshold, class SmallAllocator, class LargeAllocator>
struct Segregator;

• Uses threshold as the boundary and dispatches to SmallAllocator or LargeAllocator
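A minimal Segregator fits in a few lines. The sketch assumes that requests of exactly `threshold` bytes go to the small allocator; the slides only give the declaration, so that boundary is my guess. `CountingMallocator` is a hypothetical helper used to observe which side served a request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct blk { void* ptr; std::size_t length; };

// Hypothetical malloc-backed allocator that counts allocations. The Tag
// parameter makes two otherwise identical allocators distinct types, so
// both can be base classes of the same Segregator.
template <int Tag>
struct CountingMallocator {
    static int& allocations() { static int n = 0; return n; }
    blk allocate(std::size_t n) { ++allocations(); return { std::malloc(n), n }; }
    void deallocate(blk b) { std::free(b.ptr); }
};

template <std::size_t threshold, class SmallAllocator, class LargeAllocator>
struct Segregator : private SmallAllocator, private LargeAllocator {
    blk allocate(std::size_t n) {
        return n <= threshold ? SmallAllocator::allocate(n)
                              : LargeAllocator::allocate(n);
    }
    // The block carries its length, so routing needs no owns() lookup.
    void deallocate(blk b) {
        if (b.length <= threshold) SmallAllocator::deallocate(b);
        else LargeAllocator::deallocate(b);
    }
};
```

Because `blk` remembers the size, `deallocate` can pick the right side without asking either allocator whether it owns the block, which is one payoff of giving the size back to the caller.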

It can even be composed with itself to control granularity

typedef Segregator<4096,
    Segregator<128,
        Freelist<Mallocator, 0, 128>,
        MediumAllocator>,
    Mallocator>
Allocator;

Various search strategies can be combined as well, but they are constrained by size

Bucketizer

This one is simply size buckets

template <class Allocator, size_t min, size_t max, size_t step>
struct Bucketizer;

• [min, min + step), [min + step, min + 2*step)…
• A finite number of buckets
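The bucket ranges above boil down to one integer division. `bucket_index` is a hypothetical helper, not something from the slides, and treating `max` as inclusive is an assumption on my part.

```cpp
#include <cassert>
#include <cstddef>

// Index of the bucket that serves a request of n bytes, or -1 when n lies
// outside [min, max]. Bucket i covers [min + i*step, min + (i+1)*step).
constexpr long bucket_index(std::size_t n, std::size_t min,
                            std::size_t max, std::size_t step) {
    return (n < min || n > max) ? -1L
                                : static_cast<long>((n - min) / step);
}
```

For `min = 1, max = 128, step = 16` this gives eight buckets: sizes 1 to 16 land in bucket 0, 17 to 32 in bucket 1, and 113 to 128 in bucket 7.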

Those are the mainstream allocator strategies

Allocator copy semantics

Independent, stateless allocators: copyable and movable
Neither copyable nor movable, e.g. StackAllocator
Movable but not copyable: fine as long as no member stores the heap
Movable, with reference counting

There are other granularity controls as well, such as type control, factory functions, design, and block design. I will not list them here

using FList = Freelist<Mallocator, 0, -1>;
using A = Segregator<
    8, Freelist<Mallocator, 0, 8>,
    128, Bucketizer<FList, 1, 128, 16>,
    256, Bucketizer<FList, 129, 256, 32>,
    512, Bucketizer<FList, 257, 512, 64>,
    1024, Bucketizer<FList, 513, 1024, 128>,
    2048, Bucketizer<FList, 1025, 2048, 256>,
    3584, Bucketizer<FList, 2049, 3584, 512>,
    4072*1024, CascadingAllocator<decltype(newHeapBlock)>,
    Mallocator
>;

Summary

Fresh approach from first principles
Understanding history
- Otherwise: “…doomed to repeat it”.
Composability is key

reference

https://github.com/CppCon/CppCon2015/tree/master/Presentations/allocator%20Is%20to%20Allocation%20what%20vector%20Is%20to%20Vexation
It mentions the CppCon 2014 talk Making Allocators Work; I need to dig that one up and watch it
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3536.html
Oddly enough, searching for monostate turned up this: https://en.cppreference.com/w/cpp/utility/variant/monostate


Avoiding Disasters with Strongly Typed C++

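05 Apr 2019

The talk covers type ambiguity and strong-typing solutions

Typical scenario

foo(int index,int offset);

It is easy to mix up the arguments. Similarly there is bool hell: functions that take a pile of bool parameters

The fix is to wrap values in structs to strengthen the types; see reference links 1 and 2

Concretely, wrap the underlying type with various policy classes and a get accessor; going further, build type traits for the different physical units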
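The core of the wrapper idea fits in a dozen lines. This is a stripped-down sketch in the spirit of NamedType, without its policy classes; `StrongType`, `Index`, `Offset`, and the body of `foo` are made up for illustration.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Wraps a T and tags it with an otherwise unused type, so two wrappers
// around the same T are different, non-interchangeable types.
template <class T, class Tag>
class StrongType {
    T value_;
public:
    explicit StrongType(T value) : value_(std::move(value)) {}
    T& get() { return value_; }
    const T& get() const { return value_; }
};

using Index  = StrongType<int, struct IndexTag>;
using Offset = StrongType<int, struct OffsetTag>;

// The call site now has to spell out which argument is which.
int foo(Index index, Offset offset) { return index.get() * 100 + offset.get(); }

static_assert(!std::is_same<Index, Offset>::value,
              "Index and Offset are distinct types");
static_assert(!std::is_convertible<int, Index>::value,
              "a raw int does not convert implicitly");
```

`foo(Index(1), Offset(2))` compiles, while `foo(1, 2)` and `foo(Offset(2), Index(1))` are rejected at compile time.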


It then introduces the units in std::chrono, which are built on std::ratio; in the same way, std::ratio can be used to implement other units
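As a small illustration of reusing std::ratio for a unit other than time, here is a length type modelled loosely on how std::chrono::duration carries its period. `Length`, `Meters`, `Kilometers`, and `length_cast` are hypothetical names of my own.

```cpp
#include <cassert>
#include <ratio>

// A length measured in units of Ratio meters (std::kilo means kilometers).
template <class Ratio>
struct Length {
    using ratio = Ratio;
    double count;
};

using Meters     = Length<std::ratio<1>>;
using Kilometers = Length<std::kilo>;

// Converts between units; the factor is computed at compile time,
// much like std::chrono::duration_cast does for time.
template <class To, class FromRatio>
To length_cast(Length<FromRatio> from) {
    using Conv = std::ratio_divide<FromRatio, typename To::ratio>;
    return To{ from.count * Conv::num / Conv::den };
}
```

`Meters` and `Kilometers` are unrelated types, so passing one where the other is expected fails to compile, and every conversion has to go through `length_cast`.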

reference

https://github.com/joboccara/NamedType
https://github.com/foonathan/type_safe
Here is a use of std::ratio for dimensional analysis; the topic is still the units problem discussed in that TMP book https://benjaminjurke.com/content/articles/2015/compile-time-numerical-unit-dimension-checking/


State Machines Battlefield-Naive vs STL vs Boost

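04 Apr 2019

The talk compares the efficiency of various state machine implementations. Source code², project documentation¹, and slides³ are in the reference links

In short, SML holds its own in every benchmark comparison; the talk then lists the pros and cons of each implementation

For the actual comparison charts see the slides³; taking screenshots one by one is too tedious. Here I mainly list the pros and cons of each implementation. All the code is in reference link⁴

if/else state machine

(+) Inlined
(+) No heap allocation
(~) Small memory footprint
(-) Not reusable; if-else hell

switch/enum state machine

(+) Inlined
(+) No heap allocation
(~) Small memory footprint
(-) Not reusable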
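For reference, a switch/enum state machine looks roughly like this. The states and events are a hypothetical connection example I picked for illustration, not the exact code from the talk.

```cpp
#include <cassert>

enum class State { Disconnected, Connecting, Connected };
enum class Event { Connect, Established, Timeout, Disconnect };

// One transition function: no heap, trivially inlinable, but the whole
// table is hard-coded, so nothing here can be reused for another machine.
State next(State s, Event e) {
    switch (s) {
    case State::Disconnected:
        if (e == Event::Connect) return State::Connecting;
        break;
    case State::Connecting:
        if (e == Event::Established) return State::Connected;
        if (e == Event::Timeout) return State::Disconnected;
        break;
    case State::Connected:
        if (e == Event::Disconnect) return State::Disconnected;
        break;
    }
    return s;  // events that do not apply leave the state unchanged
}
```

All the state lives in a single enum value, which explains the small footprint. Every new state or event means editing this one function, which explains the reuse problem.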

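Inheritance (State pattern)

(+) Easy to reuse and extend (just override the interface)
(~) Slightly higher memory footprint
(-) Heap allocation
(-) Cannot be inlined; poor efficiency

std::variant + std::visit

(+) Low memory footprint, efficient
(+) Integrates with std::expected
(~) Inlined (clang)
(-) Not reusable; similar to switch/enum but with stronger typing

coroutines + loop

(+) C++ language feature, well organized
(+) Easy to switch between synchronous and asynchronous versions
(~) Steep learning curve; a different way of thinking from the approaches above
(~) Heap memory (heap elision / devirtualization)
(~) Implicit state (default function behavior has to be provided)
(-) All events have the same type
(-) Odd infinite-loop style

coroutines + goto

(+) No infinite loop
(+) Explicit state
(-) goto

coroutines + functions + variant

Move the infinite loop into functions that use co_return

(+) Easy to maintain and to add new events and states
(+) Type-safe events

boost statechart

(+) UML
(~) Steep learning curve; similar to the State pattern, you write interfaces
(-) Dynamic types
(-) Dynamic dispatch
(-) High memory usage

boost msm

(+) UML, declarative
(+) Dispatch implemented with a jump table
(+) Very low memory usage
(~) Steep learning curve
(~) DSL
(-) Macros
(-) Long compile times
(-) Horrifying error messages

boost sml

Modern; several dispatch strategies can be selected: jump table / nested switch / fold expressions

(+) UML, declarative
(+) Compile-time customization
(+) Inlined, O(1) dispatch
(+) Fast compilation
(+) Small memory footprint
(~) DSL
(~) Steep learning curve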

reference


A Semi Compile Run-time Map with Nearly Zero Overhead Lookup

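03 Apr 2019

The talk presents a static map that guarantees O(1) lookup, supports runtime modification, and has no collision problem

The speaker describes a scenario: he was using a static std::unordered_map and calling try_emplace frequently, which runs into collisions

Why not go straight to a compile-time map, such as cx::map from "constexpr all the things"³ or boost::hana::map

But the scenario requires runtime modification with compile-time lookup, without knowing the key-value pairs in advance

To design the KV store for this requirement, the key has to be a non-type template parameter (NTTP), but that is only supported from C++20. The workaround is the same trick boost::hana uses: wrap the key in a lambda, then convert the lambda key into a type key, taking the long way round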

template <auto...> struct dummy_t{};
template <typename Lambda>
constexpr auto key2type(Lambda lambda){
  return dummy_t<lambda()>{};
}
#define ID(x) []() constexpr {return x;}
//map<decltype(key2type(ID(5)))>

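Strings are still a problem: foo has to be expanded into dummy_t<'f','o','o'>

template <typename Lambda, std::size_t... I>
constexpr auto str2type(Lambda lambda, std::index_sequence<I...>){
    return dummy_t<lambda()[I]...>{};
}

template <typename Lambda>
constexpr auto key2type(Lambda lambda){
  // strlen is not constexpr in standard C++; this relies on the compiler evaluating it as a builtin
  return str2type(lambda,std::make_index_sequence<strlen(lambda())>{});
}

What a piece of code: make_index_sequence generates an index sequence, and lambda()[I]… inside dummy_t is exactly an array expansion, "foo" -> 'f', 'o', 'o'. Brilliant (I think I have seen boost::hana use the same technique somewhere)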

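The overall skeleton looks like this

template <typename Key, typename Value>
class static_map{
public:
  template <typename Lambda>
  static Value& get(Lambda lambda){
    static_assert(std::is_convertible_v<decltype(lambda()),Key>);
    return get_internal<decltype(key2type(lambda))>();
  }
private:
  template <typename>
  static Value& get_internal(){
    static Value value;
    return value;
  }
};

This is still really a static table with no dynamic capability. One option is to add a std::unordered_map and store pointers in it from get_internal, doing a placement new directly when a value changes. That option still has the unordered_map problem, namely call overhead, so it cannot live inside get_internal

The final design uses placement new: an internal array holds the value (there may be several copies, depending on the Value type), plus a runtime_map that stores the key and a pointer to the value array; init_flag tracks initialization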

struct ConstructorInvoker{
    ConstructorInvoker(char* mem){
        new(mem) Value;
    }
};

template <typename>
static Value& get_internal(){
    alignas (Value) static char storage[sizeof(Value)];
    static ConstructorInvoker invoker(storage);
    return *reinterpret_cast<Value*> (storage);
}

This use of reinterpret_cast is clearly wrong and is UB; C++17 added std::launder for exactly this situation

Also, ConstructorInvoker runs only once; using an init_flag to initialize when needed is a better fit

template <typename>
static Value& get_internal(){
    alignas (Value) static char storage[sizeof(Value)];
    static bool needs_init = true;
    if (needs_init){
        init(key,storage,needs_init); needs_init=false;
    }
    return *std::launder(reinterpret_cast<Value*> (storage));
}

Going one step further, __builtin_expect can be added as a branch hint to speed things up

if (__builtin_expect(needs_init, false))
    ...

How to write the init function

placement new + std::move. The stored pointer is a unique_ptr. Note that the array has to be kept because placement new runs on it multiple times, so a custom deleter is specified that only destroys the object and does not free the memory. After destruction, init_flag has to be reset so that the next access does placement new again https://github.com/hogliux/semimap/blob/de556c74721a5017f5a03faf2fbd1c6e5a768a32/semimap.h#L198

The rest discusses getting past the limits of static and benchmarks of various maps. semimap can be thought of as a statically enhanced unordered_map

reference

https://github.com/CppCon/CppCon2018/tree/master/Presentations/a_semi_compileruntime_map_with_nearly_zero_overhead_lookup
https://github.com/hogliux/semimap
https://github.com/CppCon/CppCon2017/tree/master/Presentations/constexpr%20ALL%20the%20things
1. The code is here: https://github.com/lefticus/constexpr_all_the_things
For brevity many enable_if uses are omitted; see the source code in reference link 2


a little order delving into the stl sorting algorithms

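02 Apr 2019

The talk compares the speed of std::sort, std::partial_sort, and std::nth_element

Straight to the conclusions. The slides are long (90 pages) and also introduce some benchmark tools and websites

std::sort O(N*log(N))

std::partial_sort O(N*log(K)); close to O(N) when K is small, at worst on par with std::sort

std::nth_element + sort O(N+k*log(k)); close to O(N) when k is small, at worst on par with std::sort

Sorting a subset

Setup: 1,000,000 elements, plotted against the size of the sorted subset

With small subsets std::partial_sort does very well

Containers

Setup: sort 100 elements, using containers of different sizes

As above, std::partial_sort does very well

Combining the two scenarios

Setup: the container size varies, sorting N/5 elements

Likewise, std::partial_sort wins by a wide margin. Be clear about which scenario you are in

Conclusion: to sort a subset, prefer std::partial_sort, then std::nth_element + std::sort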
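The two ways of sorting a subset look like this in code. The function names are mine; both return the k smallest elements in ascending order, so they can be swapped in a benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// k smallest elements, sorted, via std::partial_sort.
std::vector<int> smallest_k_partial(std::vector<int> v, std::size_t k) {
    k = std::min(k, v.size());
    std::partial_sort(v.begin(), v.begin() + k, v.end());
    v.resize(k);
    return v;
}

// Same result via std::nth_element followed by std::sort on the prefix.
std::vector<int> smallest_k_nth(std::vector<int> v, std::size_t k) {
    k = std::min(k, v.size());
    if (k == 0) return {};
    // After this call, every element before position k-1 is <= v[k-1].
    std::nth_element(v.begin(), v.begin() + (k - 1), v.end());
    std::sort(v.begin(), v.begin() + k);
    v.resize(k);
    return v;
}
```

Which one is faster depends on N and k, as the charts in the slides show, so it is worth measuring both on your own data sizes.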

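The reasons behind this

How std::sort is implemented; for the source see reference link 2

template<typename _RandomAccessIterator, typename _Compare>
    inline void
    __sort(_RandomAccessIterator __first, _RandomAccessIterator __last,
	   _Compare __comp)
    {
      if (__first != __last)
	{
	  std::__introsort_loop(__first, __last,
				std::__lg(__last - __first) * 2,
				__comp);
	  std::__final_insertion_sort(__first, __last, __comp);
	}
    }

Mainly introsort and insertion sort

introsort combines quicksort and heapsort. quicksort degrades to O(N²) in bad cases; heapsort has a dependable O(N*log(N)) bound but does redundant work in cases that could be optimized. introsort therefore combines the two: it recurses with quicksort for up to 2*log(N) levels and, if sorting has not finished by then, switches to heapsort, for O(N*log(N)) overall

Summarizing the analyses referenced below (this is an implementation of a paper)

For large inputs it uses ordinary quicksort, with O(N logN) efficiency.
Once a partition is smaller than some threshold it switches to insertion sort, because the partition is nearly sorted by then and insertion sort can reach O(N).
During recursion, if the depth grows too large and partitioning is trending badly, it detects this automatically and falls back to heapsort, keeping the cost at heapsort's O(N logN), which is still better than using heapsort from the start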

std::nth_element: see reference link 3

template<typename _RandomAccessIterator, typename _Compare>
    inline void
    nth_element(_RandomAccessIterator __first, _RandomAccessIterator __nth,
                _RandomAccessIterator __last, _Compare __comp)
{
    // concept requirements...
    if (__first == __last || __nth == __last) return;
    std::__introselect(__first, __nth, __last,
                       std::__lg(__last - __first) * 2,
                       __gnu_cxx::__ops::__iter_comp_iter(__comp));
}

Like sort, introselect is implemented as quickselect + heapselect

quickselect has to pick a pivot and otherwise works like quicksort, except that it stops once it reaches nth, so it converges faster

heapselect is a build-a-heap-and-select process with complexity O(N*log(k))

std::partial_sort: heap_select + heap sort

  template<typename _RandomAccessIterator, typename _Compare>
    inline void
    __partial_sort(_RandomAccessIterator __first,
		   _RandomAccessIterator __middle,
		   _RandomAccessIterator __last,
		   _Compare __comp)
    {
      std::__heap_select(__first, __middle, __last, __comp);
      std::__sort_heap(__first, __middle, __comp);
    }

Why is heapsort faster than introsort here? Mainly because of heap_select
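What heap_select does can be written with the public heap algorithms. This is a sketch of the idea only, not the libstdc++ source; the `_sketch` names are mine, and it uses `operator<` rather than a custom comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Leaves the k = middle - first smallest elements in [first, middle),
// arranged as a max-heap. Needs random-access iterators.
template <class It>
void heap_select_sketch(It first, It middle, It last) {
    if (first == middle) return;
    std::make_heap(first, middle);           // max-heap over the first k elements
    for (It it = middle; it != last; ++it) {
        if (*it < *first) {                  // beats the largest of the current k
            std::pop_heap(first, middle);    // largest moves to middle - 1
            std::iter_swap(middle - 1, it);  // swap it out for the candidate
            std::push_heap(first, middle);   // restore the heap property
        }
    }
}

template <class It>
void partial_sort_sketch(It first, It middle, It last) {
    heap_select_sketch(first, middle, last);
    std::sort_heap(first, middle);           // ascending order in [first, middle)
}
```

Most elements outside the heap cost a single comparison against the heap top. Only elements that actually belong among the k smallest pay the O(log k) heap update, so for small k the pass over the data is close to linear.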


reference

https://github.com/CppCon/CppCon2018/tree/master/Presentations/a_little_order_delving_into_the_stl_sorting_algorithms
std::sort https://github.com/gcc-mirror/gcc/blob/3f7d0abcd22f9a797ea496688cbda746466f0f54/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1952
std::nth_element https://github.com/gcc-mirror/gcc/blob/3f7d0abcd22f9a797ea496688cbda746466f0f54/libstdc%2B%2B-v3/include/bits/stl_algo.h#L4772
std::partial_sort https://github.com/gcc-mirror/gcc/blob/e352c93463fe598ace13d8a017c7c86e535f1065/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1917
These std::sort analyses are well written
1. https://liam.page/2018/09/18/std-sort-in-STL/
2. http://feihu.me/blog/2014/sgi-std-sort/
3. The LLVM implementation and its optimizations seem to be somewhat different https://blog.0xbbc.com/2017/01/analysis-of-std-sort-function/


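Pitfalls when adapting sysbench-tpcc to mongo

31 Mar 2019

First install mongodb 4.0, download link https://www.mongodb.com/download-center/community . I am on CentOS: download the x64 package and install it with rpm

My test machine is a bit small, so I planned to switch to an external disk and change the db directory. That is where all the trouble began.

What I did was edit /etc/mongod.conf

 +dbPath: /home/vdb/mongo/data
 -dbPath: /var/lib/mongo

It failed to start with an error

Then I tried editing /usr/lib/systemd/system/mongod.service

+Environment="OPTIONS= --dbpath /home/vdb/mongo/data -f /etc/mongod.conf"
-Environment="OPTIONS= -f /etc/mongod.conf"

Still an error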

* mongod.service - MongoDB Database Server
   Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-04-03 17:32:38 CST; 14min ago
     Docs: https://docs.mongodb.org/manual
  Process: 24223 ExecStart=/usr/bin/mongod $OPTIONS (code=exited, status=100)
  Process: 24221 ExecStartPre=/usr/bin/chmod 0755 /var/run/mongodb (code=exited, status=0/SUCCESS)
  Process: 24217 ExecStartPre=/usr/bin/chown mongod:mongod /var/run/mongodb (code=exited, status=0/SUCCESS)
  Process: 24214 ExecStartPre=/usr/bin/mkdir -p /var/run/mongodb (code=exited, status=0/SUCCESS)

 Main PID: 21976 (code=exited, status=0/SUCCESS)

Apr 03 17:32:38 host-192-168-1-112 systemd[1]: Starting MongoDB Database Server...
Apr 03 17:32:38 host-192-168-1-112 mongod[24223]: about to fork child process, waiting unti...s.

Apr 03 17:32:38 host-192-168-1-112 mongod[24223]: forked process: 24226
Apr 03 17:32:38 host-192-168-1-112 mongod[24223]: ERROR: child process failed, exited with ...00

Apr 03 17:32:38 host-192-168-1-112 mongod[24223]: To see additional information in this out...n.

Apr 03 17:32:38 host-192-168-1-112 systemd[1]: mongod.service: control process exited, code...00

Apr 03 17:32:38 host-192-168-1-112 systemd[1]: Failed to start MongoDB Database Server.
Apr 03 17:32:38 host-192-168-1-112 systemd[1]: Unit mongod.service entered failed state.
Apr 03 17:32:38 host-192-168-1-112 systemd[1]: mongod.service failed.

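I googled for a long time and found many proposed solutions, for example

https://stackoverflow.com/questions/5961145/changing-mongodb-data-store-directory/5961293

https://stackoverflow.com/questions/21448268/how-to-set-mongod-dbpath

https://stackoverflow.com/questions/40829306/mongodb-cant-start-centos-7

None of them worked. I really do not understand systemd

In the end I just ran it directly

mongod --dbpath /home/vdb/mongo/data -f /etc/mongod.conf

That did it; this trivial thing cost me half a day. Afterwards I found https://ruby-china.org/topics/35268 , which looks like the proper fix. I did not verify it.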

reference

sysbench repo and build https://github.com/akopytov/sysbench#building-and-installing-from-source
https://github.com/Percona-Lab/sysbench-tpcc
1. https://www.percona.com/blog/2018/03/05/tpcc-like-workload-sysbench-1-0/
An example of benchmarking pg https://www.percona.com/blog/2018/06/15/tuning-postgresql-for-sysbench-tpcc/

That link also lists some of Mark's test repos. Mark Callaghan is seriously impressive.
https://help.ubuntu.com/stable/serverguide/postgresql.html
Introduction to sysbench parameters https://wing324.github.io/2017/02/07/sysbench%E5%8F%82%E6%95%B0%E8%AF%A6%E8%A7%A3/
A sysbench-oltp lua script. It could be modified to add mongodb, but the sysbench mongo driver would also have to be merged in, which is a big job https://github.com/Percona-Lab/sysbench-mongodb-lua
https://github.com/Percona-Lab/sysbench/tree/dev-mongodb-support-1.0
https://www.percona.com/blog/2016/05/13/benchmark-mongodb-sysbench/
How high is high for iowait? https://serverfault.com/questions/722804/what-percentage-of-iowait-is-considered-to-be-high
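Using and testing sysbench

29 Mar 2019

why

Learn sysbench and sysbench-tpcc and run some tests.

First get sysbench¹. I downloaded the source and uploaded it to the server, so there may be some build problems.

cd sysbench
chmod 777 *
chmod 777 third_party/concurrency_kit/ck/*
chmod 777 third_party/cram/*
chmod 777 third_party/luajit/luajit/*
./autogen.sh
./configure --with-pgsql
make -j4
make install

Problems encountered

autogen fails with configure.ac:61: error: possibly undefined macro: AC_PROG_LIBTOOL; installing libtool with yum install libtool fixes it
configure complains that mysql-devel and postgre-devel are missing; install them as prompted
make reports that building ck failed and that luajit was not built; check the file permissions.

Get sysbench-tpcc²

Try it against postgresql³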

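sudo yum install postgresql-server postgresql-contrib
# change the data directory if needed
#vim /usr/lib/systemd/system/postgresql.service
postgresql-setup initdb
systemctl start postgresql

A new account and a new database are needed

sudo -u postgres psql postgres  # log in
create user sb with password 'w';  -- sysbench user; note the trailing semicolon
create database sbtest owner sb;  -- create the test database

$ ./tpcc.lua --pgsql-user=postgres --pgsql-db=sbtest --time=120 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --db-driver=pgsql prepare

~~It complained that --trx_level=RC does not exist? I removed this option. Note that a password is also required // is this a transaction setting?~~

 ./tpcc.lua --pgsql-user=sb --pgsql-password='w' --pgsql-db=sbtest --time=120 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0 --trx_level=RC --db-driver=pgsql prepare

It will also report Ident authentication failed for user “…”

See the stackoverflow answer for a fix, or edit pg_hba.conf directly (the file is inside the data directory) and change every ident authentication entry to md5. Note that this is for testing only; know what you are doing.⁴ Remember to restart pg

prepare takes quite a while. I thought it had hung, but pstack did not look like a hang and top showed it still consuming resources; it had already been running for an hour.

After it finishes, run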

postgres=# select datname, pg_size_pretty(pg_database_size(datname)) as "DB_Size" from pg_stat_database where datname = 'sbtest';
 datname | DB_Size
---------+---------
 sbtest  | 119 GB
(1 row)

Following the procedure, run vacuumdb next

 vacuumdb --username=sb --password -d sbtest -z

A conservative estimate is that it stalls for half an hour

Run the tpcc test

./tpcc.lua --pgsql-user=sb --pgsql-db=sbtest --time=36000 --threads=56 --report-interval=1 --tables=10 --scale=100 --use_fk=0  --trx_level=RC --pgsql-password='w' --db-driver=pgsql run

The output looks like this

[ 22s ] thds: 56 tps: 46.00 qps: 1106.96 (r/w/o: 497.98/516.98/92.00) lat (ms,95%): 2985.89 err/s 0.00 reconn/s: 0.00
[ 23s ] thds: 56 tps: 45.00 qps: 1249.00 (r/w/o: 565.00/594.00/90.00) lat (ms,95%): 6026.41 err/s 0.00 reconn/s: 0.00
[ 24s ] thds: 56 tps: 41.00 qps: 1036.01 (r/w/o: 478.00/476.00/82.00) lat (ms,95%): 3982.86 err/s 0.00 reconn/s: 0.00
[ 25s ] thds: 56 tps: 49.00 qps: 1410.03 (r/w/o: 638.01/674.01/98.00) lat (ms,95%): 2985.89 err/s 1.00 reconn/s: 0.00

I captured iostat: everything is stuck in iowait

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.00    0.75   95.51    0.00    2.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vdb               0.00     0.00  128.00 2033.00  4152.00 17348.00    19.90   119.59   55.92  169.77   48.75   0.46 100.00


I fiddled with the blog comment system for ages and finally gave up. I could not get it working.

reference

sysbench repo and build https://github.com/akopytov/sysbench#building-and-installing-from-source
https://github.com/Percona-Lab/sysbench-tpcc
1. https://www.percona.com/blog/2018/03/05/tpcc-like-workload-sysbench-1-0/
An example of benchmarking pg https://www.percona.com/blog/2018/06/15/tuning-postgresql-for-sysbench-tpcc/

That link also lists some of Mark's test repos. Mark Callaghan is seriously impressive.
https://help.ubuntu.com/stable/serverguide/postgresql.html
Introduction to sysbench parameters https://wing324.github.io/2017/02/07/sysbench%E5%8F%82%E6%95%B0%E8%AF%A6%E8%A7%A3/
A sysbench-oltp lua script. It could be modified to add mongodb, but the sysbench mongo driver would also have to be merged in, which is a big job https://github.com/Percona-Lab/sysbench-mongodb-lua
https://github.com/Percona-Lab/sysbench/tree/dev-mongodb-support-1.0
https://www.percona.com/blog/2016/05/13/benchmark-mongodb-sysbench/
How high is high for iowait? https://serverfault.com/questions/722804/what-percentage-of-iowait-is-considered-to-be-high

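Benchmarking rocksdb with db_bench

26 Mar 2019

For a work scenario I need to benchmark rocksdb transaction performance

Conveniently there is a benchmark² at https://github.com/facebook/rocksdb/issues/4402 that reports results. I plan to run the two scripts described in the issue. For a way around gist being blocked, see reference link¹

Run the script. Be careful with the rdb directory setting: the script contains an rm of that directory

Test environment: 4 cores, Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz

bash r.sh 10000000 60 32 4 ./rdb 0 ./db_bench

Results of the first script; my own results came out completely different

These are Mark's results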

test server:
* database on /dev/shm
* 2 sockets, 24 CPU cores, 48 HW threads

legend:
* #thr - number of threads
* trx=n - no transaction
* trx=p - pessimistic transaction
* trx=o - optimistic transaction
* numbers are inserts per second

--- batch_size=1

- concurrent memtable disabled
#thr    trx=n   trx=p   trx=o
1       153571  113228  101439
2       193070  182455  137708
4       167229  182313  94811
8       250508  228031  93401
12      274251  250595  92256
16      272554  266545  93403
20      281737  276026  76885
24      287475  277981  70004
28      293445  284644  48552
32      299366  288134  43672
36      303224  292887  43047
40      304027  292000  43195
44      311686  299963  44173
48      317418  308563  48482

- concurrent memtable enabled
#thr    trx=n   trx=p   trx=o
1       152156  110235  101901
2       164778  161547  130980
4       228060  193945  116742
8       335001  311307  114802
12      401206  379568  100576
16      445484  419819  72979
20      465297  435283  45554
24      472754  451805  40381
28      490107  456741  40108
32      482851  467469  40179
36      487332  473892  39866
40      485026  457858  43587
44      481420  442169  42293
48      423738  427396  40346

--- batch_size=4

- concurrent memtable disabled
#thr    trx=n   trx=p   trx=o
1       37838   28709   19807
2       62955   48829   30995
4       84903   72286   31754
8       95389   91310   25169
12      95297   97581   18739
16      92296   91696   17574
20      94451   91210   17319
24      91072   89522   16920
28      91429   91015   17170
32      92991   90158   17424
36      92823   89044   17332
40      91854   88994   17099
44      91766   88434   16909
48      91335   89298   16720

- concurrent memtable enabled
#thr    trx=n   trx=p   trx=o
1       38368   28374   19783
2       63711   48045   31141
4       99853   81364   35032
8       163958  134011  28212
12      211083  175932  18142
16      243147  207610  17281
20      254355  224073  16908
24      275674  238600  16875
28      286050  247888  17215
32      281926  252813  17657
36      274349  249263  16830
40      275749  241185  16726
44      266127  234881  16506
48      267183  235147  16760

-- test script

numk=$1
totw=$2
val=$3
batch=$4
dbdir=$5
sync=$6

# sync, dbdir, concurmt, secs, dop

function runme {
  a_concurmt=$1
  a_dop=$2
  a_extra=$3

  thrw=$(( $totw / $a_dop ))
  echo $a_dop threads, $thrw writes per thread
  rm -rf $dbdir; mkdir $dbdir
  # TODO --perf_level=0

echo ./db_bench --benchmarks=randomtransaction --use_existing_db=0 --sync=$sync --db=$dbdir --wal_dir=$dbdir --disable_data_sync=0 --num=$numk --writes=$thrw --num_levels=6 --key_size=8 --value_size=$val --block_size=4096 --cache_size=$(( 20 * 1024 * 1024 * 1024 )) --cache_numshardbits=6 --compression_type=snappy --min_level_to_compress=3 --compression_ratio=0.5 --level_compaction_dynamic_level_bytes=true --bytes_per_sync=8388608 --cache_index_and_filter_blocks=0 --benchmark_write_rate_limit=0 --hard_rate_limit=3 --rate_limit_delay_max_milliseconds=1000000 --write_buffer_size=134217728 --max_write_buffer_number=16 --target_file_size_base=33554432 --max_bytes_for_level_base=536870912 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_grandparent_overlap_factor=8 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=60 --histogram=1 --allow_concurrent_memtable_write=$a_concurmt --enable_write_thread_adaptive_yield=$a_concurmt --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=12 --level0_stop_writes_trigger=20 --max_background_compactions=16 --max_background_flushes=7 --threads=$a_dop --merge_operator="put" --seed=1454699926 --transaction_sets=$batch $a_extra


./db_bench --benchmarks=randomtransaction --use_existing_db=0 --sync=$sync --db=$dbdir --wal_dir=$dbdir --disable_data_sync=0 --num=$numk --writes=$thrw --num_levels=6 --key_size=8 --value_size=$val --block_size=4096 --cache_size=$(( 20 * 1024 * 1024 * 1024 )) --cache_numshardbits=6 --compression_type=snappy --min_level_to_compress=3 --compression_ratio=0.5 --level_compaction_dynamic_level_bytes=true --bytes_per_sync=8388608 --cache_index_and_filter_blocks=0 --benchmark_write_rate_limit=0 --hard_rate_limit=3 --rate_limit_delay_max_milliseconds=1000000 --write_buffer_size=134217728 --max_write_buffer_number=16 --target_file_size_base=33554432 --max_bytes_for_level_base=536870912 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_grandparent_overlap_factor=8 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=60 --histogram=1 --allow_concurrent_memtable_write=$a_concurmt --enable_write_thread_adaptive_yield=$a_concurmt --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=12 --level0_stop_writes_trigger=20 --max_background_compactions=16 --max_background_flushes=7 --threads=$a_dop --merge_operator="put" --seed=1454699926 --transaction_sets=$batch $a_extra
}


for dop in 1 2 4 8 12 16 20 24 28 32 36 40 44 48 ; do
# for dop in 1 24 ; do
for concurmt in 0 1 ; do

fn=o.dop${dop}.val${val}.batch${batch}.concur${concurmt}.notrx
runme $concurmt $dop "" >& $fn
q1=$( grep ^randomtransaction $fn | awk '{ print $5 }' )

t=transaction_db
fn=o.dop${dop}.val${val}.batch${batch}.concur${concurmt}.pessim
runme $concurmt $dop --${t}=1 >& $fn
q2=$( grep ^randomtransaction $fn | awk '{ print $5 }' )

t=optimistic_transaction_db
fn=o.dop${dop}.val${val}.batch${batch}.concur${concurmt}.optim
runme $concurmt $dop --${t}=1 >& $fn
q3=$( grep ^randomtransaction $fn | awk '{ print $5 }' )

echo $dop mt${concurmt} $q1 $q2 $q3 | awk '{ printf "%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5 }'

done
done

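These are my results. It finished very quickly (I found that a bit odd but did not dig in; it seems to stop after a fixed number of operations)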

thr     mt0/mt1 trx=n   trx=p   trx=o
     mt0     78534   24291   43891
     mt1     81411   38734   54249
     mt0     104529  49916   75000
     mt1     101522  57747   76335
     mt0     88365   60000   49916
     mt1     121212  72115   18850
     mt0     77455   45714   47538
     mt1     36577   57377   22810
    mt0     72551   46367   47318
    mt1     29761   14587   65359
    mt0     64343   39376   47151
    mt1     10551   38095   19448
    mt0     69284   36057   45045
    mt1     11947   45731   61037
    mt0     63576   30573   42933
    mt1     13655   37765   52401
    mt0     58947   32520   43043
    mt1     6090    8342    17598
    mt0     50632   25827   30563
    mt1     7158    16469   18223
    mt0     44831   25069   33210
    mt1     18172   10395   34090
    mt0     43572   33613   27797
    mt1     11500   30721   15612
    mt0     50285   27865   26862
    mt1     7251    10661   25821
    mt0     43282   25668   32388
    mt1     19223   25751   14239

The numbers are clearly abnormal. I reran it many times and saw the same thing every time; sometimes it also stalled or hung

Second script

numk=$1
secs=$2
val=$3
batch=$4
dbdir=$5
sync=$6
dbb=$7

# sync, dbdir, concurmt, secs, dop

function runme {
  a_concurmt=$1
  a_dop=$2
  a_extra=$3

  rm -rf $dbdir; mkdir $dbdir
  # TODO --perf_level=0

$dbb --benchmarks=randomtransaction --use_existing_db=0 --sync=$sync --db=$dbdir --wal_dir=$dbdir --num=$numk --duration=$secs --num_levels=6 --key_size=8 --value_size=$val --block_size=4096 --cache_size=$(( 20 * 1024 * 1024 * 1024 )) --cache_numshardbits=6 --compression_type=none --compression_ratio=0.5 --level_compaction_dynamic_level_bytes=true --bytes_per_sync=8388608 --cache_index_and_filter_blocks=0 --benchmark_write_rate_limit=0 --write_buffer_size=$(( 64 * 1024 * 1024 )) --max_write_buffer_number=4 --target_file_size_base=$(( 32 * 1024 * 1024 )) --max_bytes_for_level_base=$(( 512 * 1024 * 1024 )) --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=60 --histogram=1 --allow_concurrent_memtable_write=$a_concurmt --enable_write_thread_adaptive_yield=$a_concurmt --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_background_flushes=2 --threads=$a_dop --merge_operator="put" --seed=1454699926 --transaction_sets=$batch --compaction_pri=3 $a_extra
}

for dop in 1 2 4 8 16 24 32 40 48 ; do
for concurmt in 0 1 ; do

# Plain writes, no transaction layer
fn=o.dop${dop}.val${val}.batch${batch}.concur${concurmt}.notrx
runme $concurmt $dop "" > "$fn" 2>&1
q1=$( grep ^randomtransaction "$fn" | awk '{ print $5 }' )

# Pessimistic transactions (TransactionDB)
t=transaction_db
fn=o.dop${dop}.val${val}.batch${batch}.concur${concurmt}.pessim
runme $concurmt $dop --${t}=1 > "$fn" 2>&1
q2=$( grep ^randomtransaction "$fn" | awk '{ print $5 }' )

# Optimistic transactions (OptimisticTransactionDB)
t=optimistic_transaction_db
fn=o.dop${dop}.val${val}.batch${batch}.concur${concurmt}.optim
runme $concurmt $dop --${t}=1 > "$fn" 2>&1
q3=$( grep ^randomtransaction "$fn" | awk '{ print $5 }' )

# One row per (dop, concurmt): dop, mt flag, then the three throughput figures
echo $dop mt${concurmt} $q1 $q2 $q3 | awk '{ printf "%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $5 }'

done
done
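The grep/awk step in the loop can be tried on its own before committing to a long sweep. The sketch below is only an illustration: `extract_ops` is a hypothetical helper name, and the sample line is made up to mimic the `name : X micros/op Y ops/sec` layout that the script assumes, so check it against the output of your own db_bench build.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 5th whitespace-separated field (ops/sec)
# of every line in the log that starts with "randomtransaction".
extract_ops() {
  awk '/^randomtransaction/ { print $5 }' "$1"
}

log=$(mktemp)
# Made-up sample line for illustration, not real benchmark output.
printf 'randomtransaction : 25.595 micros/op 39070 ops/sec; ( transactions:100 aborts:0)\n' > "$log"

ops=$(extract_ops "$log")
echo "ops/sec: $ops"
rm -f "$log"
```

If the field position differs in your version, the script silently reports the wrong column, so a one-line check like this is cheap insurance.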

Once the run gets to the transaction_db case, it hangs as soon as the thread count reaches 4 or more. The first few rows:

    dop  mt   notrx   pessim  optim
    1    mt0  61676   35118   43794
    1    mt1  60019   35307   44344
    2    mt0  98688   55459   70069
    2    mt1  103991  59430   75082

Run the following command⁴ to dump the stack of every thread:

gdb -ex "set pagination 0" -ex "thread apply all bt" \
  --batch -p $(pidof db_bench)

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Thread 5 (Thread 0x7fa46b5c3700 (LWP 14215)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000067e47c in std::condition_variable::wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda4> (__p=..., __lock=..., this=0x7fa46b5c1e90) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x2b5cf30, w=w@entry=0x7fa46b5c1df0, goal_mask=goal_mask@entry=30 '\036') at db/write_thread.cc:45
#4  0x000000000067e590 in rocksdb::WriteThread::AwaitState (this=this@entry=0x2b5cf30, w=w@entry=0x7fa46b5c1df0, goal_mask=goal_mask@entry=30 '\036', ctx=ctx@entry=0xae62b0 <rocksdb::jbg_ctx>) at db/write_thread.cc:181
#5  0x000000000067ea23 in rocksdb::WriteThread::JoinBatchGroup (this=this@entry=0x2b5cf30, w=w@entry=0x7fa46b5c1df0) at db/write_thread.cc:323
#6  0x00000000005fba9b in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x2b5c800, write_options=..., my_batch=my_batch@entry=0x7fa46b5c2630, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at db/db_impl_write.cc:418
#7  0x00000000005fe092 in rocksdb::DBImpl::WriteImpl (this=0x2b5c800, write_options=..., my_batch=my_batch@entry=0x7fa46b5c2630, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at db/db_impl_write.cc:109
#8  0x00000000007d82fb in rocksdb::WriteCommittedTxn::RollbackInternal (this=0x2b6b9f0) at utilities/transactions/pessimistic_transaction.cc:367
#9  0x00000000007d568a in rocksdb::PessimisticTransaction::Rollback (this=0x2b6b9f0) at utilities/transactions/pessimistic_transaction.cc:341
#10 0x00000000007449ca in rocksdb::RandomTransactionInserter::DoInsert (this=this@entry=0x7fa46b5c2ad0, db=db@entry=0x0, txn=<optimized out>, is_optimistic=is_optimistic@entry=false) at util/transaction_test_util.cc:191
#11 0x0000000000744fd9 in rocksdb::RandomTransactionInserter::TransactionDBInsert (this=this@entry=0x7fa46b5c2ad0, db=<optimized out>, txn_options=...) at util/transaction_test_util.cc:55
#12 0x0000000000561c5a in rocksdb::Benchmark::RandomTransaction (this=0x7ffd2127ed30, thread=0x2b95680) at tools/db_bench_tool.cc:5058
#13 0x0000000000559b59 in rocksdb::Benchmark::ThreadBody (v=0x2b5dba8) at tools/db_bench_tool.cc:2687
#14 0x00000000006914c2 in rocksdb::(anonymous namespace)::StartThreadWrapper (arg=0x2b85350) at env/env_posix.cc:994
#15 0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fa473ecb73d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fa472dd2700 (LWP 14197)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000073ffd4 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=this@entry=0x2890760, thread_id=thread_id@entry=1) at util/threadpool_imp.cc:196
#3  0x000000000074038f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x2b603f0) at util/threadpool_imp.cc:303
#4  0x00007fa4747631e0 in ?? () from /lib64/libstdc++.so.6
#5  0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fa473ecb73d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fa4735d3700 (LWP 14196)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000073ffd4 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=this@entry=0x2890760, thread_id=thread_id@entry=0) at util/threadpool_imp.cc:196
#3  0x000000000074038f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x2b603d0) at util/threadpool_imp.cc:303
#4  0x00007fa4747631e0 in ?? () from /lib64/libstdc++.so.6
#5  0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fa473ecb73d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fa473dd4700 (LWP 14195)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000073ffd4 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=this@entry=0x2890420, thread_id=thread_id@entry=0) at util/threadpool_imp.cc:196
#3  0x000000000074038f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x2b600c0) at util/threadpool_imp.cc:303
#4  0x00007fa4747631e0 in ?? () from /lib64/libstdc++.so.6
#5  0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fa473ecb73d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa475a9fa40 (LWP 14194)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000006d870d in rocksdb::port::CondVar::Wait (this=this@entry=0x7ffd2127e578) at port/port_posix.cc:91
#2  0x000000000055c969 in rocksdb::Benchmark::RunBenchmark (this=this@entry=0x7ffd2127ed30, n=n@entry=4, name=..., method=(void (rocksdb::Benchmark::*)(rocksdb::Benchmark * const, rocksdb::ThreadState *)) 0x561ab0 <rocksdb::Benchmark::RandomTransaction(rocksdb::ThreadState*)>) at tools/db_bench_tool.cc:2759
#3  0x000000000056d9d7 in rocksdb::Benchmark::Run (this=this@entry=0x7ffd2127ed30) at tools/db_bench_tool.cc:2638
#4  0x000000000054d481 in rocksdb::db_bench_tool (argc=1, argv=0x7ffd2127f4c8) at tools/db_bench_tool.cc:5472
#5  0x00007fa473df6bb5 in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000054c201 in _start ()

Every thread is parked in a wait, so this is most likely a deadlock: the other writer threads are awaiting the leader writer thread.
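With more threads, a dump like the one above gets long, and it helps to collapse it in the style of the poor man's profiler. The following is a minimal sketch and not the script from that site: `summarize_stacks` is a hypothetical name, it keeps only the top few frames of each thread, and it takes the function name from field 4 (or field 2 for frames printed without an address), which truncates names that contain spaces.

```shell
#!/usr/bin/env bash
# Collapse `thread apply all bt` output to one line per thread (top frames
# only) and count how many threads share each stack.
summarize_stacks() {
  awk -v depth="${1:-3}" '
    /^Thread /  { if (s != "") print s; s = ""; n = 0 }
    /^#[0-9]+ / {
      fn = ($3 == "in") ? $4 : $2   # frames without an address carry the name in field 2
      if (n < depth) { s = (s == "") ? fn : s "," fn; n++ }
    }
    END { if (s != "") print s }
  ' | sort | uniq -c | sort -rn
}

# Shortened excerpt of the trace above (arguments elided).
sample=$(cat <<'EOF'
Thread 3 (Thread 0x7fa4735d3700 (LWP 14196)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
Thread 2 (Thread 0x7fa473dd4700 (LWP 14195)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
Thread 1 (Thread 0x7fa475a9fa40 (LWP 14194)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000006d870d in rocksdb::port::CondVar::Wait (this=...) at port/port_posix.cc:91
EOF
)

summary=$(printf '%s\n' "$sample" | summarize_stacks 2)
printf '%s\n' "$summary"
```

Against a live process the same function can be fed directly from gdb, for example `gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof db_bench) | summarize_stacks 5`.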

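At the time I did not suspect db_bench itself. I simply assumed the run had stalled; after all, the first script had worked, so I blamed the machine, since mack ran his tests in the issue on a 32-core box. I therefore found a 32-core machine (Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz) and reran the second script.

The rerun still hung, so I captured a pstack: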

pstack 14194
Thread 5 (Thread 0x7fa46b5c3700 (LWP 14215)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000067e47c in std::condition_variable::wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::__lambda4> (__p=..., __lock=..., this=0x7fa46b5c1e90) at /usr/include/c++/4.8.2/condition_variable:93
#3  rocksdb::WriteThread::BlockingAwaitState (this=this@entry=0x2b5cf30, w=w@entry=0x7fa46b5c1df0, goal_mask=goal_mask@entry=30 '\036') at db/write_thread.cc:45
#4  0x000000000067e590 in rocksdb::WriteThread::AwaitState (this=this@entry=0x2b5cf30, w=w@entry=0x7fa46b5c1df0, goal_mask=goal_mask@entry=30 '\036', ctx=ctx@entry=0xae62b0 <rocksdb::jbg_ctx>) at db/write_thread.cc:181
#5  0x000000000067ea23 in rocksdb::WriteThread::JoinBatchGroup (this=this@entry=0x2b5cf30, w=w@entry=0x7fa46b5c1df0) at db/write_thread.cc:323
#6  0x00000000005fba9b in rocksdb::DBImpl::PipelinedWriteImpl (this=this@entry=0x2b5c800, write_options=..., my_batch=my_batch@entry=0x7fa46b5c2630, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0) at db/db_impl_write.cc:418
#7  0x00000000005fe092 in rocksdb::DBImpl::WriteImpl (this=0x2b5c800, write_options=..., my_batch=my_batch@entry=0x7fa46b5c2630, callback=callback@entry=0x0, log_used=log_used@entry=0x0, log_ref=log_ref@entry=0, disable_memtable=disable_memtable@entry=false, seq_used=seq_used@entry=0x0, batch_cnt=batch_cnt@entry=0, pre_release_callback=pre_release_callback@entry=0x0) at db/db_impl_write.cc:109
#8  0x00000000007d82fb in rocksdb::WriteCommittedTxn::RollbackInternal (this=0x2b6b9f0) at utilities/transactions/pessimistic_transaction.cc:367
#9  0x00000000007d568a in rocksdb::PessimisticTransaction::Rollback (this=0x2b6b9f0) at utilities/transactions/pessimistic_transaction.cc:341
#10 0x00000000007449ca in rocksdb::RandomTransactionInserter::DoInsert (this=this@entry=0x7fa46b5c2ad0, db=db@entry=0x0, txn=<optimized out>, is_optimistic=is_optimistic@entry=false) at util/transaction_test_util.cc:191
#11 0x0000000000744fd9 in rocksdb::RandomTransactionInserter::TransactionDBInsert (this=this@entry=0x7fa46b5c2ad0, db=<optimized out>, txn_options=...) at util/transaction_test_util.cc:55
#12 0x0000000000561c5a in rocksdb::Benchmark::RandomTransaction (this=0x7ffd2127ed30, thread=0x2b95680) at tools/db_bench_tool.cc:5058
#13 0x0000000000559b59 in rocksdb::Benchmark::ThreadBody (v=0x2b5dba8) at tools/db_bench_tool.cc:2687
#14 0x00000000006914c2 in rocksdb::(anonymous namespace)::StartThreadWrapper (arg=0x2b85350) at env/env_posix.cc:994
#15 0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fa473ecb73d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fa472dd2700 (LWP 14197)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000073ffd4 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=this@entry=0x2890760, thread_id=thread_id@entry=1) at util/threadpool_imp.cc:196
#3  0x000000000074038f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x2b603f0) at util/threadpool_imp.cc:303
#4  0x00007fa4747631e0 in ?? () from /lib64/libstdc++.so.6
#5  0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fa473ecb73d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fa4735d3700 (LWP 14196)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000073ffd4 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=this@entry=0x2890760, thread_id=thread_id@entry=0) at util/threadpool_imp.cc:196
#3  0x000000000074038f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x2b603d0) at util/threadpool_imp.cc:303
#4  0x00007fa4747631e0 in ?? () from /lib64/libstdc++.so.6
#5  0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fa473ecb73d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fa473dd4700 (LWP 14195)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa47475f9ac in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000073ffd4 in rocksdb::ThreadPoolImpl::Impl::BGThread (this=this@entry=0x2890420, thread_id=thread_id@entry=0) at util/threadpool_imp.cc:196
#3  0x000000000074038f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x2b600c0) at util/threadpool_imp.cc:303
#4  0x00007fa4747631e0 in ?? () from /lib64/libstdc++.so.6
#5  0x00007fa475689dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fa473ecb73d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fa475a9fa40 (LWP 14194)):
#0  0x00007fa47568d6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000006d870d in rocksdb::port::CondVar::Wait (this=this@entry=0x7ffd2127e578) at port/port_posix.cc:91
#2  0x000000000055c969 in rocksdb::Benchmark::RunBenchmark (this=this@entry=0x7ffd2127ed30, n=n@entry=4, name=..., method=(void (rocksdb::Benchmark::*)(rocksdb::Benchmark * const, rocksdb::ThreadState *)) 0x561ab0 <rocksdb::Benchmark::RandomTransaction(rocksdb::ThreadState*)>) at tools/db_bench_tool.cc:2759
#3  0x000000000056d9d7 in rocksdb::Benchmark::Run (this=this@entry=0x7ffd2127ed30) at tools/db_bench_tool.cc:2638
#4  0x000000000054d481 in rocksdb::db_bench_tool (argc=1, argv=0x7ffd2127f4c8) at tools/db_bench_tool.cc:5472
#5  0x00007fa473df6bb5 in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000054c201 in _start ()

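This gives a bit more to go on, for example the pipelined write path⁵. wolfkdy was certain it was a db_bench bug (thanks; I was nowhere near that confident). I then found the fix in the RocksDB release notes: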

5.15.0 (7/17/2018)
......
Bug Fixes
Fix deadlock with **enable_pipelined_write=true** and max_successive_merges > 0
Check conflict at output level in CompactFiles.
Fix corruption in non-iterator reads when mmap is used for file reads
Fix bug with prefix search in partition filters where a shared prefix would be ignored from the later partitions. The bug could report an eixstent key as missing. The bug could be triggered if prefix_extractor is set and partition filters is enabled.
Change default value of bytes_max_delete_chunk to 0 in NewSstFileManager() as it doesn't work well with checkpoints.
Fix a bug caused by not copying the block trailer with compressed SST file, direct IO, prefetcher and no compressed block cache.
Fix write can stuck indefinitely if enable_pipelined_write=true. The issue exists since pipelined write was introduced in 5.5.0.

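I searched the db_bench wiki page for this parameter and found nothing (the page had probably not been updated in a long time; I have since added it). It is listed on the Pipelined Write page.

The db_bench help output also lists this parameter, which I had not thought to check. Lesson for next time: read the help or man page that ships with the tool first.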
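Checking the tool's own help text for a flag is a one-liner. The sketch below wraps it in a hypothetical `has_flag` helper; `fake_bench` is a stand-in defined only so the example runs without a db_bench binary, and its help text is invented.

```shell
#!/usr/bin/env bash
# has_flag <command> <flag-name>: succeed if the command's --help output
# mentions the flag name anywhere.
has_flag() {
  "$1" --help 2>&1 | grep -q -- "$2"
}

# Stand-in for a real binary; the help text is made up for the sketch.
fake_bench() {
  echo "    -enable_pipelined_write (made-up help text for the sketch)"
}

if has_flag fake_bench enable_pipelined_write; then
  found=yes
else
  found=no
fi
echo "enable_pipelined_write listed: $found"
```

With a real build, the call would be `has_flag "$dbb" enable_pipelined_write`, where `$dbb` is the db_bench path used by the script.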

After adding enable_pipelined_write=false I ran a fresh set of tests, and the results match expectations:

    dop  mt   notrx   pessim  optim
    1    mt0  39070   22716   23107
    1    mt1  39419   22649   23345
    2    mt0  60962   33602   27778
    2    mt1  66347   35297   31959
    4    mt0  63993   42740   26964
    4    mt1  91138   50720   28831
    8    mt0  81788   52713   25167
    8    mt1  141298  72900   25832
    16   mt0  90463   62032   21954
    16   mt1  194290  100470  21581
    24   mt0  87967   64610   20957
    24   mt1  226909  111770  20506
    32   mt0  88986   65632   20474
    32   mt1  110627  123805  20040
    40   mt0  86774   66612   19835
    40   mt1  113140  58720   19886
    48   mt0  86848   68086   19611
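One way to wire the workaround into the second script is to pass it through the third argument of `runme`, which ends up unquoted at the end of the db_bench command line and is therefore split into separate flags. The sketch below is an assumption about how to do that, not the exact change I made: `runme` here is a stub that only prints the flags it would append, and `workaround` is a name introduced for the example.

```shell
#!/usr/bin/env bash
# Stub standing in for the real runme: it prints the flags that would be
# appended to the db_bench command line, so the sketch runs anywhere.
runme() {
  a_concurmt=$1
  a_dop=$2
  a_extra=$3
  # $a_extra is left unquoted on purpose, as in the real script, so a string
  # holding several flags is split into separate arguments.
  echo --allow_concurrent_memtable_write=$a_concurmt --threads=$a_dop $a_extra
}

# Keep the workaround in one variable so every variant gets it.
workaround="--enable_pipelined_write=false"

cmd_notrx=$(runme 1 4 "$workaround")
cmd_pessim=$(runme 1 4 "--transaction_db=1 $workaround")
cmd_optim=$(runme 1 4 "--optimistic_transaction_db=1 $workaround")
printf '%s\n' "$cmd_notrx" "$cmd_pessim" "$cmd_optim"
```

Passing the flag to all three variants keeps the notrx, pessim and optim columns comparable.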

References

A workaround for when gist is blocked: https://blog.jiayu.co/2018/06/an-alternative-github-gist-viewer/ (this helped a lot)
A reference test: https://github.com/facebook/rocksdb/issues/4402
Introduction to db_bench; note that it did not mention the hidden parameter enable_pipelined_write, which defaults to true: https://github.com/facebook/rocksdb/wiki/Benchmarking-tools
poor man's profiler: https://poormansprofiler.org/ (thanks to mack)
Pipelined write performance improvement: https://github.com/facebook/rocksdb/wiki/Pipelined-Write ; test results: https://gist.githubusercontent.com/yiwu-arbug/3b5a5727e52f1e58d1c10f2b80cec05d/raw/fc1df48c4fff561da0780d83cd8aba2721cdf7ac/gistfile1.txt
An engineer at DiDi fixed this bug; the link walks through the analysis: https://bravoboy.github.io/2018/09/11/rocksdb-deadlock/

why

reference

reference

reference

reference

if/else state machine

switch/enum state machine

Inheritance (State pattern)

std::variant + std::visit

coroutines + loop

coroutines + goto

coroutines + functions + variant

boost statechart

boost msm

boost sml

reference

reference

Partial sorting

Containers

Combining the two scenarios

The reasons behind it

reference

reference

why

reference

References

Inheritance (State pattern)