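Removing the key lock when adapting pika

14 Jul 2020

This is a colleague's experience from a modification. It was done for a specific business case, but the idea sidesteps the problem so thoroughly that I copied it down directly.

The current model is a slot-proxy-shard structure: a proxy layer on top forwards requests, and each shard node (that is, pika) is responsible for only part of the data.

But pika itself takes a per-key lock.

For example: https://github.com/Qihoo360/blackwidow/blob/ae38f5b4c5c01c7f8b9deec58db752e056659264/src/redis_lists.cc#L273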

Status RedisLists::LInsert(const Slice& key,
                           const BeforeOrAfter& before_or_after,
                           const std::string& pivot,
                           const std::string& value,
                           int64_t* ret) {
  *ret = 0;
  rocksdb::WriteBatch batch;
  ScopeRecordLock l(lock_mgr_, key);
  std::string meta_value;
  Status s = db_->Get(default_read_options_, handles_[0], key, &meta_value);
  if (s.ok()) {
    ParsedListsMetaValue parsed_lists_meta_value(&meta_value);
    if (parsed_lists_meta_value.IsStale()) {
      return Status::NotFound("Stale");
    } else if (parsed_lists_meta_value.count() == 0) {
      return Status::NotFound();
    } else {
        ...

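There is already a proxy layer forwarding requests in front. Whether it routes by hash or by range doesn't matter. What matters is that by the time a request reaches the compute node the key range has already been narrowed, and yet a lock is still taken. That lock can be optimized away.

Guarantee the order of the keys that land on this db: if requests with the same hash/range use the same connection, command order is preserved, there is no lock problem, and the lock can be dropped. The lock problem is solved from the layer above.
- The shard node's command-processing threads must make sure that requests with the same hash/range are handled by the same connection thread. A little more computation buys the removal of a lock.

Standalone pika uses a single db, and with a single db it is hard to remove the key lock. To apply the change above it also has to move to a multi-db mode: turn each db into a slot, partition threads by slot, and dispatch commands by thread, which preserves command order and removes the lock.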
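The slot-to-thread idea can be sketched roughly as below. This is not pika's code: `SlotWorker`, `ShardedStore`, `SlotData` and `Command` are names made up for illustration, and a `std::map` stands in for the per-slot db. The only lock left protects the command queue, not the keys.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins: a per-slot "db" and a command that mutates it.
using SlotData = std::map<std::string, std::string>;
using Command = std::function<void(SlotData&)>;

// One worker thread per slot. Only this thread touches data_, so commands
// on keys in the slot need no per-key lock.
class SlotWorker {
 public:
  SlotWorker() : thread_([this] { Run(); }) {}
  ~SlotWorker() {
    {
      std::lock_guard<std::mutex> guard(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    thread_.join();
  }
  void Submit(Command cmd) {
    {
      std::lock_guard<std::mutex> guard(mu_);
      queue_.push(std::move(cmd));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      Command cmd;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stop requested and nothing left to run
        cmd = std::move(queue_.front());
        queue_.pop();
      }
      cmd(data_);  // runs outside the queue lock, in submission order
    }
  }

  std::mutex mu_;  // guards queue_ and stop_ only, never the keys
  std::condition_variable cv_;
  std::queue<Command> queue_;
  bool stop_ = false;
  SlotData data_;
  std::thread thread_;  // declared last so it starts after the other members exist
};

class ShardedStore {
 public:
  explicit ShardedStore(std::size_t slots) {
    assert(slots > 0);
    for (std::size_t i = 0; i < slots; ++i)
      workers_.push_back(std::make_unique<SlotWorker>());
  }
  // Same key -> same slot -> same thread, so per-key command order is kept.
  void Execute(const std::string& key, Command cmd) {
    std::size_t slot = std::hash<std::string>{}(key) % workers_.size();
    workers_[slot]->Submit(std::move(cmd));
  }
  // Reads go through the same queue, so they observe all earlier writes.
  std::string Get(const std::string& key) {
    auto result = std::make_shared<std::promise<std::string>>();
    std::future<std::string> value = result->get_future();
    Execute(key, [key, result](SlotData& data) { result->set_value(data[key]); });
    return value.get();
  }

 private:
  std::vector<std::unique_ptr<SlotWorker>> workers_;
};
```

A multi-step command (several gets followed by a write) becomes one `Command`, and it runs without interleaving with other commands on the same slot, which is what the key lock used to guarantee.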

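Removing the key lock pays off a lot, especially when a single command issues many gets. Gets are latency-heavy, so the lock is held for a long time, and the effect adds up.

Optimizing RocksDB close and open time

13 Jul 2020

This is a colleague's tuning experience; I copied it down directly.

Much of it is commonly mentioned.

RocksDB configuration:

Speed up compaction by setting compaction_readahead_size. This is routine. Try different values; both 16k and 256k are in use
Shrink the logs: max_manifest_file_size, max-total-wal-size (max_total_wal_size in RocksDB's options)
Stop compaction when closing the db
- RocksDB may still have compaction jobs running, which can take time, so stopping them is better
Close actively flushes, and the flush may trigger compaction and a write stall. Skipped for now
Open reads the WAL to rebuild the memtable, so ideally no WAL is left: flush it out at close time
The target base size (targetbase) and the multiplier should be tuned to your own storage. For example, when writing to HDFS, 60M is a reasonable setting and avoids frequent metadata updates

Opening RocksDB with too many files: GetFileSize takes too long

I saw a PR, https://github.com/facebook/rocksdb/pull/6353/files, which in version 6.8 avoids calling GetFileSize during the open phase
Our colleague's solution was to hack the place where OnAddFile is called and add the files from multiple threads coordinated with a cv.

The two optimizations sit in different places: one in the recovery/verification stage of open, the other in the open stage itself. In the open stage OnAddFile calls GetFileSize, so before OnAddFile the file sizes are looked up in one pass to avoid GetFileSize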

    std::vector<LiveFileMetaData> metadata;

    impl->mutex_.Lock();
    impl->versions_->GetLiveFilesMetaData(&metadata);
    impl->mutex_.Unlock();

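~~The later OnAddFile stage then doesn't need to query the size; it uses the metadata obtained here~~. However, some files cannot be found this way. For those, stock RocksDB uses GetFileSize; our colleague queries them asynchronously from multiple threads, which is somewhat faster than querying one by one as stock RocksDB does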

A brief analysis of dynomite

09 Jul 2020


Code: https://github.com/Netflix/dynomite

Learning/exploring mongo

05 Jul 2020

Slides link; the material is all compiled from other sources

Installing and demoing on macOS

# install
brew tap mongodb/brew
brew install mongodb-community@4.4
# start
brew services start mongodb-community@4.4
# stop
brew services stop mongodb-community@4.4
# mongo shell
mongo 127.0.0.1:27017

Mapping between mongo and SQL concepts

SQL term/concept	MongoDB term/concept
database	database
table	collection
row	document or BSON document
column	field
index	index
table joins	$lookup, embedded documents
primary key: any unique column or column combination can be specified as the primary key	primary key: in MongoDB the primary key is automatically set to the _id field
aggregation (e.g. group by)	aggregation pipeline. See: SQL to Aggregation Mapping Chart
SELECT INTO NEW_TABLE	$out. See: SQL to Aggregation Mapping Chart
MERGE INTO TABLE	$merge (available since MongoDB 4.2). See: SQL to Aggregation Mapping Chart
transactions	transactions

Corresponding binaries

	MongoDB	MySQL
Database server	mongod	mysqld
Database client	mongo	mysql
Replication log	oplog	binlog
Recovery log	journal	redolog

Latest oplog timestamp	snapshot	status
t0	snapshot0	committed
t1	snapshot1	uncommitted
t2	snapshot2	uncommitted
t3	snapshot3	uncommitted

ref

https://www.runoob.com/mongodb/mongodb-osx-install.html
https://aotu.io/notes/2020/06/07/sql-to-mongo-1/index.html

Thread safety annotations and a small std::priority_queue usage

02 Jul 2020

I came across code like this while casually browsing a timer queue implementation

  mutable mutex mu_;
  condition_variable cv_;
  std::thread timer_thread_;
  std::atomic<bool> stop_{false};
  std::priority_queue<RCReference<TimerEntry>,
                      std::vector<RCReference<TimerEntry>>,
                      TimerEntry::TimerEntryCompare>
      timers_ TFRT_GUARDED_BY(mu_);

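This GUARDED_BY made me curious. A quick check shows it is a clang facility

In short, it is a multithreading helper that ships with the clang compiler: thread safety annotations, implemented as extensions of __attribute__

For example __attribute__((guarded_by(mutex)))

Declaring the dependency this way makes problems easier to locate

To use it, compile with -Wthread-safety-analysis (it is part of the -Wthread-safety group)

I didn't find a similar tool in gcc. A pity.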

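Also, the std::priority_queue used in these timer queue implementations is interesting: they all spell out the container argument. The element is not a built-in type and has no operator<, so a comparator has to be passed as the third template argument, which means the container has to be named as well.

RocksDB's timer queue looks like this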

  // Inheriting from priority_queue, so we can access the internal container
  class Queue : public std::priority_queue<WorkItem, std::vector<WorkItem>,
                                           std::greater<WorkItem>> {
   public:
    std::vector<WorkItem>& getContainer() { return this->c; }

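It exposes the underlying container directly. Quite novel. Presumably the design keeps c as a protected member precisely so that it can be exposed like this.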
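Both points fit in a few lines. The `Timer`, `LaterDeadline` and `TimerQueue` names below are made up for illustration; the only standard-library fact relied on is that `std::priority_queue` stores its container in the protected member `c`.

```cpp
#include <queue>
#include <vector>

// Hypothetical timer entry: not a built-in type and no operator<.
struct Timer {
  long deadline_ms;
  int id;
};

// Comparator that puts the earliest deadline on top (a min-heap).
struct LaterDeadline {
  bool operator()(const Timer& a, const Timer& b) const {
    return a.deadline_ms > b.deadline_ms;
  }
};

// The comparator is the third template argument, so the container
// (second argument) has to be written out too.
class TimerQueue
    : public std::priority_queue<Timer, std::vector<Timer>, LaterDeadline> {
 public:
  // `c` is the protected underlying container of std::priority_queue.
  const std::vector<Timer>& container() const { return this->c; }
};
```

Returning a const reference is a deliberate choice in this sketch: mutating `c` from outside could break the heap invariant, whereas read-only access (for scanning or cancel-by-id lookups) is safe.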

ref

https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
- These macros can be copied over directly: http://clang.llvm.org/docs/ThreadSafetyAnalysis.html#mutex-h
  - https://github.com/tensorflow/runtime/blob/1f60e4778e91d9932ac04647769a178a9646c0a7/include/tfrt/support/thread_annotations.h copied directly
Paper on the theory https://research.google.com/pubs/archive/42958.pdf
Slides https://llvm.org/devmtg/2011-11/Hutchins_ThreadSafety.pdf
Usage 1 https://stackoverflow.com/questions/40468897/clang-thread-safety-with-stdcondition-variable
Usage 2 https://zhuanlan.zhihu.com/p/47837673
std::priority_queue: see the member objects section https://en.cppreference.com/w/cpp/container/priority_queue
Summary of timer implementations https://www.ibm.com/developerworks/cn/linux/l-cn-timers/index.html a very well written article

Focus on maintaining the timer set with a min-heap (priority queue), and on timing wheels

https://www.zhihu.com/question/68451392 Managing timers doesn't necessarily need a timer queue; a brute-force scan also works as long as there aren't many timers
Timing wheel in Kafka https://club.perfma.com/article/328984
https://www.cnblogs.com/zhongwencool/p/timing_wheel.html His blog is nicely done...

June reading backlog (need review)

24 Jun 2020

I notice this pile keeps growing

https://github.com/YongjunHe/corobase

https://hal.inria.fr/file/index/docid/555588/filename/techreport.pdf

oatpp

https://github.com/oatpp/oatpp#api-controller-and-request-mapping

continuable

https://naios.github.io/continuable/

https://chubaofs.github.io/chubaodb/zh-CN/config.html

https://hammertux.github.io/slab-allocator

coroutine

https://luncliff.github.io/coroutine/articles/combining-coroutines-and-pthread_create/

https://www.jianshu.com/u/bb58761c6c04

distributed systems

https://cloud.tencent.com/developer/article/1015442

network

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#

https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/

crdt

https://techbeacon.com/app-dev-testing/how-simplify-distributed-app-development-crdts

https://redislabs.com/redis-enterprise/technology/active-active-geo-distribution/

mongo

https://mongoing.com/archives/6102

https://blog.csdn.net/baijiwei/article/details/78070355

https://blog.csdn.net/baijiwei/article/details/78303200

https://zhuanlan.zhihu.com/c_1047791597869199360

redis latency analysis

https://github.com/moooofly/MarkSomethingDown/blob/master/Redis/Redis%20%E8%AE%BF%E9%97%AE%E5%BB%B6%E8%BF%9F%E9%97%AE%E9%A2%98%E5%88%86%E6%9E%90.md

brpc

https://zhuanlan.zhihu.com/p/113427004

Introduction to hbase architecture

https://niyanchun.com/hbase-introduction-extend.html

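I mostly care about high availability, load balancing (how range server splitting is done), and how strong consistency is implemented

Most of what is out there is how-to documentation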

bluestore

https://zhuanlan.zhihu.com/p/45084771

https://zhuanlan.zhihu.com/p/46362124

making tcp fast

There should be a Chinese version

https://netdevconf.info/1.2/papers/bbr-netdev-1.2.new.new.pdf

network 101

https://hpbn.co/building-blocks-of-udp/#null-protocol-services

brendan gregg blog posts

capturing tcp

http://www.brendangregg.com/blog/2018-03-22/tcp-tracepoints.html

bpf

http://www.brendangregg.com/bpf-performance-tools-book.html

nginx low latency and high throughput talk slides

https://www.nginx.com/blog/optimizing-web-servers-for-high-throughput-and-low-latency/

huge pages and TLB

https://www.mnstory.net/2016/06/30/qemu-hugepages/

transparent huge pages again

https://www.percona.com/blog/2019/03/06/settling-the-myth-of-transparent-hugepages-for-databases/

sifrdb, a KV store targeting write amplification

https://nan01ab.github.io/2019/02/LSM-Trees(2).html

jungle

btree + lsm https://www.usenix.org/system/files/hotstorage19-paper-ahn.pdf

https://www.usenix.org/sites/default/files/conference/protected-files/hotstorage19_slides_ahn.pdf

writing a fast json parser

https://chadaustin.me/2017/05/writing-a-really-really-fast-json-parser/

https://chadaustin.me/2013/01/sajson-why-the-parse-tree-is-big-enough/

smfrpc https://smfrpc.github.io/smf/

years watching https://nwat.xyz/blog/2018/01/15/systems-research-im-watching-for-in-2018/

Linux performance tuning guide

https://lihz1990.gitbooks.io/transoflptg/content/01.%E7%90%86%E8%A7%A3Linux%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F/1.1.Linux%E8%BF%9B%E7%A8%8B%E7%AE%A1%E7%90%86.html

Understanding iostat in depth

https://bean-li.github.io/dive-into-iostat/

Why isn't there a template diagram where one picture is worth a thousand words?

Learn to read top in ten minutes

https://juejin.im/post/5d590126f265da03db0776b6

Why isn't there a template diagram where one picture is worth a thousand words??

tcpdump

https://danielmiessler.com/study/tcpdump/

Why isn't there a template diagram where one picture is worth a thousand words???

Introduction to boltdb

https://lrita.github.io/2017/05/21/boltdb-overview-0/

Comparison of common dbs

https://cloud.tencent.com/developer/article/1067439

SILT – A Memory-Efficient, High-Performance Key-Value Store

https://nan01ab.github.io/2018/04/SILT.html

SILT revisited

http://blog.foool.net/2012/06/%E5%86%8D%E8%AE%AEsilt-a-memory-efficient-high-performance-key-value-store/

taurus db https://nan01ab.github.io/2020/06/Taurus.html

evendb

hotspot optimization

https://www.jianshu.com/p/bc6a5ee0d3db

https://nan01ab.github.io/2020/06/KV-Store(2).html

This person's blog is good

https://www.jianshu.com/u/bb58761c6c04

https://sekwonlee.github.io/files/nvmw20_splitfs.pdf

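(cppcon) Revisiting some old programming guidelines

14 Jun 2020

goto harmful?

goto is a bit closer to assembly.

This feels like a well-worn topic. The common use of goto is still error-code exit paths. The talk also cited papers, including Knuth's

goto in C++: it could jump over a constructor and skip initialization (compile error / compiler warning), so take care (is setjmp the same?)

Early exit from a loop: the goto version is more efficient (why??)

goto inside switch: C++ has no corresponding usage. Something like Duff's device????? Don't unroll loops by hand; let the compiler do that job

Using switch already counts as using goto

The talk discussed usage patterns in other languages; I didn't record those, only the C++-related points

Cannot jump past the initialization of a complex type, i.e. a non-vacuous initialization. I first read that as "no default constructor"; more precisely, it means the variable has an initializer or its type has a non-trivial default constructor
Cannot jump out of or into a function
Cannot be used in a constexpr function
Cannot jump into a try/catch block; jumping out is fine (using goto when you already have try-catch feels a bit split-minded)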
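A small sketch of the "leave nested loops early" use of goto together with the initialization rule. `FindCell` is a made-up example, not code from the talk.

```cpp
#include <cstddef>
#include <vector>

// Returns true and sets row/col when `target` is found in `grid`.
bool FindCell(const std::vector<std::vector<int>>& grid, int target,
              std::size_t& row, std::size_t& col) {
  bool found = false;  // initialized before any goto, so no jump skips it
  for (std::size_t r = 0; r < grid.size(); ++r) {
    for (std::size_t c = 0; c < grid[r].size(); ++c) {
      if (grid[r][c] == target) {
        row = r;
        col = c;
        found = true;
        goto done;  // leaves both loops at once; jumping out of a scope is fine
      }
    }
  }
  // Putting a declaration such as `int x = 0;` here would be ill-formed:
  // the goto above would jump past an initialization into x's scope.
done:
  return found;
}
```

The same function can be written without goto by returning from inside the inner loop, which is the multi-return style the talk goes on to discuss.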

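Centralize function exits.

(Indeed, code with many return paths is painful. I have written that kind of code myself, and it is unclear.)

For the inefficient return-inside-for case discussed above, you can tighten the loop condition and break early, which has the same effect as goto

When the return value is complex, e.g. a variant, use the overload trick, which acts as a switch over the variant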
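For reference, the overload trick looks roughly like this in C++17. `ParseResult`, `ParsePositive` and `Describe` are invented for the example; `overloaded` is the usual helper as shown on cppreference's `std::visit` page.

```cpp
#include <string>
#include <variant>

// The "overload trick": inherit operator() from every lambda passed in.
template <class... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;  // deduction guide (needed in C++17)

// Hypothetical result type: either a value or an error message.
using ParseResult = std::variant<int, std::string>;

ParseResult ParsePositive(int raw) {
  if (raw <= 0) return std::string("not positive");
  return raw;
}

// std::visit picks the lambda matching the active alternative,
// which is the "switch over a variant" mentioned above.
std::string Describe(const ParseResult& result) {
  return std::visit(
      overloaded{
          [](int value) { return "ok: " + std::to_string(value); },
          [](const std::string& error) { return "error: " + error; },
      },
      result);
}
```

If an alternative is left unhandled the call to `std::visit` fails to compile, which a plain switch on a type tag does not give you.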

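There are a few exceptions, such as saving a construction. Pick what suits you

Private access for member variables

Encapsulate invariants. Should you design up front even for what you don't use yet, with something like C# properties? Class encapsulation has different semantics from a transparent struct. Same advice as always: don't over-design for what you don't need

Initialize at declaration

It might end up written as a function declaration. ecpp has an item on this. In some scenarios you don't have to initialize what you don't use; don't pay extra

Then there is declaring at the top of the function, which is a C habit. The variable may go unused and the construction is wasted. Other languages are the same: declare when you use it. Two-phase initialization, factory pattern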

ref

https://www.bilibili.com/video/BV1pJ411w7kh?p=12
Slides are here https://github.com/CppCon/CppCon2019/blob/master/Presentations/some_programming_myths_revisited/some_programming_myths_revisited__patrice_roy__cppcon_2019.pdf

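(cppcon) When zero-cost abstraction fails: how to fix a compiler problem?

14 Jun 2020

The speaker wrote pythran, a Python-to-C++ program. Impressive. The talk is the story of hunting a bug in code generated by pythran, and it involves LLVM

He showed a simple add routine whose performance was much worse: a compiler problem (how did he narrow it down so fast?)

First find a minimal reproducer. The C implementation is vectorized, but the IR generated by LLVM is not

Use clang's -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize

The analysis shows the loop was not optimized

Then read the LLVM code, build a debug version, and look at the output

PHI node detection is at fault

Look at the checker's conditions

Is the inttoptr <=> ptrtoint logic at fault?

This is where I couldn't quite follow; I don't know LLVM. I need to read up on it

The author made a patch that removes it, verified it, and the loop got vectorized

The deeper problem is in SROA; a bug has been filed and fixed

Back to the title: zero-cost abstraction is great, but it depends on the compiler to deliver the optimization

Is there a guaranteed minimum level of optimization from compilers? No. So you need to understand this and know how far optimization goes. The author's advice is to look at the IR output, compare, run omit and analyze, and learn the LLVM IR for C. Simple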

ref

https://github.com/serge-sans-paille/pythran
https://www.bilibili.com/video/BV1pJ411w7kh?p=154
Slides not found

todo

Read up on LLVM material

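(cppcon) Modern C++ debugging tools and techniques on Linux

14 Jun 2020

What page three of the slides introduces can hardly be called modern. I really hadn't used rr, though

gdb

gdb -> ptrace -> signal

strace also uses ptrace

Breakpoints and single-stepping are driven through ptrace(PTRACE_CONT); the signal delivered is SIGTRAP, and exit is SIGINT

debug registers?? First time I've heard of them

DWARF info details

PC information
Stack information
Type information, function prototypes, ...

readelf --debug-dump

info signals shows all signals and what triggers them

If debug symbols were optimized away, use -g3 (any performance impact?)

Stack and stack-pointer optimizations, CFA. Note that this can be used to unwind the stack (apparently not allowed for security reasons?)

libthread_db library, used for debugging

rr

Not covered in detail

valgrind, sanitizers

The implementation of malloc/free has hidden details, so accidental out-of-bounds accesses cause problems. Both tools are for catching this kind of issue

cppcheck, coverity

Someone from Coverity gave this part. I had seen these slides before; they also visited our company once. The checks for dead code, infinite loops and out-of-bounds accesses are quite effective

A brief introduction to how it works? All the checkers are predefined, and the call graph is used to find abnormal nodes?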

ref

https://www.bilibili.com/video/BV1pJ411w7kh?p=15
ppt https://github.com/CppCon/CppCon2019/tree/master/Presentations/modern_linux_cpp_debugging_tools__under_the_covers

(Translation) The Hunt for the Fastest Zero

13 Jun 2020

A scenario: fill a char array of length n with zeros. In C everyone would use memset. The topic of this article is C++, so let's write it in C++, like this

void fill1(char *p, size_t n) {
    std::fill(p, p + n, 0);
}

However, adding just a few characters makes it 29 times faster. It is easy to write code that outperforms the snippet above, like this

void fill2(char *p, size_t n) {
    std::fill(p, p + n, '\0');
}

The author used -O2

Function	Bytes/Cycle
fill1	1.0
fill2	29.1

What is the difference between the two versions? Look at the assembly

fill1 looks like this

fill1(char*, unsigned long):
      add rsi, rdi
      cmp rsi, rdi
      je .L1

.L3:
      mov BYTE PTR [rdi], 0 ; store 0 at the address in rdi
      add rdi, 1            ; rdi++
      cmp rsi, rdi          ; compare rdi with the end pointer
      jne .L3               ; keep looping at .L3
.L1:
      ret

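This code clearly assigns one byte at a time. By the methodology in reference link 2, the main bottleneck is one taken branch and one store per cycle. fill2, however, is completely different

fill2: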

fill2(char*, unsigned long):
        test rsi, rsi
        jne .L8
        ret
.L8:
        mov rdx, rsi
        xor esi, esi
        jmp memset ; tail call to memset

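We won't analyze here why memset is faster. It certainly beats a hand-written byte loop: it unrolls loops and eliminates many branches

But why doesn't the first version call memset directly? The author first suspected the compiler and tried -O3, and with that both versions were optimized to memset

But the real reason lies in the implementation of std::fill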

  /*
   *  ...
   *
   *  This function fills a range with copies of the same value.  For char
   *  types filling contiguous areas of memory, this becomes an inline call
   *  to @c memset or @c wmemset.
  */
  template<typename _ForwardIterator, typename _Tp>
  inline void fill(_ForwardIterator __first, _ForwardIterator __last, const _Tp& __value)
  {
    std::__fill_a(std::__niter_base(__first), std::__niter_base(__last), __value);
  }

std::fill is optimized based on certain traits. In which cases? Look at std::__fill_a

  template<typename _ForwardIterator, typename _Tp>
  inline typename
  __gnu_cxx::__enable_if<!__is_scalar<_Tp>::__value, void>::__type
  __fill_a(_ForwardIterator __first, _ForwardIterator __last, const _Tp& __value)
  {
    for (; __first != __last; ++__first)
      *__first = __value;
  }

  // Specialization: for char types we can use memset.
  template<typename _Tp>
  inline typename
  __gnu_cxx::__enable_if<__is_byte<_Tp>::__value, void>::__type
  __fill_a(_Tp* __first, _Tp* __last, const _Tp& __c)
  {
    const _Tp __tmp = __c;
    if (const size_t __len = __last - __first)
      __builtin_memset(__first, static_cast<unsigned char>(__tmp), __len);
  }

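From this SFINAE rule you can see that the memset call is only selected when T satisfies __is_byte. In fill1, T is deduced as int (the type of the integer literal 0), so the memset version is not selected. The call is equivalent to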

std::fill<char *, int>(p, p + n, 0);

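Specifying the function template arguments explicitly, instead of letting the compiler deduce them, also triggers the optimization, as in fill3 below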

void fill3(char * p, size_t n) {
    std::fill<char *, char>(p, p + n, 0);
}

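Turning the byte-by-byte loop into memset is done by the compiler's optimizer (how? idiom recognition): gcc -O3 / clang -O2

For the second version, instead of passing '\0' you can also use static_cast<char>(0)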
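The deduction difference can be checked at compile time. `DeducesChar` and `FillZero` below are hypothetical helpers written for this note; they only show which type the value argument is deduced as, which is what decides whether the byte overload of `__fill_a` is eligible.

```cpp
#include <algorithm>
#include <cstddef>
#include <type_traits>

// Reports whether the value argument's type is deduced as char,
// mirroring how std::fill deduces _Tp from its third argument.
template <typename T>
constexpr bool DeducesChar(const T&) {
  return std::is_same<T, char>::value;
}

static_assert(!DeducesChar(0), "the literal 0 deduces as int");
static_assert(DeducesChar('\0'), "a char literal deduces as char");
static_assert(DeducesChar(static_cast<char>(0)), "so does an explicit cast");

// The cast version: the value type is char, so on the libstdc++ shown above
// the memset specialization can be selected.
void FillZero(char* p, std::size_t n) {
  std::fill(p, p + n, static_cast<char>(0));
}
```

Whether the fast path is actually taken still depends on the standard library version and the optimization level; the static_asserts only confirm the type deduction.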

Later the author gives a patch to the standard library: the value type no longer has to match the pointer's element type

  template<typename _Tp, typename _Tvalue>
  inline typename
  __gnu_cxx::__enable_if<__is_byte<_Tp>::__value, void>::__type
  __fill_a(_Tp* __first, _Tp* __last, const _Tvalue& __c)
  {
    const _Tvalue __tmp = __c;
    if (const size_t __len = __last - __first)
      __builtin_memset(__first, static_cast<unsigned char>(__tmp), __len);
  }

But this change doesn't work for user-defined types

struct conv_counting_int {
    int v_;
    mutable size_t count_ = 0;

    operator char() const {
        count_++;
        return (char)v_;
    }
};

size_t fill5(char *p, size_t n) {
    conv_counting_int zero{0};
    std::fill(p, p + n, zero);
    return zero.count_;
}

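The return value is 1 rather than n: the optimization changes the result. For this case it is better to make such user-defined types ineligible, for example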

  template<typename _Tpointer, typename _Tp>
    inline typename
    __gnu_cxx::__enable_if<__is_byte<_Tpointer>::__value && __is_scalar<_Tp>::__value, void>::__type
    __fill_a( _Tpointer* __first,  _Tpointer* __last, const _Tp& __value) {
      ...

ref

https://travisdowns.github.io/blog/2020/01/20/zero.html
Worth reading https://travisdowns.github.io/blog/2019/06/11/speed-limits.html
This person's blog is excellent: https://travisdowns.github.io and all of it is worth reading