blog review 第二十一期 · 王很水的笔记

集群资源调度系统设计架构总结

看了，感觉没啥用，没讲调度算法。然后我误入歧途，跑去看调度学去了。哪个公式我眼睛快瞎了，看不懂

Hadoop YARN：调度性能优化实践

这个才是最值得看的，很多实践经验

Introducing OkayWAL: A write-ahead log for Rust

需要研究一下 sharded-log 和 okaywal 区别

Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask

你怎么知道我不会啊

Query parameter data types and performance

prepare 占位符语句，查询的类型不准确，导致优化器没能最优优化。数字被映射成numbric而不是bigint。怎么感觉是java接口的问题

Puzzling Postgres: a story of solving an unreproducible performance issue

这个例子和上面的差不多，不过是时间戳类型，时间戳类型，pg有bug

UNION ALL, data types and performance

案发现场

CREATE SEQUENCE seq;
 
CREATE TABLE bird (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   wingspan real NOT NULL,
   beak_size double precision NOT NULL
);
 
CREATE TABLE bat (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   wingspan numeric NOT NULL,
   body_temperature numeric NOT NULL
);
 
CREATE TABLE cat (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   body_temperature numeric NOT NULL,
   tail_length numeric
);
 
CREATE VIEW flying_animal AS
   SELECT id, wingspan FROM bird
UNION ALL
   SELECT id, wingspan FROM bat;
 
CREATE VIEW mammal AS
   SELECT id, body_temperature FROM bat
UNION ALL
   SELECT id, body_temperature FROM cat;

两边子查询类型对不上，优化器没匹配最佳类型

CREATE OR REPLACE VIEW flying_animal AS
   SELECT id, wingspan FROM bird
UNION ALL
   SELECT id, wingspan::real FROM bat;

How PlanetScale Boost serves your SQL queries instantly

核心还是Noria论文的点子，类似的公司materialize，都是利用这个增量状态构造一个视图，还不是物化视图那种，也不是那种简单的cache。中间态memsql？主要是有窗口的概念。

The technology behind GitHub’s new code search

有句讲句，我觉得github搜索很垃圾

Features I’d like in PostgreSQL

–i-am-a-dummy 所有update delete都过滤掉，哈哈，可以
Unit test mode (random result sorting) 结果随机排序，除非你用ORDER BY
Query progress in psql 能看到进度
Pandas-like join validation 这个不太懂

SELECT x, y
FROM t1
JOIN t2 USING (key) VALIDATE 1:m

JIT support for CREATE INDEX
Reduce the memory usage of prepared queries 这俩不懂
Internode Cache Thrashing: Hunting a NUMA Performance Bug

这个定位非常非常非常精彩

首先，perf

 	
sudo perf stat -C8 --timeout 10000

火焰图

git clone https://github.com/brendangregg/FlameGraph
git -C FlameGraph remote add adamnovak https://github.com/adamnovak/FlameGraph
git -C FlameGraph fetch adamnovak
git -C FlameGraph cherry-pick 7ff8d4c6b1f7c4165254ad8ae262f82668c0c13b # C++ template display fix
 
x=remote
sudo timeout 10 perf record --call-graph=fp -C8 -o $x.data
sudo perf script -i $x.data > $x.perf
FlameGraph/stackcollapse-perf.pl $x.perf > $x.folded
FlameGraph/flamegraph.pl $x.folded > $x.svg

查到 compact_radix_tree::tree::get_at() and database::apply(). 有问题

sudo perf annotate -i $x.data

代码已经找到，但是为啥？？

查事件

sudo perf stat --timeout 1000000 -C8 ...events... -x\t 2>&1 | sed 's/<not counted>/0/g'

需要关注的事件

CPU_CYCLES, obviously, because we were doing the measurement for the same amount of time in both cases.
LDREX_SPEC “exclusive operation speculatively executed” — but since it happens only 1,000 times per second, it can’t possibly be the cause.
EXC_UNDEF “number of undefined exceptions taken locally” — I don’t even know what this means, but it doesn’t seem like a reasonable bottleneck.
STALL_BACKEND only supports our suspicion that the CPU is bottlenecked on memory somehow.
REMOTE_ACCESS

REMOTE_ACCESS明显离谱了，seastar已经绑核，哪里来的跨核访问？？？

程序本身的静态数据跨核了？？？？

sudo cat /proc/$(pgrep -x scylla)/numa_maps

N0=x N1=y means that x pages in the address range are allocated on node 0 and y pages are allocated on node 1. By cross-referencing readelf --headers /opt/scylladb/libexec/scylla we can determine that .text, .rodata and other read-only sections are on node 0, while .data, .bss and other writable sections are on node 1.

发现这几个段不在一个核？？不应该啊

强制绑核，发现问题确实如此 /usr/bin/numactl --membind 1 to /usr/bin/scylla scylla_args…:

用mbind分析为什么，发现了一个page有共享问题，那就是cacheline颠簸了

Using this ability, we discover that only one page matters: 0x28c0000, which contains .data, .got.plt and the beginning of .bss. When this page is on node 1, the run is slow, even if all other pages are on node 0. When it’s on node 0, the run is fast, even if all other pages are on node 1.

尝试改二进制，加padding，解决了？？根因是什么？怎么加padding？

We can move the suspicious area by stuffing some padding before it. .tm_clone_table seems like a good enough place to do that. We can add an array in .tm_clone_table somewhere in ScyllaDB and recompile it. (By the way, note that our hacked-in mbind API writes something to this array to prevent it from being optimized out. If it wasn’t used, the linker would discard it because ScyllaDB is compiled with -fdata-sections).

Let’s try to pad .got.plt to a page boundary to test this hack.

既然找到问题，就gdb抓堆栈

sudo gdb -p (pgrep -x scylla)
(gdb) watch *0x28d0000
(gdb) watch *0x28d0008
(gdb) watch *0x28d0010
(gdb) watch *0x28d0018
(gdb) continue

击中之后看一下符号

(gdb) info symbol 0x28d0000

修复

       node_head_ptr& operator=(node_head* v) noexcept {
            _v = v;
 -          if (_v != nullptr) {
            // Checking (_v != &nil_root) is not needed for correctness, since
            // nil_root's _backref is never read anyway. But we do this check for
            // performance reasons: since nil_root is shared between shards,
            // writing to it would cause serious cache contention.
 +          if (_v != nullptr && _v != &nil_root) {
                _v->_backref = this;
            }
            return *this;

这个查问题的方式，后半部分，已经超出我的知识范围了。我只能说牛逼。

Building a Simple DB in Rust - Part 2 - Basic Execution

可以当个玩具玩玩