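C++ Chinese Weekly, Issue 101

Weekly project repository

WeChat official account

RSS https://github.com/wanghenshui/cppweeklynews/releases.atom

Submissions are welcome: recommend articles, software, or resources, including your own.

Submit an issue

I have been busy lately, so I did not read everything closely this week.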

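News

The perf book we have discussed many times now has a Chinese edition, titled 现代CPU性能分析与优化 (Performance Analysis and Tuning on Modern CPUs).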

https://item.m.jd.com/product/10068178465763.html

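I am not trying to sell anything here. If you are comfortable with English, you can get the book for free here:

https://book.easyperf.net/perf_book Fill in your email address and it will be mailed to you.

If not, buying the Chinese edition is also a way to support it. I do think the new release is a bit expensive, though. There is usually a reading-festival promotion with discounts around the end of March, so you could wait until then.

Also, if any publisher would like to sponsor two copies, I will raffle them off. If not, I will buy them myself at the end of March for the raffle.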


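A call to action: Think seriously about “safety”; then do something sensible about it

In response to the NSA's claim that C++ is unsafe, BS (Bjarne Stroustrup) got rattled and is discussing improvement measures with everyone.

There is also a heated argument about it here: "Why does the creator of C++ say that memory-safe languages such as Rust are no safer than C++?" https://www.zhihu.com/question/584122632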

C++23 “Pandemic Edition” is complete (Trip report: Winter ISO C++ standards meeting, Issaquah, WA, USA)

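Herb Sutter posted his trip report from the meeting.

The AMD RDNA™ 3 Instruction Set Architecture (ISA) reference guide is now available

Worth a look if you are using one of the new GPUs.

For the latest compiler news, follow the hellogcc WeChat account. This week's update: 2023-02-15, issue 189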

Articles

Facilities around enum: std::is_enum, std::underlying_type, std::is_scoped_enum, std::to_underlying

An Unreal Engine (UE) tutorial

std::initializer_list<int> wrong() { // for illustration only!
    return { 1, 2, 3, 4};
}
int main() {
    std::initializer_list<int> x = wrong();
}

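A nasty pitfall of std::initializer_list: the backing array does not outlive the function, so the returned list dangles. Do not write this. No worries, the compiler will warn you. You say you do not read warnings?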
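A minimal sketch of one way around the dangling list, assuming the caller only needs the values: return an owning container instead. The names `make_values`, `make_values_fixed`, and `sum_values` are hypothetical and do not come from the linked article.

```cpp
#include <array>
#include <numeric>
#include <vector>

// Owning container: the elements are copied into the vector,
// so they stay valid after the function returns.
std::vector<int> make_values() {
    return {1, 2, 3, 4};
}

// Fixed-size alternative that avoids a heap allocation.
std::array<int, 4> make_values_fixed() {
    return {1, 2, 3, 4};
}

// Example consumer: sums whatever the factory returned.
int sum_values() {
    const std::vector<int> v = make_values();
    return std::accumulate(v.begin(), v.end(), 0);
}
```

std::initializer_list is best kept as a parameter type; for return values an owning type is the safer choice.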

Computing how many bytes a Latin-1 string takes once converted to UTF-8

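#include <stddef.h>
#include <stdint.h>

// Every input byte with the high bit set needs one extra byte in UTF-8.
size_t scalar_utf8_length(const uint8_t * c, size_t len) {
  size_t answer = 0;
  for(size_t i = 0; i<len; i++) {
    if((c[i]>>7)) { answer++;}
  }
  return answer + len;
}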

Obviously, this can be vectorized with SIMD

size_t avx2_utf8_length_basic(const uint8_t *str, size_t len) {
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i;
  for (i = 0; i + sizeof(__m256i) <= len; i += 32) {
    __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
    answer += __builtin_popcount(_mm256_movemask_epi8(input));
  }
  return answer + scalar_utf8_length(str + i, len - i);
}

Optimizing it further

size_t avx2_utf8_length_mkl(const uint8_t *str, size_t len) {
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i = 0;
  __m256i four_64bits = _mm256_setzero_si256();
  while (i + sizeof(__m256i) <= len) {
    __m256i runner = _mm256_setzero_si256();
    size_t iterations = (len - i) / sizeof(__m256i);
    if (iterations > 255) { iterations = 255; }
    size_t max_i = i + iterations * sizeof(__m256i) - sizeof(__m256i);
    for (; i <= max_i; i += sizeof(__m256i)) {
      __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
      runner = _mm256_sub_epi8(
        runner, _mm256_cmpgt_epi8(_mm256_setzero_si256(), input));
    }
    four_64bits = _mm256_add_epi64(four_64bits, 
      _mm256_sad_epu8(runner, _mm256_setzero_si256()));
  }
  answer += _mm256_extract_epi64(four_64bits, 0) +
    _mm256_extract_epi64(four_64bits, 1) +
    _mm256_extract_epi64(four_64bits, 2) +
    _mm256_extract_epi64(four_64bits, 3);
  return answer + scalar_utf8_length(str + i, len - i);
}

The code is here: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/02/16 I can no longer follow it.

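Helps you catch security issues

An LLVM change. I do not understand it well enough to comment.

Another LLVM change. I did not read it closely, but if you use libc++ you get this optimization for free. ClickHouse is already using it.

Hands-on experience with modules

These are concepts you need to master: virtual addresses, page tables, and so on.

A step-by-step guide to rendering an airplane

Having finished his UBSan introduction, MaskRay has now written this one too. He just keeps writing.

Clever as you are, you have surely worked out how to reconstruct the stack. I cannot read the assembly; you are smarter than I am.

A paper: implementing a checker.

Clever as you are, you surely thought of pthread_cancel and SIG_CANCEL. But how would you actually implement it?

Some lessons from reviewing C code

For example, assert: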

#define ASSERT(c) if (!(c)) __builtin_trap()
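One caveat with the bare `if` form is the dangling `else`: used inside an unbraced `if`/`else`, the macro's hidden `if` captures the `else`. A minimal sketch of the usual hardening, assuming GCC or Clang (`__builtin_trap` is a compiler builtin, not standard C or C++). The `classify` function is hypothetical, and the wrapper is my suggestion rather than something taken from the article.

```cpp
// Assumes GCC or Clang: __builtin_trap() is a compiler builtin.
// The do/while(0) wrapper makes the macro behave as a single statement.
#define ASSERT(c) do { if (!(c)) __builtin_trap(); } while (0)

// With the bare `if` form of the macro, the `else` below would bind to the
// macro's hidden `if` rather than to `if (check)`.
int classify(bool check, int value) {
    if (check)
        ASSERT(value >= 0);
    else
        return -1;  // reached only when check is false
    return value;
}
```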

Another example:

char *s = ...;
if (isdigit(s[0] & 255)) {
    ...
}

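Why not pass the char to isdigit directly? Because char may be signed, and passing isdigit a negative value other than EOF is undefined behavior.

Or just use this instead: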

_Bool xisdigit(char c)
{
    return c>='0' && c<='9';
}
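The `& 255` mask shown earlier keeps the argument non-negative. In C++ the more common spelling is a cast to `unsigned char`. A small sketch of both options; the function names are hypothetical:

```cpp
#include <cctype>

// Safe wrapper: convert to unsigned char first, so the argument passed to
// std::isdigit is always in the range 0..UCHAR_MAX.
bool is_digit_safe(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Plain range check that never touches <cctype>, same logic as xisdigit.
bool is_ascii_digit(char c) {
    return c >= '0' && c <= '9';
}
```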

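It also covers setjmp and longjmp, signals, atomics, and so on, none of it in much detail. In short: be careful.

Lists a few areas where other languages optimize better, as arguments for replacing C++. Nothing much to add. Sure, they all can.

Submitted by @wu-hanqing. We also mentioned this back in issue 94: shared_ptr has an aliasing constructor. Do not use it; it is full of traps. Everyone is encouraged to submit, otherwise it looks like I am on a single-player internet.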
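A short sketch of what the aliasing constructor does and one of its traps, using a hypothetical `Widget` type. The submitted article may focus on a different trap. This one follows directly from the standard semantics: the stored pointer and the owned object are independent.

```cpp
#include <memory>

struct Widget {
    int id = 7;
};

// Intended use: share ownership of the whole Widget while pointing at a member.
// Precondition: w is not null.
std::shared_ptr<int> member_alias(const std::shared_ptr<Widget>& w) {
    return std::shared_ptr<int>(w, &w->id);
}

// Trap: aliasing an empty shared_ptr yields a non-null pointer that owns
// nothing, so it tests true but does not keep anything alive.
std::shared_ptr<int> alias_of_empty(int* raw) {
    std::shared_ptr<Widget> empty;
    return std::shared_ptr<int>(empty, raw);
}
```

The result of `alias_of_empty` compares non-null yet has `use_count() == 0`, so code that assumes "non-null means owned" breaks.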

template<class T>
concept foo_like = requires(T t) { t.foo; };

template<auto Concept>
struct foo {
  auto fn(auto f) {
    static_assert(requires { Concept(f); });
  }
};

int main() {
  foo<[](foo_like auto){}> f{};

  struct { int foo{}; } foo;
  struct { } bar;

  f.fn(foo); // ok
  f.fn(bar); // error: constraint not satisfied
}

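Note the nested use of the concept here: it is smuggled in through a lambda.

I would treat this as a curiosity. Nobody should write code like this in practice; it is too much of a hack.

Still do not get it? Read it again.

This debugging investigation is really, really, really impressive.

First, perf: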

sudo perf stat -C8 --timeout 10000

Flame graph:

git clone https://github.com/brendangregg/FlameGraph
git -C FlameGraph remote add adamnovak https://github.com/adamnovak/FlameGraph
git -C FlameGraph fetch adamnovak
git -C FlameGraph cherry-pick 7ff8d4c6b1f7c4165254ad8ae262f82668c0c13b # C++ template display fix
 
x=remote
sudo timeout 10 perf record --call-graph=fp -C8 -o $x.data
sudo perf script -i $x.data > $x.perf
FlameGraph/stackcollapse-perf.pl $x.perf > $x.folded
FlameGraph/flamegraph.pl $x.folded > $x.svg

This points to compact_radix_tree::tree::get_at() and database::apply() as the suspects.

sudo perf annotate -i $x.data

The code has been located, but why is it slow?

Check the hardware events:

sudo perf stat --timeout 1000000 -C8 ...events... -x\t 2>&1 | sed 's/<not counted>/0/g'

Events worth attention:

CPU_CYCLES, obviously, because we were doing the measurement for the same amount of time in both cases.
LDREX_SPEC “exclusive operation speculatively executed” — but since it happens only 1,000 times per second, it can’t possibly be the cause.
EXC_UNDEF “number of undefined exceptions taken locally” — I don’t even know what this means, but it doesn’t seem like a reasonable bottleneck.
STALL_BACKEND only supports our suspicion that the CPU is bottlenecked on memory somehow.
REMOTE_ACCESS

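REMOTE_ACCESS is clearly way off. Seastar already pins each shard to a core, so where do remote (cross-NUMA-node) accesses come from?

Could the program's own static data be sitting on a different NUMA node?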

sudo cat /proc/$(pgrep -x scylla)/numa_maps
N0=x N1=y means that x pages in the address range are allocated on node 0 and y pages are allocated on node 1. By cross-referencing readelf --headers /opt/scylladb/libexec/scylla we can determine that .text, .rodata and other read-only sections are on node 0, while .data, .bss and other writable sections are on node 1.


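So these sections are not on the same NUMA node? That should not happen.

Forcing the memory binding confirms that this is indeed the problem, by prepending /usr/bin/numactl --membind 1 to /usr/bin/scylla scylla_args…:

Using mbind to dig into why, it turns out a single page has a sharing problem, in other words cache line thrashing.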

Using this ability, we discover that only one page matters: 0x28c0000, which contains .data, .got.plt and the beginning of .bss. When this page is on node 1, the run is slow, even if all other pages are on node 0. When it’s on node 0, the run is fast, even if all other pages are on node 1.
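As background on why a single shared location can hurt this much, here is a generic sketch of cache-line contention and the usual padding remedy. It is not ScyllaDB code and not the fix used in the article. The struct names are hypothetical, and the 64-byte line size is an assumption that holds on typical x86-64 and many ARM cores.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Packed layout: counters for different cores can share one cache line,
// so a write by one core invalidates that line for the others.
struct PackedCounters {
    std::uint64_t per_core[4];
};

// Padded layout: each counter starts on its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

struct PaddedCounters {
    PaddedCounter per_core[4];
};

static_assert(sizeof(PackedCounters) == 32, "four counters fit in one line");
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");
```

On a NUMA machine the cost is amplified, because the contended line may have to travel between sockets.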

They tried modifying the binary by adding padding, and that fixed it? What is the root cause, and how do you add the padding?

We can move the suspicious area by stuffing some padding before it. .tm_clone_table seems like a good enough place to do that. We can add an array in .tm_clone_table somewhere in ScyllaDB and recompile it. (By the way, note that our hacked-in mbind API writes something to this array to prevent it from being optimized out. If it wasn’t used, the linker would discard it because ScyllaDB is compiled with -fdata-sections).

Let’s try to pad .got.plt to a page boundary to test this hack.

With the problem area located, use gdb watchpoints to catch the stack:

sudo gdb -p $(pgrep -x scylla)
(gdb) watch *0x28d0000
(gdb) watch *0x28d0008
(gdb) watch *0x28d0010
(gdb) watch *0x28d0018
(gdb) continue

After a watchpoint hits, look up the symbol:

(gdb) info symbol 0x28d0000

The fix:

       node_head_ptr& operator=(node_head* v) noexcept {
            _v = v;
 -          if (_v != nullptr) {
 +          // Checking (_v != &nil_root) is not needed for correctness, since
 +          // nil_root's _backref is never read anyway. But we do this check for
 +          // performance reasons: since nil_root is shared between shards,
 +          // writing to it would cause serious cache contention.
 +          if (_v != nullptr && _v != &nil_root) {
                _v->_backref = this;
            }
            return *this;

The second half of this investigation is beyond my knowledge. All I can say is: awesome.
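The pattern behind the fix can be shown in isolation: never write to an object shared by all threads when the written value is never read. A self-contained sketch with hypothetical names (`Node`, `Slot`, `g_sentinel`); it mirrors the shape of the patch and is not ScyllaDB's actual code.

```cpp
struct Slot;

struct Node {
    Slot* backref = nullptr;  // which slot currently points at this node
};

// Hypothetical sentinel shared by every thread (plays the role of nil_root).
static Node g_sentinel;

struct Slot {
    Node* target = nullptr;

    // Returns true when the back-reference was actually written.
    bool assign(Node* n) noexcept {
        target = n;
        // Skip the shared sentinel: its backref is never read, and writing
        // to it would bounce its cache line between cores.
        if (n != nullptr && n != &g_sentinel) {
            n->backref = this;
            return true;
        }
        return false;
    }
};
```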

%:include <cstdlib>
%:include <iostream>

struct A <%
    A(int bitand a) : a(a) {}
    int bitand a;
%>;


int main(int argc, char**argv)
<%
    if(argc not_eq 2) <% return 1;%>

    int n = std::atoi(argv<:1:>);
    A a(n);

    auto func = <:bitand:>(A a)<%
        std::cout << a.a << std::endl;
    %>;

    func(a);
    return 0;
%>

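Digraphs and alternative tokens like these are still valid C++. It was trigraphs that were removed, in C++17. Either way, this is baggage inherited from C and nobody should write code this way.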
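For anyone decoding the snippet, here is a side-by-side sketch with the same function written twice, once with conventional tokens and once with digraphs and alternative tokens. The function names are hypothetical.

```cpp
// Conventional spelling.
int masked_second(int a, int b) {
    int values[2] = {a, b};
    int& second = values[1];
    if (a != b) { return second & 0xF; }
    return 0;
}

// Same function using digraphs and alternative tokens:
//   <% %> are { },  <: :> are [ ],  bitand is &,  not_eq is !=
int masked_second_alt(int a, int b) <%
    int values<:2:> = <%a, b%>;
    int bitand second = values<:1:>;
    if (a not_eq b) <% return second bitand 0xF; %>
    return 0;
%>
```

Both functions behave the same; the replacement happens at the token level, which is why `bitand` can stand in for both the reference declarator and the bitwise operator.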

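Raymond Chen with another classic WinAPI explainer. I do not understand it, so I will not ramble.

Videos

Nothing much to say.

The slides alone already look great. The video has not been posted yet.

The slides are here: https://meetingcpp.com/mcpp/slides/2022/Basic%20usage%20of%20PMRs%20for%20better%20performance8308.pdf

It covers both the use cases and how to use them.

Talks by the great AA are always interesting. I have not watched this one, though. We had a production incident recently and I have been busy writing the postmortem.

Open source projects looking for contributors

New projects / release updates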


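Permalink to this issue

If you have questions or comments, it is best to post them in the comment section at the link above, which makes them searchable. WeChat official accounts are rather closed, and Zhihu swallows comments.

If you have read this far and have suggestions, questions, or corrections, please leave a comment! Thank you, your comments matter a lot! Likes, bookmarks, and shares help too. If you think this is well written, feel free to tip: online begging, WeChat transfer.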