C++ 中文周刊 2024-07-20 第164期

#include <benchmark/benchmark.h>
#include <vector>
#include <algorithm>

constexpr int kSize = 10000000;  
std::vector<int> data(kSize);
std::vector<int> indices(kSize);

static void BM_CacheCold(benchmark::State& state) {
  // Generate random indices
  for(auto& index : indices) {
    index = rand() % kSize;
  }
  for (auto _ : state) {
    int sum = 0;
    // Access data in random order
    for (int i = 0; i < kSize; ++i) {
      benchmark::DoNotOptimize(sum += data[indices[i]]);
    }
    benchmark::ClobberMemory();
  }
}

static void BM_CacheWarm(benchmark::State& state) {
  // Warm cache by accessing data in sequential order
  int sum_warm = 0;
  for (int i = 0; i < kSize; ++i) {
    benchmark::DoNotOptimize(sum_warm += data[i]);
  }
  benchmark::ClobberMemory();
 
  // Run the benchmark
  for (auto _ : state) {
    int sum = 0;
    // Access data in sequential order again
    for (int i = 0; i < kSize; ++i) {
      benchmark::DoNotOptimize(sum += data[i]);
    }
    benchmark::ClobberMemory();
  }
}

测试数据直接快十倍，之前也讲过类似的场景

利用模版和constexpr 这个就不多说了
循环展开这个也不说了
区分快慢路径
减少错误分支
prefetch

一个例子

#include <benchmark/benchmark.h>
#include <vector>

// Function without __builtin_prefetch
void NoPrefetch(benchmark::State& state) {
  // Create a large vector to iterate over
  std::vector<int> data(state.range(0), 1);
  for (auto _ : state) {
    long sum = 0;
    for (const auto& i : data) {
      sum += i;
    }
    // Prevent compiler optimization to discard the sum
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(NoPrefetch)->Arg(1<<20); // Run with 1MB of data (2^20 integers)


// Function with __builtin_prefetch
void WithPrefetch(benchmark::State& state) {
  // Create a large vector to iterate over
  std::vector<int> data(state.range(0), 1);
  for (auto _ : state) {
    long sum = 0;
    int prefetch_distance = 10;
    for (int i = 0; i < data.size(); i++) {
      if (i + prefetch_distance < data.size()) {
    	__builtin_prefetch(&data[i + prefetch_distance], 0, 3);
      }
      sum += data[i];
    }
    // Prevent compiler optimization to discard the sum
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(WithPrefetch)->Arg(1<<20); // Run with 1MB of data (2^20 integers)

BENCHMARK_MAIN();

论文中快30%，当然编译器可以向量化的吧，不用手动展开吧

有符号无符号整数比较，慢，避免
float double混用慢，避免
SSE加速
mutex替换成atomic (这个还是取决于应用场景)
bypass

还有其他模块介绍就不谈了，比较偏HFT

文章

C++ Error Handling Strategies – Benchmarks and Performance

浅析Cpp 错误处理

最近不约而同有两个关于错误处理的压测

第一个文章没有体验出正确路径错误路径不同压力的表现，只测了错误路径，因此没啥代表价值。只是浅显的说了异常代价大，谁还不知道这个

问题是什么情况用异常合适？异常不是你期待的东西，如果你的错误必须处理，那就不叫异常

另外第二篇文章是群友写的，给了个50%失败错误路径的测试，结果符合直觉

结论: 异常在happy path出现的路径下收益高(错误出现非常少)

当然已经有提案说过这个事情

我相信大多数人都没看过这个，是的我之前也没有看过

这个话题可以展开讲一下，这里标记一个TODO

When `__cxa_pure_virtual` is just a different flavor of SEGFAULT

有时候可能会遇到这种打印挂掉pure virtual method called

一个简单的复现代码

class Base {
public:
    Base() {fn();} // thinking it would be calling the Derived's `fn()`
    // the same happens with dtor
    // virtual ~Base() {fn();}
    virtual void fn() = 0;
};

class Derived : public Base{
public:
    void fn() {}
};

int main() {
    Derived d;
    return 0;
}

简单来说基类构造的时候子类还没构造，fn没绑定，还是纯虚函数，就会挂

不要这么写，不要在构造函数中调用虚函数

What’s the point of std::monostate? You can’t do anything with it!

就是空类型，帮助挡刀的

比如这个场景

struct Widget {
    // no default constructor
    Widget(std::string const& id);
};

struct PortListener {
    // default constructor has unwanted side effects
    PortListener(int port = 80);
};

std::variant<Widget, PortListener> thingie; // can't do this

我们想让Widget当第一个，但是Widget没有默认构造，PortListener放第一个又破坏可读性，对应关系乱了

怎么办，monostate出场

std::variant<std::monostate, Widget, PortListener> thingie;

帮Widget挡一下编译问题

顺便一提，monostate的hash

  result_type operator()(const argument_type&) const {
    return 66740831; // return a fundamentally attractive random value.
  }

开源项目介绍

asteria 一个脚本语言，可嵌入，长期找人，希望胖友们帮帮忙，也可以加群753302367和作者对线
https://github.com/Psteven5/CFLAG.h/tree/main 一个类似GFLAGS的库，但是功能非常简单，口述一下原理

示例代码

int main(int argc, char **argv) {
  char const *i = NULL, *o = NULL;
  bool h = false;
  CFLAGS(i, o, h);
  printf("i=%s, o=%s, h=%s\n", i, o, h ? "true" : "false");
}
// ./main -i=main.js -h -o trash

主要原理就是利用c11的_Generic

介绍一下generic用法

#include <math.h>
#include <stdio.h>
 
// Possible implementation of the tgmath.h macro cbrt
#define cbrt(X) _Generic((X), \
              long double: cbrtl, \
                  default: cbrt,  \
                    float: cbrtf  \
              )(X)
 
int main(void) {
    double x = 8.0;
    const float y = 3.375;
    printf("cbrt(8.0) = %f\n", cbrt(x)); // selects the default cbrt
    printf("cbrtf(3.375) = %f\n", cbrt(y)); // converts const float to float,
                                            // then selects cbrtf
}

看懂了吧，_Generic根据输入能生成自定义的语句，上面的例子根据X生成对应的函数替换

能换函数，那肯定也能换字符串，这个关键字能玩的很花哨

回到我们这个flags，和Gflags差不多，我们怎么实现？

我们考虑一个最简单的场景 CFLAGS(i)，应该展开成解析arg遍历匹配字符串i并讲对应的值赋值给i，这个赋值得通过格式化字符串复制

遍历arg好实现，通过argc argv遍历就行，i字符串话也简单 #，把argv的值赋值, sscanf，格式化字符串哪里来？generic

大概思路已经有了，怎么实现大家看代码吧

互动环节

周末看了抓娃娃，和西虹市首富差不多，结尾马马虎虎。还算好笑

好久没去电影院的椅子全变成带按摩的了，离谱，被赚了20

C++ 中文周刊 164期补充

昨天更新关于HFT的内容有误

cache warm 效果有限，加热icache缺乏其他优化验证修复，比如pgo，比如调大tlb。当然cache warm对于可以观测数据集预估业务的场景来说，简单粗暴，不过对于优化而言，很难说问题的根因在哪里，PGO应该是最直观的，cache warm给人一种野路子歪打正着的感觉，需要进一步分析。对于不可预估后端场景，cache warm就相当于CPU做无用功了，一定要测试，测试，测试
其他例子，比如prefetch等，例子粗糙，缺少系统视角，如果缺乏这个知识需要科普，看这个小册子，反而可能造成误导

需要系统了解可以看现代cpu性能分析与优化有中文版本

英文版 https://book.easyperf.net/perf_book

中文版本可能比较旧，但对于科普系统学习知识也足够，京东77应该是涨价了，我买的时候是50

公开课可以学一下mit 6.172 b站有视频 ppt可以这里下载 https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/

实际上优化相关知识广，碎，杂，需要系统整体视角

本文感谢崔博武 Anien 指正

上一期

看到这里或许你有建议或者疑问或者指出错误，请留言评论! 多谢! 你的评论非常重要！也可以帮忙点赞收藏转发！多谢支持！

觉得写的不错那就给点吧, 在线乞讨

This site is open source. Improve this page.