C++ Chinese Weekly (C++ 中文周刊), 2026-03-15, Issue 197

QQ group 753792291: questions are answered there

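Submissions are welcome: recommend (or self-nominate) articles, software, or other resources by leaving a comment

This issue has no sponsor

Last issue spent too much space on UB and readers in the group protested loudly, so future issues will go easier on the UB showcases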

News

Standards committee updates and IDE/compiler news are collected here

Performance Weekly

Articles

C++26: The Oxford variadic comma

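C++26 deprecates ellipsis parameters that are not preceded by a comma (P3176R1). The name borrows from the English "Oxford comma", the comma before the final "and" in a list; here the comma is required before the ellipsis

Status quo: C++ currently accepts two spellings:

void foo(int, ...);  // with a comma (C-compatible)
void foo(int...);    // without a comma (C++ only)

The comma-less form dates back to pre-standard C++; C has never allowed the comma to be omitted

Source of confusion: C++11 introduced parameter packs, so T... means completely different things depending on context: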

template<class Ts>
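void f(Ts...); // Careful: not a parameter pack, but a parameter of type Ts followed by an ellipsis parameter

template<class... Ts>
void f(Ts... args);  // this one is a parameter pack

The abbreviated function template spellings are even wilder:

void g(auto... args);    // variadic function template
void g(auto args...);    // non-variadic + ellipsis parameter

And then the ultimate chaos, six dots:

void h(auto......);  // equivalent to (auto..., ...)

C++26's approach: the comma-less form is deprecated, not removed. Adding the comma is all it takes, and tooling can automate the fix

The deprecation also frees up syntax for future language features such as P1219R2, "Homogeneous variadic function parameters"

TL;DR: from now on, put a comma before the ellipsis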
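To make the two meanings of the ellipsis concrete, here is a minimal sketch that puts a C-style variadic function (written with the comma) next to a parameter-pack version. The helper names sum_ints and sum_pack are made up for illustration and do not come from the article.

```cpp
#include <cstdarg>

// Hypothetical helper: sums `count` int arguments passed through a C-style
// ellipsis parameter. Note the comma before the ellipsis.
int sum_ints(int count, ...) {
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += va_arg(args, int);  // the caller must really pass ints
    }
    va_end(args);
    return total;
}

// Hypothetical pack-based counterpart: here the ellipsis declares and
// expands a parameter pack, so the argument types are known to the compiler.
template <class... Ts>
int sum_pack(Ts... values) {
    return (0 + ... + values);  // C++17 fold expression
}
```

Both sum_ints(3, 1, 2, 3) and sum_pack(1, 2, 3) add up the same three numbers; only the second one is type-checked.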

C++26: std::is_within_lifetime

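C++26 adds bool std::is_within_lifetime(const T* p) to <type_traits>. It checks at compile time whether the object a pointer refers to is within its lifetime; the most common use is checking which member of a union is currently active

union Storage {
  int i;
  double d;
};

// consteval rather than constexpr: is_within_lifetime is itself consteval,
// and &s.i is only usable inside an immediate function context
consteval bool check_active_member() {
  Storage s;
  s.i = 42;
  return std::is_within_lifetime(&s.i);  // true
}

Design details:

It is consteval: compile time only, it cannot be called at run time
It takes a pointer rather than a reference, sidestepping the headaches of temporaries and lifetime extension
It is named is_within_lifetime rather than is_union_member_active because the committee chose to solve the problem at a lower level

Core motivation: implementing a space-optimal Optional<bool>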

struct OptBool {
  union { bool b; char c; };

  constexpr auto has_value() const -> bool {
    if consteval {
      return std::is_within_lifetime(&b);
    } else {
      return c != 2;  // sentinel value
    }
  }

  constexpr bool value() const {
    return b;
  }
};

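At compile time is_within_lifetime checks the active member; at run time the sentinel value does the job. Best of both worlds

As of February 2026 the mainstream compilers do not support it yet, so we wait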

Understanding std::shared_mutex from C++17

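An introductory piece on std::shared_mutex. It is a reader-writer lock, meant for read-heavy, write-light workloads

Problem: std::mutex makes every access mutually exclusive, so even threads that only read have to queue up

Solution: std::shared_mutex supports two locking modes:

shared ownership: held by many threads at once (reads)
exclusive ownership: held by a single thread (writes)

The code change is small:

#include <mutex>
#include <shared_mutex>

class Counter {
public:
    int get() const {
        std::shared_lock lock(mutex_);  // shared lock
        return value_;
    }

    void increment() {
        std::unique_lock lock(mutex_);  // exclusive lock
        ++value_;
    }

private:
    mutable std::shared_mutex mutex_;
    int value_{0};
};

Benchmark result: in a read-heavy scenario, std::mutex took 285ms while std::shared_mutex took only 102ms

Common pitfalls:

No recursive locking: it is UB
No upgrading from a shared lock to a unique lock: it deadlocks
When writes are frequent or contention is low, std::mutex may actually be faster

As always, measure before you choose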
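Because a shared lock cannot be upgraded in place, the usual workaround is to release it, take the exclusive lock, and check the condition again. The sketch below shows that pattern on a made-up read-mostly cache; SquareCache and its members are hypothetical and not from the article.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>

// Hypothetical read-mostly cache of squares, used only for illustration.
class SquareCache {
public:
    int get_or_compute(int key) {
        {
            std::shared_lock read_lock(mutex_);  // many readers may hold this
            auto it = cache_.find(key);
            if (it != cache_.end()) {
                return it->second;
            }
        }  // shared lock released here; it cannot be upgraded in place

        std::unique_lock write_lock(mutex_);  // exclusive
        // Re-check: another thread may have inserted the key in the gap
        // between releasing the shared lock and acquiring the unique lock.
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++computations_;
            it = cache_.emplace(key, key * key).first;
        }
        return it->second;
    }

    int computations() const {
        std::shared_lock read_lock(mutex_);
        return computations_;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<int, int> cache_;
    int computations_{0};
};
```

The second lookup under the exclusive lock is what keeps the value from being computed twice when two threads miss at the same time.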

How do compilers ensure that large stack allocations do not skip over the guard page?

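Raymond Chen is telling Windows stack stories again

The system keeps a guard page at the end of the committed stack. Touching it commits that page and creates a new guard page below it. But if a function's local variables are too large (more than one page, typically 4KB), an access can skip right over the guard page, land in the reserved region, and crash

Solution: when the stack pointer has to move by more than a page, the compiler inserts a call to the _chkstk helper. It touches every required page in order, one page at a time, so the guard page mechanism keeps working

WinAPI again, but a little low-level knowledge never hurts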

Prefix sums at tens of gigabytes per second with ARM NEON

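Daniel Lemire shows how to speed up prefix sums with ARM NEON SIMD instructions

Scalar version: a simple loop in which each element depends on the previous one, so the theoretical ceiling is CPU frequency × 1 entry/cycle. On a 4GHz processor that is about 4 billion entries per second (about 16 GB/s for 32-bit values)

Naive SIMD: a prefix sum over 4 elements takes 2 shifts + 2 adds, and once carry propagation is added it is no faster than scalar:

input   = [A B   C     D]
shift1  = [0 A   B     C]
sum1    = [A A+B B+C   C+D]
shift2  = [0 0   A     A+B]
result  = [A A+B A+B+C A+B+C+D]

4 sequential instructions for 4 values, no better than the scalar loop's one instruction per value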
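The shift-and-add scheme above can be checked without NEON hardware. The following sketch emulates a 4-lane vector with std::array; Vec4, shift_up, add, and prefix_sum4 are illustrative names, not intrinsics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<std::uint32_t, 4>;

// Moves every lane `n` positions toward higher indices, filling with zeros
// (the role played by the "shift" rows in the diagram).
Vec4 shift_up(const Vec4& v, std::size_t n) {
    Vec4 out{};  // zero-initialized
    for (std::size_t i = n; i < 4; ++i) {
        out[i] = v[i - n];
    }
    return out;
}

// Lane-wise addition.
Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Inclusive prefix sum of four lanes: two shifts and two adds.
Vec4 prefix_sum4(const Vec4& input) {
    const Vec4 sum1 = add(input, shift_up(input, 1));  // [A, A+B, B+C, C+D]
    return add(sum1, shift_up(sum1, 2));               // [A, A+B, A+B+C, A+B+C+D]
}
```

Each step depends on the previous one, which is why the four operations cannot overlap and the approach gains nothing over the scalar loop.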

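Key optimization: use NEON's interleaved load/store (vld4q_u32) to load 16 values at once, automatically deinterleaved into 4 vectors, compute the prefix sums of the groups in parallel, and then combine them. In theory this is 2x faster than scalar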

original data : ABCD EFGH IJKL MNOP
loaded data   : AEIM BFJN CGKO DHLP

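Full implementation (needs <arm_neon.h>; it processes whole 16-element blocks only, so any remainder needs a scalar tail loop):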

void neon_prefixsum_fast(uint32_t *data, size_t length) {
  uint32x4_t zero = {0, 0, 0, 0};
  uint32x4_t prev = {0, 0, 0, 0};

  for (size_t i = 0; i < length / 16; i++) {
    uint32x4x4_t vals = vld4q_u32(data + 16 * i);

    // Prefix sum inside each transposed ("vertical") lane
    vals.val[1] = vaddq_u32(vals.val[1], vals.val[0]);
    vals.val[2] = vaddq_u32(vals.val[2], vals.val[1]);
    vals.val[3] = vaddq_u32(vals.val[3], vals.val[2]);

    // Now vals.val[3] contains the four local prefix sums:
    //   vals.val[3] = [s0=A+B+C+D, s1=E+F+G+H,
    //                  s2=I+J+K+L, s3=M+N+O+P]

    // Compute prefix sum across the four local sums
    uint32x4_t off = vextq_u32(zero, vals.val[3], 3);
    uint32x4_t ps = vaddq_u32(vals.val[3], off);
    off = vextq_u32(zero, ps, 2);
    ps = vaddq_u32(ps, off);

    // Add the incoming carry from the previous 16-element block
    ps = vaddq_u32(ps, prev);

    // The add vector to apply to the original lanes:
    // [carry-in, ps[0], ps[1], ps[2]]. It has to be built from the
    // incoming carry, so it is computed before prev is overwritten
    uint32x4_t add = vextq_u32(prev, ps, 3);

    // Prepare carry for next block: broadcast the last lane of ps
    prev = vdupq_laneq_u32(ps, 3);

    // Apply carry/offset to each of the four transposed lanes
    vals.val[0] = vaddq_u32(vals.val[0], add);
    vals.val[1] = vaddq_u32(vals.val[1], add);
    vals.val[2] = vaddq_u32(vals.val[2], add);
    vals.val[3] = vaddq_u32(vals.val[3], add);

    // Store back the four lanes (interleaved)
    vst4q_u32(data + 16 * i, vals);
  }
}
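For readers without an ARM machine, the blocked algorithm can be emulated with plain arrays and compared against a scalar reference. This is a portable sketch of the same idea, not the NEON code itself; blocked_prefix_sum and scalar_prefix_sum are hypothetical names, and the input length is assumed to be a multiple of 16.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Vec4 = std::array<std::uint32_t, 4>;

// In-place inclusive prefix sum that mimics the transposed 16-element blocks.
// Assumption: data.size() is a multiple of 16 (leftovers are not touched).
void blocked_prefix_sum(std::vector<std::uint32_t>& data) {
    std::uint32_t carry = 0;  // sum of all previous blocks
    for (std::size_t base = 0; base + 16 <= data.size(); base += 16) {
        // "Deinterleaving load": lane[k][j] is element k of group j.
        Vec4 lane[4];
        for (std::size_t j = 0; j < 4; ++j) {
            for (std::size_t k = 0; k < 4; ++k) {
                lane[k][j] = data[base + 4 * j + k];
            }
        }
        // Prefix sum inside each group: three 4-wide adds.
        for (std::size_t k = 1; k < 4; ++k) {
            for (std::size_t j = 0; j < 4; ++j) {
                lane[k][j] += lane[k - 1][j];
            }
        }
        // Exclusive prefix sum over the four group totals, seeded with carry.
        Vec4 offset{};
        std::uint32_t running = carry;
        for (std::size_t j = 0; j < 4; ++j) {
            offset[j] = running;
            running += lane[3][j];
        }
        carry = running;
        // Apply the per-group offset and store back interleaved.
        for (std::size_t j = 0; j < 4; ++j) {
            for (std::size_t k = 0; k < 4; ++k) {
                data[base + 4 * j + k] = lane[k][j] + offset[j];
            }
        }
    }
}

// Scalar reference used for comparison.
std::vector<std::uint32_t> scalar_prefix_sum(std::vector<std::uint32_t> data) {
    std::partial_sum(data.begin(), data.end(), data.begin());
    return data;
}
```

Note that each group is offset by the totals of the groups before it plus the carry coming into the block, never by the block's own total.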

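The core is 8 sequential instructions for 16 values, in theory 2x faster than scalar

Results on an Apple M4:

Method	Billions of values/s
scalar	3.9
naive SIMD	3.6
fast SIMD	8.9

A 2.3x speedup (roughly 35 GB/s for 32-bit values), not bad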

Learning to read C++ compiler errors: Ambiguous overloaded operator

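Raymond Chen on how to read compiler errors. Legacy code started failing in the 32-bit build after C++/WinRT was added: an ambiguous operator<< overload

Root cause: the code has a conditionally compiled __int64 version of operator<<, built only when the target is not 64-bit and the STL is neither 7.0 nor 11.0. After the upgrade to C++17 (required by C++/WinRT) the STL version became 12.0, none of the guard macros matched, and the block was compiled again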

#if !defined(_WIN64) && !defined(_STL70_) && !defined(_STL110_)
// These are already defined in STL
std::ostream& operator<<(std::ostream&, const __int64& );
std::ostream& operator<<(std::ostream&, const unsigned __int64& );
#endif

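Fix: delete the whole block. It is 2026; nobody is going back to C++03

Raymond Chen puts it well: history keeps repeating itself. _STL110_ was added five years ago and now the guard needs patching again. Deleting the block beats adding more duct tape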

Partial Truth vs Explicit Failure: Designing Honest System Responses

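A discussion of an API design trade-off: when part of a system fails, should it return an incomplete "success" response or fail outright?

Key points:

A success response carries an implicit promise of completeness, and downstream consumers take it at face value
When partial failure is dressed up as success, monitoring stays green while correctness quietly degrades
Partial responses are usable, but they must state explicitly what is missing
In optional<bool>, false and std::nullopt are entirely different things: "no" is not the same as "don't know"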
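One way to read the last two points in code: keep "unknown" as its own state and list what is missing next to the data. The sketch below is an assumption-laden illustration; AccountStatus, build_status, and describe are invented names, and the failed backend call is simulated by passing std::nullopt.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical response type: a field that could not be fetched is recorded
// in `missing` instead of silently defaulting to false.
struct AccountStatus {
    std::optional<bool> is_verified;   // true / false / unknown
    std::vector<std::string> missing;  // names of fields we could not fill

    bool complete() const { return missing.empty(); }
};

// `verification_lookup` stands in for a backend call; nullopt means it failed.
AccountStatus build_status(std::optional<bool> verification_lookup) {
    AccountStatus status;
    status.is_verified = verification_lookup;
    if (!verification_lookup.has_value()) {
        status.missing.push_back("is_verified");
    }
    return status;
}

// Three distinct answers, never two.
std::string describe(const AccountStatus& status) {
    if (!status.is_verified.has_value()) {
        return "unknown";
    }
    return *status.is_verified ? "yes" : "no";
}
```

A caller that only needs a best-effort view can still use the partial result, while a caller that needs certainty can test complete() and fail loudly.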

Is resize + assign faster than reserve + emplace_back for vector?

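Jens Weller benchmarked two vector-filling strategies on quick-bench

With clang 17:

size_t: resize + assign is 1.1x faster
A struct of 4 size_t: reserve + emplace_back is 1.2x faster

There is a tipping point: the conclusion flips with the size of the element type. In the tests that let the vector grow on its own, the gap is smaller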

Follow up: resize + assign is often faster than reserve + emplace_back for vector

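A follow-up to the previous article. With GCC (13.2), resize + assign wins consistently:

size_t: 1.9x faster
struct: 1.6x faster

clang + libc++ behaves similarly, with size_t up to 2.1x faster

Conclusion: if you fill vectors in bulk, it is worth measuring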
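For reference, the two strategies being compared look roughly like this. It is a sketch that assumes "assign" means writing through operator[] after resize; the function names are made up, and the benchmark code in the articles may differ in detail.

```cpp
#include <cstddef>
#include <vector>

// Strategy 1: resize, then assign by index.
std::vector<std::size_t> fill_resize_assign(std::size_t n) {
    std::vector<std::size_t> v;
    v.resize(n);  // value-initializes n elements up front
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = i;  // plain store, no capacity check in the loop
    }
    return v;
}

// Strategy 2: reserve, then emplace_back.
std::vector<std::size_t> fill_reserve_emplace(std::size_t n) {
    std::vector<std::size_t> v;
    v.reserve(n);  // allocates once, constructs nothing
    for (std::size_t i = 0; i < n; ++i) {
        v.emplace_back(i);  // constructs in place, checks capacity each time
    }
    return v;
}
```

Both produce the same contents; the difference the benchmarks measure is the up-front initialization in the first versus the per-element capacity check in the second.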

Convenience Gone Wrong: A C++ auto Story

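An embedded developer spent two hours debugging a BME280 sensor driver: the I2C data was clearly correct, yet the calibration data always came out empty

The culprit:

auto cal = inst.calibration;  // BUG: this is a copy!

The fix:

- auto cal = inst.calibration;
+ auto& cal = inst.calibration;

With the & missing, auto silently made a copy; the calibration data was written into a local variable and destroyed when the function returned

A commenter made a good point: deleting the copy/move constructors would catch this at compile time. Everyone knows the theory and still falls into the hole. That is the price of convenience: auto saved a few characters and debugging cost two hours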
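The bug is easy to reproduce in isolation. The types below are hypothetical stand-ins for the driver's structures (the real BME280 driver is not shown in the article), and the value 27504 is just a placeholder.

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the driver's types.
struct Calibration {
    std::array<std::uint16_t, 3> temperature{};
};
struct Sensor {
    Calibration calibration;
};

void load_calibration_buggy(Sensor& inst) {
    auto cal = inst.calibration;  // copy: auto never deduces a reference
    cal.temperature[0] = 27504;   // written to the local copy, then lost
}

void load_calibration_fixed(Sensor& inst) {
    auto& cal = inst.calibration;  // reference to the member
    cal.temperature[0] = 27504;    // written to the sensor itself
}

// The commenter's suggestion: a non-copyable type turns the slip into a
// compile error, because `auto cal = obj;` needs the copy constructor.
struct NonCopyableCalibration {
    NonCopyableCalibration() = default;
    NonCopyableCalibration(const NonCopyableCalibration&) = delete;
    NonCopyableCalibration& operator=(const NonCopyableCalibration&) = delete;
    std::array<std::uint16_t, 3> temperature{};
};
static_assert(!std::is_copy_constructible_v<NonCopyableCalibration>);
```

After load_calibration_buggy the sensor's calibration is still all zeros; after load_calibration_fixed it holds the written value.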

Accessing inactive union members through char: the aliasing rule you didn’t know about

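In OptBool's has_value() above, the run-time branch return c != 2; looks like it reads an inactive union member. Shouldn't that be UB?

It is not UB. The C++ standard carves out an exception: a glvalue of type char, unsigned char, or std::byte may legally access the byte representation of any object. char is essentially the language's type for raw bytes

Reading through char c while bool b is active: legal (char may alias any type)
Reading through bool b while char c is active: UB (bool is not on the aliasing exception list)

This is a fairly obscure rule, but you need to know it when working with unions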
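The same aliasing exception is what makes generic "dump the bytes" helpers legal. Here is a small sketch outside of unions; object_bytes is a made-up helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies out the object representation of any object by reading it through
// unsigned char, which the aliasing rules permit for every type.
template <class T>
std::vector<unsigned char> object_bytes(const T& object) {
    const auto* first = reinterpret_cast<const unsigned char*>(&object);
    return std::vector<unsigned char>(first, first + sizeof(T));
}
```

Going the other way, for example reading those bytes back through a bool* or int* that does not point at a real bool or int, is exactly the direction the rule does not cover.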

Behold the power of meta::substitute

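Barry Revzin shows how powerful std::meta::substitute from C++26 reflection is

Core idea: use substitute to plug reflection values into templates and extract to pull values back out, enabling a "functions, not function templates" style of programming

The concrete example is a highlight_print: the format string is parsed at compile time, and at run time the generated function pointer is called directly. The entire parsing engine is an ordinary function rather than a function template: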

consteval auto parse_information(std::span<std::meta::info const> arg_types,
                                 std::string_view sv)
    -> std::meta::info;

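The key substitute/extract/invoke trio:

substitute(^^fmt_types, arg_types): substitutes reflection values into a variable template
extract<T const*>(r): pulls the actual value out of a reflection
Call the extracted function pointer

Finally, a consteval constructor ensures that both the format-string parsing and the function-pointer extraction happen at compile time

The article also discusses how much simpler the implementation could be if C++ had constexpr function parameters and consteval mutable variables (as Zig does)

Highly abstract and worth rereading. The Compiler Explorer demo runs as-is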

C++ Reflection: Another Monad

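After reading Barry's article above, Ben Deane slammed the table: C++26 reflection is a monad!

^^ serves as pure (lifting a type into reflection-land), substitute as fmap (n-ary even, which makes it equivalent to an applicative functor), and extract<std::meta::info> as join: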

template <typename T>
constexpr auto pure = ^^T;

template <template <typename...> typename F>
consteval auto fmap(std::same_as<std::meta::info> auto... as) {
  return substitute(^^F, {as...});
}

consteval auto join(std::meta::info a) {
  return extract<std::meta::info>(a);
}

He also verifies the three monad laws (left identity, right identity, associativity). Compiler Explorer link

So one reason reflection is so powerful: it forms a monad for composing type functions. Let that sink in

Best performance of a C++ singleton

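Andreas Fertig compares the performance of two singleton implementations

Block-local static (the common approach):

static DisplayManager& Instance() {
    static DisplayManager dspm;
    return dspm;
}

If the default constructor can be = default (user-declared but not user-provided): the generated assembly is clean
If the constructor has a body (user-provided): the compiler must insert a guard variable plus __cxa_guard_acquire/__cxa_guard_release, the guard is checked on every access, and the performance cost is noticeable

Static data member approach:

class DisplayManager {
    static DisplayManager mDspm;
    // ...
public:
    static DisplayManager& Instance() { return mDspm; }
};

// Definition, in exactly one translation unit
DisplayManager DisplayManager::mDspm;

With a user-provided constructor no guard variable is needed; initialization goes through _GLOBAL__sub_I_ and the assembly is much simpler

Conclusion: when a user-provided constructor is required, the static data member approach performs better. If the constructor can be = default the two are equivalent, and block-local static is the more convenient choice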
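Both variants side by side, as a compilable sketch with a user-provided constructor. The class names and the name_ member are invented for the example; the generated assembly is not shown here and would have to be inspected separately, for instance in Compiler Explorer.

```cpp
#include <string>

// Variant 1: block-local static. Initialized on first call; with a
// user-provided constructor the compiler emits a guard check.
class LocalStaticManager {
public:
    static LocalStaticManager& Instance() {
        static LocalStaticManager instance;
        return instance;
    }
    const std::string& name() const { return name_; }

private:
    LocalStaticManager() : name_("display") {}  // user-provided constructor
    std::string name_;
};

// Variant 2: static data member. Initialized during dynamic initialization
// before main() runs, so Instance() is a plain address load.
class StaticMemberManager {
public:
    static StaticMemberManager& Instance() { return instance_; }
    const std::string& name() const { return name_; }

private:
    StaticMemberManager() : name_("display") {}  // user-provided constructor
    static StaticMemberManager instance_;
    std::string name_;
};

// Definition, in exactly one translation unit.
StaticMemberManager StaticMemberManager::instance_;
```

One design caveat worth keeping in mind: the static data member is subject to initialization-order issues if another global object uses it during its own construction, which the block-local static avoids.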

Faster asin() Was Hiding In Plain Sight

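The author was tinkering with asin() approximations in a ray tracer. A fourth-order Taylor series came first; its accuracy near the boundaries was poor enough to need a fallback to std::asin(), and it was only 5% faster. A Padé approximant was the next attempt, with no real improvement

But then an LLM was asked: Gemini produced an implementation from the Nvidia Cg Toolkit (derived from Formula 4.4.45 in the 1960s Abramowitz & Stegun handbook). The code is extremely simple, branch-free, and very accurate:

#include <cmath>

constexpr double HalfPi = 1.5707963267948966;

double fast_asin_cg(const double x) {
    constexpr double a0 =  1.5707288;
    constexpr double a1 = -0.2121144;
    constexpr double a2 =  0.0742610;
    constexpr double a3 = -0.0187293;

    // Evaluate the cubic in |x| with Horner's scheme
    const double abs_x = std::fabs(x);
    double p = a3 * abs_x + a2;
    p = p * abs_x + a1;
    p = p * abs_x + a0;

    const double x_diff = std::sqrt(1.0 - abs_x);
    const double result = HalfPi - (x_diff * p);

    // asin is odd, so restore the sign of the input
    return std::copysign(result, x);
}

Benchmark results (Apple M4, GCC 15):

std::asin(): 111 s
Taylor approximation: 105 s (~5%)
Cg approximation: 101 s

The gap is larger on an Intel i7, and under MSVC it is 1.9x faster

Lesson: check first whether someone has already solved the problem. The formula was sitting in the documentation of software discontinued in 2012, and its source is a math handbook from the 1960s. Searching first beats half a day of fancy tinkering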
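Before swapping an approximation like this into real code, it is worth measuring its error against std::asin over the whole input range. A minimal sketch of such a check follows; max_abs_error and kHalfPi are names chosen for this example, and the function body repeats the formula from the article so the block stands alone.

```cpp
#include <algorithm>
#include <cmath>

constexpr double kHalfPi = 1.5707963267948966;

// The Cg / Abramowitz & Stegun 4.4.45 approximation quoted in the article.
double fast_asin_cg(double x) {
    constexpr double a0 =  1.5707288;
    constexpr double a1 = -0.2121144;
    constexpr double a2 =  0.0742610;
    constexpr double a3 = -0.0187293;

    const double abs_x = std::fabs(x);
    double p = a3 * abs_x + a2;
    p = p * abs_x + a1;
    p = p * abs_x + a0;
    return std::copysign(kHalfPi - std::sqrt(1.0 - abs_x) * p, x);
}

// Largest absolute difference from std::asin over `steps` + 1 evenly spaced
// samples in [-1, 1].
double max_abs_error(int steps) {
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        const double x = -1.0 + 2.0 * i / steps;
        worst = std::max(worst, std::fabs(fast_asin_cg(x) - std::asin(x)));
    }
    return worst;
}
```

Whether the measured error is acceptable depends on the application; a ray tracer can usually tolerate far more than a numerical library can.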

Some fixes and improvements in GCC

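Some improvements in GCC 16:

-fcondition-coverage and -fpath-coverage now imply -ftest-coverage, so it no longer has to be added by hand
Fixed a bug where gcov-dump reported wrong offsets
A big speedup for MC/DC analysis: when analyzing multi-condition expressions, the mask search now starts from the left-adjacent operand instead of brute-forcing every candidate. For a single long && chain:

before: 20822.303 ms (41.645 ms per expression)
after:   1288.548 ms ( 2.577 ms per expression)

About a 16x speedup. Algorithms matter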

The hidden compile-time cost of C++26 reflection

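Vittorio Romeo measured how C++26 reflection affects compile times (GCC 16, Fedora 44, i9-13900K)

Key numbers:

The -freflection flag by itself: essentially free (33.2ms → 33.9ms)
#include <meta>: +155ms
Reflecting 1 struct: ~13.1ms extra; 10 structs: ~1.1ms each
Barry's full AoS-to-SoA example: 818.9ms; without <print>: 310.2ms

Reflection itself is not slow; the standard library headers are the bottleneck (<print> alone adds about 508ms). PCH mitigates much of it

Modules currently still trail PCH (PCH is faster for small headers, about even for large ones)

Conclusion: if <meta> + <ranges> becomes the typical usage, every TU pays at least ~310ms of compile time. PCH or modules are practically mandatory

(Note: the first version of the article used a GCC build with assertions enabled; it has since been corrected and the numbers now come from a release build)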

Exploring Mutable Consteval State in C++26

Inspired by Barry's meta::substitute article, friedkeenan used friend injection plus C++26 reflection to implement consteval mutable variables

Core idea: use substitute to trigger friend injection that inserts global state, with each insert mapped to a unique index_t<N> type:

template<typename Tag>
struct consteval_state {
    template<std::size_t>
    struct index_t {
        friend constexpr auto consteval_state_value(index_t) -> auto;
    };

    template<std::size_t Index, auto Value>
    struct store_consteval_state {
        friend constexpr auto consteval_state_value(index_t<Index>) -> auto {
            return Value;
        }
    };
};

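A truly bizarre trick then detects whether a state has already been inserted: it relies on the standard's rule that has_identifier returns false once a function parameter has been redeclared under a different name

The end result is expand_loop, resembling the hypothetical template for (consteval mutable int i = 0; ...) syntax Barry proposed: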

expand_loop([]<auto Loop>() {
    std::printf("LOOP: %zu\n", Loop.index());

    consteval {
        if (Loop.index() >= 3) {
            Loop.push_break();
        }
    }
});

Compiler Explorer demo

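The black magic is off the charts, but it does run. As the author notes, library code like this can build up experience for future language proposals

Videos

Persistence squared: persisting persistent data structures - Juan Pedro Bolívar Puente - Meeting C++

About the immer library, persistent/immutable data structures for C++, and immer::persist, a serialization library built on top of it

Part I: Persistent data structure basics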

#include <immer/vector.hpp>

const auto a = immer::vector<int>{1, 2, 3};
const auto b = a.push_back(4);
const auto c = b.set(0, 42);
assert(a.size() == 3 && b.size() == 4);
assert(a[0] == 1 && c[0] == 42);

Under the hood it is a Radix Balanced Tree (B=5, M=32): a modification copies only the root-to-leaf path and the remaining nodes are structurally shared. push_back and set are both effectively O(1)
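Structural sharing is easier to see on a much simpler structure than immer's tree. The sketch below is a persistent stack built from shared_ptr nodes; it is not how immer is implemented, and PersistentStack with all its members is invented for illustration.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Minimal persistent stack: push and pop return new stacks, and every
// version shares the nodes it has in common with the others.
template <class T>
class PersistentStack {
    struct Node {
        T value;
        std::shared_ptr<const Node> next;
    };
    std::shared_ptr<const Node> head_;
    std::size_t size_ = 0;

    PersistentStack(std::shared_ptr<const Node> head, std::size_t size)
        : head_(std::move(head)), size_(size) {}

public:
    PersistentStack() = default;

    // O(1): allocates one node that points at the existing chain.
    PersistentStack push(T value) const {
        return PersistentStack(
            std::make_shared<const Node>(Node{std::move(value), head_}),
            size_ + 1);
    }

    // O(1): the result simply starts one node further down.
    // Precondition: the stack is not empty.
    PersistentStack pop() const { return PersistentStack(head_->next, size_ - 1); }

    const T& top() const { return head_->value; }
    std::size_t size() const { return size_; }

    // True when both versions start at the very same node.
    bool shares_storage_with(const PersistentStack& other) const {
        return head_ == other.head_;
    }
};
```

Old versions stay valid for as long as someone holds them, and the reference counts free nodes once no version refers to them any more.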

The benefits of value semantics: safe under concurrency and easy to compose:

// value semantics do compose
document complex_operation(document doc) {
    doc.foo = 42;
    doc.bar = 69;
    return doc;
}

auto curr = document{};
auto next = complex_operation(curr);
// curr remains valid, it "persisted"!
// also: comparing vectors is cheap!
if (curr.foo != next.foo) {
    // use Lager cursors for extra convenience
}

HAMT also supports efficient diffing:

const auto a = immer::map<int, std::string>{
    {0, "foo"}, {1, "bar"}};
const auto b = a
    .set(2, "baz")
    .update(1, [](auto x) { return x + x; });

// diff in O(|change|), not O(n)!
immer::diff(a, b,
    [](auto x)          { print("added: {}", x); },
    [](auto x)          { print("removed: {}", x); },
    [](auto x, auto y)  { print("changed: {} -> {}", x, y); });

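Part II: The serialization problem: how can structural sharing be preserved?

Serializing directly with Cereal loses the sharing:

const auto v1 = vector<string>{"a", "b", "c", "d"};
const auto v2 = v1.push_back("e").push_back("f");
const auto v3 = v2;
const auto hist = vector<document>{ {v1}, {v2}, {v3} };

Plain serialization output: the contents of v1 are serialized 3 times (v2 and v3 each carry an independent copy), and the sharing is lost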

{ "value0": [
    { "data": ["a", "b", "c", "d"] },
    { "data": ["a", "b", "c", "d", "e", "f"] },
    { "data": ["a", "b", "c", "d", "e", "f"] }
]}

What immer::persist does: serialize the node pool separately, with each vector storing only IDs into the pool:

using namespace immer::persist;
cereal_save_with_pools<cereal::JSONOutputArchive>(
    std::cout, hist,
    hana_struct_auto_member_name_policy(hist));

The output becomes:

{ "value0": [{ "data": 0 }, { "data": 1 }, { "data": 1 }],
  "pools": { "data": {
    "B": 1, "BL": 1,
    "leaves": [
      [ 1, [ "c", "d" ] ],
      [ 2, [ "a", "b" ] ],
      [ 4, [ "e", "f" ] ]
    ],
    "inners": [
      [ 0, { "children": [ 2 ], "relaxed": false }],
      [ 3, { "children": [ 2, 1 ], "relaxed": false }]
    ],
    "vectors": [
      { "root": 0, "tail": 1 },
      { "root": 3, "tail": 4 }
    ]
  }}
}

v2 and v3 point to the same pool entry (ID=1), and the shared leaf nodes ["a","b"] and ["c","d"] are stored only once
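The pooling idea can be sketched without immer or Cereal: give every distinct shared object an ID the first time it is seen and store only IDs afterwards. The code below deduplicates whole payloads by pointer identity, which is a simplification (immer::persist pools individual tree nodes); all names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

using Payload = std::vector<std::string>;
using SharedPayload = std::shared_ptr<const Payload>;

// Hypothetical archive layout: the pool holds each payload once, and each
// document is reduced to an index into the pool.
struct PooledArchive {
    std::vector<Payload> pool;
    std::vector<std::size_t> entries;
};

PooledArchive save_with_pool(const std::vector<SharedPayload>& documents) {
    PooledArchive archive;
    std::map<const Payload*, std::size_t> ids;  // object identity -> pool ID
    for (const auto& doc : documents) {
        // try_emplace only inserts when this object has not been seen yet.
        auto [it, inserted] = ids.try_emplace(doc.get(), archive.pool.size());
        if (inserted) {
            archive.pool.push_back(*doc);
        }
        archive.entries.push_back(it->second);
    }
    return archive;
}
```

Saving three documents where the last two share one payload yields the entries 0, 1, 1 and a pool of two payloads, mirroring the "data" IDs in the output above.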

Version migration preserves structural sharing too:

namespace v1 { struct document { vector<std::string> data; }; }
namespace v2 { struct document { vector<char> data; }; }

// Define the conversion: string -> char (take the first character)
const auto xform = hana::make_map(
    hana::make_pair(
        hana::type_c<vector<std::string>>,
        [] (std::string v) -> char {
            return v.empty() ? '\0' : v[0];
        }));

// Pool-level transformation: only nodes that actually differ are transformed
const auto p0 = get_output_pools(history);
auto p1 = transform_output_pool(p0, xform);

auto view = history | std::views::transform([&](v1::document x) {
    auto data = convert_container(p0, p1, x.data);
    return v2::document{data};
});

Slides: https://sinusoid.es/talks/cppcon25/

Worth watching, especially if you work on undo, time travel, or concurrent data structures

Practical Reflection With C++26 - Barry Revzin - CppCon 2025

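The companion talk to the reflection articles above. Barry walks through implementing a Struct-of-Arrays (SoA) container with C++26 reflection, automatically generating an SoA layout for any aggregate type

Say you have: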

struct Particle {
    float x, y, z;
    float vx, vy, vz;
    int type;
};

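// Traditional AoS: vector<Particle>; memory layout is x,y,z,vx,vy,vz,type,x,y,z,...
// SoA: one vector per field; iterating a single field makes full use of every cache line

Reflection generates an SoA<Particle> container automatically. The rough idea, as pseudocode:

template <typename T>
struct SoA {
    // Pseudocode: walk all nonstatic data members of T with reflection
    // and generate one vector<member_type> per member.
    // (template for is a statement and cannot declare class members; a real
    // implementation would build this type with define_aggregate.)
    struct storage {
        template for (constexpr auto mem : nonstatic_data_members_of(^^T)) {
            std::vector<[:type_of(mem):]> [:mem:];
        }
    } data;

    void push_back(const T& val) {
        template for (constexpr auto mem : nonstatic_data_members_of(^^T)) {
            // Pseudocode: data.[:mem:] stands for the vector generated for mem
            data.[:mem:].push_back(val.[:mem:]);
        }
    }
};

Internally it stores vector<float>, vector<float>, ..., vector<int> rather than vector<Particle>, which is far more cache-friendly when only some of the fields are traversed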
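For comparison, this is what the generated container amounts to when written by hand for Particle, with no reflection involved. ParticleSoA and sum_x are names invented for this sketch.

```cpp
#include <cstddef>
#include <vector>

struct Particle {
    float x, y, z;
    float vx, vy, vz;
    int type;
};

// Hand-written equivalent of what SoA<Particle> is meant to generate:
// one vector per field.
struct ParticleSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<int> type;

    void push_back(const Particle& p) {
        x.push_back(p.x);   y.push_back(p.y);   z.push_back(p.z);
        vx.push_back(p.vx); vy.push_back(p.vy); vz.push_back(p.vz);
        type.push_back(p.type);
    }

    std::size_t size() const { return x.size(); }

    // Reassembles one element; returns by value because no Particle object
    // actually exists in memory.
    Particle operator[](std::size_t i) const {
        return Particle{x[i], y[i], z[i], vx[i], vy[i], vz[i], type[i]};
    }
};

// Touches only the x array, so every loaded cache line is fully used.
float sum_x(const ParticleSoA& particles) {
    float total = 0.0f;
    for (float value : particles.x) {
        total += value;
    }
    return total;
}
```

Every new field in Particle means editing four places here, which is the maintenance burden the reflection-based version removes.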

The talk also takes detours through a number of practical reflection techniques; essential viewing for hands-on reflection

Missing (and future?) C++ range concepts - Jonathan Müller - Meeting C++ 2025

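Jonathan Müller (think-cell) explores the shortcomings of the current std::ranges concept hierarchy and where it could go

What are C++20 ranges missing today?

compile-time-sized ranges: for example std::array, whose size is known at compile time, yet no concept can express that constraint
approximately sized ranges: for example a filtered range, which you know is no larger than the original, while sized_range demands an exact size
infinite ranges: iota(0) and the like, which the current concepts handle poorly
noncontiguous ranges with contiguous chunks: for example deque, which is not contiguous overall but consists of contiguous chunks; this "segmented contiguity" cannot be expressed today

A more radical direction:

A push-based for_each_while(rng, sink) replaces the pull-based iterator so that ranges are no longer bound to iterators. That opens up a whole new world:

heterogeneous ranges: ranges whose elements have different types (such as std::tuple)
type-based metaprogramming with ordinary range algorithms
efficient handling of ranges of polymorphic types

Forward-looking ideas; standardization is a long way off, but the direction is interesting. Have a look if you are curious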
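A push-based traversal over a heterogeneous "range" can already be sketched in C++17 for std::tuple. This is only an illustration of the idea with an assumed signature; the talk's actual for_each_while interface may look different.

```cpp
#include <tuple>
#include <utility>

// Push-based traversal: feeds each tuple element to `sink` until the sink
// returns false. Returns true if every element was consumed.
template <class Tuple, class Sink>
bool for_each_while(Tuple&& tuple, Sink&& sink) {
    return std::apply(
        [&](auto&&... elements) {
            // The && fold short-circuits, so elements after the first
            // `false` are never visited.
            return (static_cast<bool>(sink(elements)) && ...);
        },
        std::forward<Tuple>(tuple));
}
```

The sink is a generic callable, so it sees each element with its own static type, something an iterator with a single value_type cannot offer.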

The promise of static reflection in C++26: Type Traits without compiler intrinsics - Andrei Zissu

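Andrei Zissu shows a concrete application of C++26 static reflection: reimplementing type traits with reflection and doing away with compiler intrinsics

Today the standard library's type traits (is_class, is_enum, is_union, and so on) all rest on compiler magic, intrinsics such as __is_class and __is_enum. Each compiler implements them differently, so they are neither portable nor transparent

With static reflection, these traits can be written as pure library code:

// Before: relies on the compiler intrinsic __is_class
template <typename T>
struct is_class : bool_constant<__is_class(T)> {};

// Now: query the kind of the type through reflection
// (is_class_type is the P2996 spelling; earlier revisions used type_is_class)
template <typename T>
constexpr bool is_class_v =
    std::meta::is_class_type(^^T);

template <typename T>
constexpr bool is_enum_v =
    std::meta::is_enum_type(^^T);

// The same style covers more specific traits as well
template <typename T>
constexpr bool has_virtual_destructor_v =
    std::meta::has_virtual_destructor(^^T);

Benefits:

Compiler and library are decoupled, and trait implementations become portable
Adding a trait no longer requires the compiler team to add an intrinsic
Users can write their own custom traits

Yet another case where reflection shines. Even better watched alongside Barry's talk above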

Instruction Level Parallelism and Software Performance - Ivica Bogosavljevic - Meeting C++ 2025

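Ivica Bogosavljevic (author of the Johnny's Software Lab blog) on how ILP (instruction-level parallelism) affects performance

What is ILP?

Modern CPUs execute out of order: if the current instruction stalls (waiting on memory, say), the CPU looks ahead for later instructions that do not depend on it and runs those first. The shorter the dependency chains in the code, the higher the ILP and the more the CPU can execute in parallel

Three typical scenarios:

High ILP: no dependencies between iterations, so the CPU can pipeline them: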

for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

Low ILP, but independent loads: sum carries a dependency across iterations, yet the loads can be issued early:

sum = 0;
for (int i = 0; i < n; i++) {
    sum += a[i];
}

Low ILP with a dependency chain through the loads: disastrous, since every load depends on the result of the previous one:

sum = 0;
while (current != 0) {
    sum += current->val;
    current = current->next; // must wait for the previous load to finish!
}

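Linked lists, trees, and hash maps with separate chaining all fall into this category

Speedup technique 1: Interleaving, handling N queries at once

Instead of searching the linked list for one value at a time, compare against N values during a single traversal:

current_node = head;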
while (current_node) {
    for (int i = 0; i < N; i++) {
        if (values[i] == current_node->value) {
            result[i] = true;
            break;
        }
    }
    current_node = current_node->next;
}

Binary trees work the same way: N lookups proceed at the same time and share the tree traversal:

// All N lookups start from the root
for (int i = 0; i < N; i++) {
    current_nodes[i] = root;
}
do {
    not_null_count = 0;
    for (int i = 0; i < N; i++) {
        if (current_nodes[i] != nullptr) {
            NodeType* node = current_nodes[i];
            if (values[i] < node->value) {
                current_nodes[i] = node->left;
            } else if (values[i] > node->value) {
                current_nodes[i] = node->right;
            } else {
                result[i] = true;
                current_nodes[i] = nullptr;
            }
            not_null_count++;
        }
    }
} while (not_null_count > 0);

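The N comparisons in the inner loop are independent, so the CPU can execute the loads in parallel. The first few levels of the tree hit the L1 cache every time

Speedup technique 2: Breaking Pointer Chains, eliminating pointer chasing

Linked list: build an index array first, then replace list traversal with array traversal in later lookups:

// Build the index array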
node* current_node = head;
while (current_node != nullptr) {
    index_vector.push_back(current_node);
    current_node = current_node->next;
}

// Traverse the array instead of the linked list
for (int i = 0; i < values.size(); i++) {
    for (int j = 0; j < index_vector.size(); j++) {
        if (index_vector[j]->value == values[i]) {
            result[i] = true;
            break;
        }
    }
}

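Binary tree: store it as an array, with the left child at 2*i+1 and the right child at 2*i+2; addresses can be computed, so no pointers need to be chased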
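A lookup in such an array-backed tree might look like the sketch below. It assumes a complete binary search tree stored level by level, so that every index below tree.size() holds a node; the function name contains is made up.

```cpp
#include <cstddef>
#include <vector>

// Binary search tree stored level by level in an array:
// the children of slot i live at 2*i+1 (left) and 2*i+2 (right).
// Assumption: the tree is complete, so every slot below size() is a node.
bool contains(const std::vector<int>& tree, int value) {
    std::size_t i = 0;
    while (i < tree.size()) {
        if (value == tree[i]) {
            return true;
        }
        // The next index is computed arithmetically; no pointer is loaded.
        i = 2 * i + (value < tree[i] ? 1 : 2);
    }
    return false;
}
```

For the tree {4, 2, 6, 1, 3, 5, 7}, a search for 5 visits slots 0, 2, and 5. The next slot still depends on a comparison with the loaded value, but its address no longer has to be read from memory.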

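Experimental data:

The interleaved version executes 1.8x more instructions, yet CPI drops from 10.6 to 2.77 and it runs about 2x faster overall. A classic case of "doing more work and still being faster"

The linked-list experiment is even more dramatic. With 25% of the nodes randomly displaced:

Simple: 53.9s, CPI=13.16
Interleaved: 1.26s, CPI=0.38, about 42x faster

Code repository: https://github.com/ibogosavljevic/johnysswlab/tree/master/2022-06-instrucionlevelparallelism

Original blog post: https://johnnysswlab.com/instruction-level-parallelism-in-practice-speeding-up-memory-bound-programs-with-low-ilp/

Essential for anyone doing performance work, especially if you deal with linked lists, trees, or hash maps a lot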

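Open source project spotlight

asteria: an embeddable scripting language. The project has long been looking for contributors, so please lend a hand; you can also join group 753302367 and debate with the author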

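If you have read this far and have suggestions, questions, or corrections, please leave a comment. Your feedback matters a great deal, and likes, bookmarks, and shares are appreciated too. Thank you for the support!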
