C++ 中文周刊 2025-08-11 第188期

周刊项目地址

公众号

点击「查看原文」跳转到 GitHub 上对应文件,链接就可以点击了

qq群 753792291 答疑在这里

RSS

欢迎投稿,推荐或自荐文章/软件/资源等,评论区留言

最近有点忙,本周刊成功升级成月刊


资讯

标准委员会动态/ide/编译器信息放在这里

编译器信息最新动态推荐关注hellogcc公众号 本周更新 2025-01-08 第288期

性能周刊

文章

How can I read more than 4GB of data from a file in a single call to ReadFile?

ReadFile 的参数 nNumberOfBytesToRead 是 32 位 DWORD,上限 4GB;大于 4GB 的数据怎么读?自己拆成多个小于 4GB 的块来读,历史原因,就这样不改了
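拆块的逻辑可以这样草拟(可移植示意,split_read 是本文虚构的辅助函数;真实代码还要循环调用 ReadFile 并推进文件偏移):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// 把一次 total 字节的读取拆成若干块,每块不超过 32 位 nNumberOfBytesToRead 的上限
std::vector<uint32_t> split_read(uint64_t total) {
    const uint64_t kMax = 0xFFFFFFFFull;  // DWORD 最大值
    std::vector<uint32_t> chunks;
    while (total > 0) {
        uint32_t n = static_cast<uint32_t>(std::min(total, kMax));
        chunks.push_back(n);
        total -= n;
    }
    return chunks;
}
```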

std::format improvements (Part 2)

问题1: API 割裂,std::format 支持编译期格式串检查,想用运行期格式串就只好退回 std::vformat

改进: 引入 std::runtime_format(C++26)

// 改进前
std::vformat(str, std::make_format_args(42));

// 改进后:更清晰的意图表达
std::format(std::runtime_format(str), 42);

问题2 参数UB

std::string str = "{}";
auto args = std::make_format_args(path.string());  // 悬垂引用:临时 string 在本行结束即销毁
std::string msg = std::vformat(str, args);        // 使用已销毁的临时对象

改进: 让 make_format_args 只接受左值

问题3 char符号问题

for (char c : "\U0001F937") {  // Unicode 字符 "🤷"
  std::print("\\x{:02x}", c);  // 可能输出 \x-10 或 \xf0
}

改进: 特化char formatter

// 内部实现修改示例:转发到 unsigned char 的 formatter
template<>
struct formatter<char> : formatter<unsigned char> {
    auto format(char c, auto& ctx) const {
        // 显式转换为无符号,避免符号扩展
        return formatter<unsigned char>::format(static_cast<unsigned char>(c), ctx);
    }
};
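符号问题本身可以脱离 format 演示(假设目标平台 char 有符号,x86 上常见):

```cpp
// '\xf0' 的位模式是 0xf0;当 char 有符号时按整数解读是 -16,
// 先转成 unsigned char 再整数提升,才能稳定得到 0..255 范围的值
int char_as_int(char c) { return c; }
int char_as_unsigned(char c) { return static_cast<unsigned char>(c); }
```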

问题4 换行符

改进

std::cout << std::format("{:?}", path);  // "multi\nline"

目前还没有编译器实现

Format your own type (Part 1)

自定义类型需要特化 std::formatter

#include <format>
#include <iostream>
#include <string>

struct ProgrammingLanguage {
    std::string name;
    int version{0};
};

template <>
struct std::formatter<ProgrammingLanguage> {
    std::formatter<std::string> _formatter;
    constexpr auto parse(std::format_parse_context& parse_context) {
        return _formatter.parse(parse_context);
    }

    auto format(const ProgrammingLanguage& programming_language,
                std::format_context& format_context) const {
        std::string output = std::format("{}{}", programming_language.name,
                                         programming_language.version);
        return _formatter.format(output, format_context);
    }
};

int main() {
    ProgrammingLanguage cpp{"C++", 20};
    std::cout << std::format("{} is fun", cpp) << '\n';
}

这只是基本用法,如何解析格式串中的具体内容来定制输出?比如我想根据指定的属性来定义格式化

需要定制parse

template <>
struct std::formatter<ProgrammingLanguage> {
    std::string _attributes;
    constexpr auto parse(std::format_parse_context& parse_context) {
        auto it = std::ranges::find(parse_context, '}');
        _attributes = std::string(parse_context.begin(), it);
        return it;
    }

    auto format(const ProgrammingLanguage& programming_language,
                std::format_context& format_context) const {
        auto out = format_context.out();
        if (_attributes.empty()) {
            out = std::format_to(out, "{}{}", programming_language.name,
                                 programming_language.version);
            return out;
        }
        for (auto n = 0u; n < _attributes.size(); ++n) {
            if (_attributes[n] == '%') {
                switch (_attributes[++n]) {
                    case 'n':
                        out = std::format_to(out, "{}",
                                             programming_language.name);
                        break;
                    case 'v':
                        out = std::format_to(out, "{}",
                                             programming_language.version);
                        break;
                    case '%':
                        out = std::format_to(out, "%");
                        break;
                }
            } else {
                out = std::format_to(out, "{}", _attributes[n]);
            }
        }
        return out;
    }
};

这样就支持 %n %v

完整代码在这里

Format your own type (Part 2)

接着上面的改动,丰富一下parser

#include <format>
#include <optional>
#include <string>

struct ProgrammingLanguage {
    std::string name;
    int major_version{0};
    std::optional<int> minor_version{};
    std::optional<int> patch_version{};
};

template <>
struct std::formatter<ProgrammingLanguage> {
    std::string _attributes;
    
    constexpr auto parse(std::format_parse_context& ctx) {
        auto it = std::ranges::find(ctx, '}');
        _attributes = std::string(ctx.begin(), it);
        return it;
    }

    auto format(const ProgrammingLanguage& pl, std::format_context& ctx) const {
        auto out = ctx.out();
        if (_attributes.empty()) {
            return std::format_to(out, "{}{}", pl.name, pl.major_version);
        }
        for (size_t n = 0; n < _attributes.size(); ++n) {
            if (_attributes[n] == '%') {
                ++n;
                switch (_attributes[n]) {
                    case 'l':
                        out = std::format_to(out, "{}", pl.name);
                        break;
                    case 'v': {
                        out = std::format_to(out, "{}", pl.major_version);
                        if (auto m = pl.minor_version)
                            out = std::format_to(out, ".{}", *m);
                        if (auto p = pl.patch_version)
                            out = std::format_to(out, ".{}", *p);
                        break;
                    }
                    case 'm':
                        out = std::format_to(out, "{}", pl.major_version);
                        break;
                    case 'n':
                        out = std::format_to(out, "{}", *pl.minor_version);
                        break;
                    case 'p':
                        out = std::format_to(out, "{}", *pl.patch_version);
                        break;
                }
            } else {
                out = std::format_to(out, "{}", _attributes[n]);
            }
        }
        return out;
    }
};

Data alignment for speed: myth or reality?

在主流的Intel和64位ARM处理器上,数据对齐对性能的提升作用被高估。作者认为这属于微观优化,性能测试无明显差异

除非涉及原子操作、要求对齐的 SIMD 加载,或数据跨越缓存行/页边界等场景

3 C++ Lambdas Myths

几个误解

There is a std::chrono::high_resolution_clock, but no low_resolution_clock

低精度时钟实现

// Windows 平台专用实现
struct cheap_steady_clock {
    using rep = int64_t;              // 支持负时间差
    using period = std::milli;        // 毫秒单位
    using duration = std::chrono::duration<rep, period>;
    using time_point = std::chrono::time_point<cheap_steady_clock>;

    static constexpr bool is_steady = true;  // 保证单调递增

    static time_point now() noexcept {
        return time_point{ duration{ static_cast<int64_t>(GetTickCount64()) } };
    }
};

// Linux 平台专用实现
struct cheap_steady_clock {
    using duration = std::chrono::nanoseconds;  // 保持纳秒类型兼容性
    using rep = duration::rep;
    using period = duration::period;
    using time_point = std::chrono::time_point<cheap_steady_clock>;

    static constexpr bool is_steady = true;

    static time_point now() noexcept {
        struct timespec tp;
        if (0 != clock_gettime(CLOCK_MONOTONIC_COARSE, &tp))
            throw std::system_error(errno, std::generic_category(), "clock_gettime failed");
        return time_point(std::chrono::seconds(tp.tv_sec) + std::chrono::nanoseconds(tp.tv_nsec));
    }
};

我见过一种玩法:用一个线程周期性更新时间戳,读的一侧更便宜但精度更低,不知道和这个比性能差异什么样子
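那种玩法大致是下面这个样子(假设性草拟,ticker_clock 是虚构名字):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// 后台线程周期性刷新毫秒时间戳;读取方只付出一次 relaxed load 的代价,
// 精度受刷新周期限制(这里是 10ms)
class ticker_clock {
    std::atomic<int64_t> now_ms_{0};
    std::atomic<bool> stop_{false};
    std::thread th_;

    void refresh() {
        using namespace std::chrono;
        now_ms_.store(duration_cast<milliseconds>(
                          steady_clock::now().time_since_epoch()).count(),
                      std::memory_order_relaxed);
    }

public:
    ticker_clock() {
        refresh();
        th_ = std::thread([this] {
            while (!stop_.load(std::memory_order_relaxed)) {
                refresh();
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }
    ~ticker_clock() {
        stop_.store(true, std::memory_order_relaxed);
        th_.join();
    }
    int64_t now_ms() const { return now_ms_.load(std::memory_order_relaxed); }
};
```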

Avoiding Undefined Behaviour with BoostTests and standard types

向std命名空间或其嵌套命名空间添加声明,在绝大多数情况下都属于未定义行为。

极少数例外情况(如为自定义类型特化std::hash)是被允许的

namespace std {
    template <typename T>
    std::ostream &operator<<(std::ostream &os, const optional<T> &value) {
        if (value.has_value()) {
            return os << "std::optional{" << value.value() << "}";
        }
        return os << "std::nullopt";
    }
} // namespace std

这么玩是UB

boost绕了一下

namespace boost::test_tools::tt_detail {
template <typename T>
struct print_log_value<std::optional<T>> {
  void operator()(std::ostream &os, const std::optional<T> &value) {
    if (value.has_value()) {
      os << "std::optional{";
      print_log_value<T>()(os, *value);
      os << "}";
    } else {
      os << "std::nullopt";
    }
  }
};
}  // namespace boost::test_tools::tt_detail

总得有人干脏活

How can I wait until a named object (say a mutex) is created?

Windows 的命名 Mutex 可以用作多进程互斥工具,但是没法很好地区分崩溃退出/无响应/多进程同时启动这些场景

推荐共享内存+版本号,或者 COM 组件

What happens if C++/WinRT is unable to resume execution on a dispatcher thread?

Being more adamant about reporting that C++/WinRT was unable to resume execution on a dispatcher thread

C++/WinRT 库提供了 winrt::resume_foreground() 函数,用于在调度器的前台线程上恢复协程,它支持三种调度器类型。

问题场景:当线程切换失败时会发生什么?

这种情况可能发生在调度器线程已关闭或不存在时,无法接受新任务

- DispatcherQueue 的情况 - 调用 TryEnqueue() 返回 false 时,winrt::resume_foreground() 会将该返回值作为 co_await 的结果传播。

开发者需检查返回值以判断线程切换是否成功,但实践中几乎无人检查,导致代码误以为运行在调度器线程上,实则可能运行在错误线程

// 示例代码:未检查返回值的典型问题
if (co_await winrt::resume_foreground(dispatcherQueue)) {
    // 理论上应在此处执行 UI 操作
} else {
    // 实际开发中极少处理此分支
}

后续在wil库中解决

如果替换成wil,代码这样

try {
    co_await wil::resume_foreground(dispatcher);
    // 执行 UI 操作
} catch (winrt::hresult_error const& ex) {
    if (ex.code() == HRESULT_FROM_WIN32(ERROR_NO_TASK_QUEUE)) {
        // 处理线程不可用场景
    } else {
        // 处理其他错误
    }
}

CUDA Pro Tip: Increase Performance with Vectorized Memory Access

利用批量load/store提高内存带宽利用率

原来的代码

__global__ void device_copy_scalar_kernel(int* d_in, int* d_out, int N) { 
  int idx = blockIdx.x * blockDim.x + threadIdx.x; 
  for (int i = idx; i < N; i += blockDim.x * gridDim.x) { 
    d_out[i] = d_in[i]; 
  } 
}

cuobjdump看指令

LDG.E R3, desc[UR6][R2.64] ;  // 加载32位数据
STG.E desc[UR6][R4.64], R3 ;  // 存储32位数据

考虑使用int2,两个int拼起来

__global__ void device_copy_vector2_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for(int i = idx; i < N/2; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int2*>(d_out)[i] = reinterpret_cast<int2*>(d_in)[i];
  }
  // 处理奇数余数
  if (idx==N/2 && N%2==1)
    d_out[N-1] = d_in[N-1];
}

汇编

LDG.E.64 R2, desc[UR4][R2.64] ;  // 64位加载
STG.E.64 desc[UR4][R4.64], R2 ;  // 64位存储

4个int拼起来

__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for(int i = idx; i < N/4; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];
  }
  // 处理余数
  int remainder = N%4;
  if (idx==N/4 && remainder!=0) {
    while(remainder) {
      int idx = N - remainder--;
      d_out[idx] = d_in[idx];
    }
  }
}

汇编

LDG.E.128 R4, desc[UR4][R4.64] ;  
STG.E.128 desc[UR4][R8.64], R4 ;

利用率越来越高

优先使用向量化​​提升带宽利用率并减少指令开销,不过会增加寄存器压力,看取舍吧,一般情况有收益

点数转换的神奇做法 - ZZYSonny的文章 - 知乎

感谢yexuanxiao投稿

SIMD(向量化)函数的混乱现实

OpenMP、编译器实现、库实现,太乱了

还有各种数据访问模式

类型      说明                              示例场景
variable  各通道数据独立(默认)            计算任意 4 列 [2,11,13,4]
uniform   所有通道数据相同                  图像宽/高、固定列号
linear    通道数据呈线性序列(如 0,1,2,3)  连续列求和:[0,1,2,3]

// 计算4个连续列(如0,1,2,3)→ linear(column)
#pragma omp declare simd uniform(img_ptr, width, height) linear(column)
double sum_column(double const* img, size_t col, size_t w, size_t h);

// 计算任意4列(如2,11,13,4)→ 所有参数variable(默认)
#pragma omp declare simd
double sum_column(double const* img, size_t col, size_t w, size_t h);
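调用侧大概长这样(假设性示意:sum_column 的函数体是虚构的;未启用 OpenMP 时 pragma 被忽略,代码退化成标量但结果不变):

```cpp
#include <cstddef>
#include <vector>

// 按列求和:img 指针、宽高对所有通道相同(uniform),列号线性递增(linear)
#pragma omp declare simd uniform(img, w, h) linear(col)
double sum_column(double const* img, std::size_t col, std::size_t w, std::size_t h) {
    double s = 0;
    for (std::size_t r = 0; r < h; ++r) s += img[r * w + col];
    return s;
}

double sum_all(double const* img, std::size_t w, std::size_t h) {
    double total = 0;
    // 向量化后一次迭代处理多列,col 正好以步长 1 递增,匹配 linear 声明
    #pragma omp simd reduction(+ : total)
    for (std::size_t col = 0; col < w; ++col)
        total += sum_column(img, col, w, h);
    return total;
}
```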

还有分支计算

// 掩码版函数定义示例
extern "C" __m256d _ZGVdM4v__Z6squared(__m256d x, __m256d mask) {
    __m256d r = _mm256_mul_pd(x, x);
    return _mm256_blendv_pd(r, x, mask); // 按掩码混合结果
}

还有对齐指令等

OpenMP 的编译器实现也不完整,优化也不到位,代码生成有可能低效

还有各种修饰问题

后面是作者手写函数伪装成编译器生成,来打patch

首先 声明 编译器生成的向量函数名遵循 _ZGV{ABI}{Mask}{Lanes}{Params}__Z{Orig} 格式:

示例:_ZGVdN4v__Z6squared

然后覆盖

// 标量版(必需)
double square(double x) { return x*x; }

// 非掩码向量版(AVX2, 4通道)
extern "C" __m256d _ZGVdN4v__Z6squared(__m256d x) {
    return _mm256_mul_pd(x, x);
}

// 掩码向量版(AVX2, 4通道)
extern "C" __m256d _ZGVdM4v__Z6squared(__m256d x, __m256d mask) {
    __m256d r = _mm256_mul_pd(x, x);
    return _mm256_blendv_pd(r, x, mask);
}

继续处理复杂参数

// 声明:uniform(图像指针/宽/高) + linear(列号)
#pragma omp declare simd uniform(img_ptr, width, height) linear(column) notinbranch
double sum_column(double const* img, size_t col, size_t w, size_t h);

// 定义:uniform参数传标量,linear参数传起始值+步长
extern "C" __m256d _ZGVdN4uluu__Z10sum_columnPKdmmm(
    double const* img,  // uniform → 标量
    size_t col,         // linear → 起始列号
    size_t w,           // uniform → 标量
    size_t h            // uniform → 标量
) { ... }

最后规避编译器问题

accu专栏

Trip report: C++ On Sea 2025

介绍了一些他参会时感兴趣的演讲,比如不提倡刷题(我支持),

比如constexpr_error_str增加编译期间报错信息

比如测试不好写是代码设计有问题

后面ppt出了,过一下

What’s THIS for?

讲了历史包袱需要回顾学习,很多信息都是元信息,总得学,持续学,喝了一碗鸡汤

Local Reasoning Can Help Prove Correctness

讲了个概念Local Reasoning, 可以理解成最小代码片验证

Simple Compile-Time Dynamic Programming in Modern C++

写了个constexpr dp

代码在这里 https://godbolt.org/z/9a9TP6aos

#include <array>
#include <cstddef>
#include <limits>

// 自底向上 DP 计算矩阵链乘法(MCM)的最优划分(constexpr)
template <std::size_t N>
constexpr std::array<std::array<int, N>, N>
matrix_chain_dp(const std::array<int, N + 1>& dims)
{
  std::array<std::array<int, N>, N> dp{};
  std::array<std::array<int, N>, N> split{};
  // 长度为 1 的链不需要乘法,代价为 0
  for (std::size_t i = 0; i < N; ++i)
  {
    dp[i][i] = 0;
  }
  // 自底向上:外层枚举链长
  for (std::size_t len = 2; len <= N; ++len)
  {
    // 枚举链的起点
    for (std::size_t i = 0; i < N - len + 1; ++i)
    {
      std::size_t j = i + len - 1;
      dp[i][j] = std::numeric_limits<int>::max();
      // 枚举划分点
      for (std::size_t k = i; k < j; ++k)
      {
        // 总代价 = 左链代价 + 右链代价 + 本次划分的乘法代价
        int cost = dp[i][k] + dp[k + 1][j]
                 + dims[i] * dims[k + 1] * dims[j + 1];
        if (cost < dp[i][j])
        {
          dp[i][j] = cost;     // 记录最小代价
          split[i][j] = k;     // 记录最优划分点
        }
      }
    }
  }
  return split;
}

UI Development with BDD and Approval Testing

感觉是想用脚本 DSL 编排生成测试用例,老外不用按键精灵吗

AI Powered Healthcare Application

AI问诊,偏远地区患者难以获得及时、个性化的医疗建议,易依赖不可靠在线信息,导致健康风险,用gpt拼一个

感觉搞不转,责任担不起

将JSON反射到C++对象

直接贴代码了 https://godbolt.org/z/Kn5b46T8j

加上注释

#include <experimental/meta>
#include <charconv>
#include <string>
#include <algorithm>

// 通过define_aggregate在编译期生成匿名结构体Inner
template <std::meta::info ...Ms>
struct Outer {
    struct Inner;
    consteval {
        define_aggregate(^^Inner, {Ms...});
    }
};

template <std::meta::info ...Ms>
using Cls = Outer<Ms...>::Inner;

// 推导构造
template <typename T, auto ... Vs>
constexpr auto construct_from = T{Vs...};

consteval std::meta::info parse_json(std::string_view json) {
    // 状态机实现:
    // - 跳过空白字符
    // - 解析键值对
    // - 处理字符串/数值/嵌套对象
    // - 递归解析嵌套结构
  auto cursor = json.begin();
  auto end = json.end();

  auto is_whitespace = [](char c) {
    return c == ' ' || c == '\n' || c == '\t';
  };

  auto skip_whitespace = [&] -> void {
    while (is_whitespace(*cursor)) cursor++;
  };

  auto expect_consume = [&] (char c) -> void {
    skip_whitespace();
    if (cursor == end || *(cursor++) != c) throw "unexpected character";
  };

  auto parse_until = [&](std::vector<char> delims, std::string &out) -> void {
    skip_whitespace();
    while (cursor != end &&
           !std::ranges::any_of(delims, [&](char c) { return c == *cursor; }))
      out += *(cursor++);
  };

  auto parse_delimited = [&](char lhs, std::string &out, char rhs) -> void {
    skip_whitespace();
    expect_consume(lhs);
    parse_until({rhs}, out);
    expect_consume(rhs);
  };

  auto parse_value = [&](std::string &out) -> void {
    skip_whitespace();

    bool quoted = false;
    unsigned depth = 0;
    while (true) {
      if (cursor == end) throw "unexpected end of stream";
      if (is_whitespace(*cursor) && !quoted && depth == 0)
        break;

      if (depth == 0 && (*cursor == ',' || *cursor == '}'))
        break;
      out += *(cursor++);

      if (out.back() == '{')
        ++depth;
      else if (out.back() == '}')
        --depth;
      else if (out.back() == '"') {
        if (quoted && depth == 0)
          break;
        quoted = true;
      }
    };
  };

  skip_whitespace();
  expect_consume('{');

  std::vector<std::meta::info> members;
  std::vector<std::meta::info> values = {^^void};

  using std::meta::reflect_constant, std::meta::reflect_constant_string;
  while (cursor != end && *cursor != '}') {
    std::string field_name;
    std::string value;

    std::meta::info parsed_type;

    parse_delimited('"', field_name, '"');
    expect_consume(':');
    parse_value(value);

    if (value.empty()) throw "expected value";
    if (cursor == end) throw "unexpected end of stream";

    if (value[0] == '"') {
      if (value.back() == '}' && value[1] == 'f') throw "hmm";
      if (value.back() != '"') throw "expected end of string";
      std::string_view contents(&value[1], value.size() - 2);

      auto dms = data_member_spec(^^char const*, {.name=field_name});
      members.push_back(reflect_constant(dms));
      values.push_back(reflect_constant_string(contents));
    } else if (value[0] >= '0' && value[0] <= '9') {
      int contents = [](std::string_view in) {
        int out;
        std::from_chars(in.data(), in.data() + in.size(), out);
        return out;
      }(value);

      auto dms = data_member_spec(^^int, {.name=field_name});
      members.push_back(reflect_constant(dms));
      values.push_back(reflect_constant(contents));
    } else if (value[0] == '{') {
      std::meta::info parsed = parse_json(value);

      auto dms = data_member_spec(type_of(parsed), {.name=field_name});
      // 收集成员,上同
      members.push_back(reflect_constant(dms));
      values.push_back(parsed);
    }
    skip_whitespace();
    if (cursor != end && *cursor == ',')
      ++cursor;
  }
  if (cursor == end) throw "hmm";
  expect_consume('}');
  //  替换类型模板参数
  values[0] = substitute(^^Cls, members);
  // 实例化对象
  return substitute(^^construct_from, values);
}

struct JSONString {
    std::meta::info Rep;
    consteval JSONString(const char *Json) : Rep{parse_json(Json)} {}
};

template <JSONString json>
consteval auto operator""_json() {
    return [:json.Rep:];
}

template <JSONString json>
inline constexpr auto json_to_object = [: json.Rep :];

#include <print>

constexpr const char data[] = {
    #embed "test.json"
    , 0
};

int main() {
  constexpr auto v = json_to_object<data>;
  static_assert(std::string_view(v.outer) == "text");
  static_assert(v.inner.number == 2996);
  static_assert(std::string_view(v.inner.field) == "yes");
  std::println("field: {}, number: {}", v.inner.field, v.inner.number);

  static_assert(R"({"field": "yes", "number": 2996})"_json.number == 2996);
}

A Library Approach to Constant Template Parameters

写了一个库 https://github.com/brevzin/ctp,用来封装 NTTP 参数,通过反射转成对象,这样 NTTP 就真正携带元信息,而不是傻傻的字符串

序列化​​:将类型实例转换为 std::meta::info的序列,定义其唯一性。 ​​反序列化​​:根据序列重建静态存储的对象,确保不同编译单元中相同值的对象地址一致。

ctp::Param<T> 封装类型 T 的反射表示,提供 get() 方法返回静态引用:

namespace ctp {
  template <class T>
  struct Param {
    T const& object;  // 静态存储的对象引用
    consteval Param(T const& v) : object(define_static_object(v)) {}
  };
}

template <>
struct Reflect<std::string> {
  using target_type = std::string_view;
  static consteval std::vector<std::meta::info> serialize(std::string const& s) {
    return {std::meta::reflect_constant_string(s)};  // 序列化为字符串字面量反射
  }
  static consteval auto deserialize(std::meta::info r) {
    return std::string_view(extract<char const*>(r));  // 反序列化为 string_view
  }
};

类型等价性保障

使用 std::meta::substitute 和 std::meta::reflect_constant_array 确保不同序列化方式生成相同的编译时常量对象。

字符串处理

容器类型(如 std::vector)

可选类型(std::optional)与变体(std::variant)

用户怎么用?

struct MyType { int a; };
template <>
struct Reflect<MyType> {
  using target_type = MyType;
  static consteval auto serialize(MyType const& t) {
    return {std::meta::reflect_constant(t.a)};
  }
  static consteval auto deserialize(std::meta::info r) {
    return MyType{extract<int>(r)};
  }
};

template <ctp::Param<MyType> T>
struct MyClass { /* ... */ };

how to break or continue from a lambda loop?

假设我们需要存储一组 Entity 对象,并跟踪它们的存活状态,内部可能使用 std::vector<std::optional<Entity>> 实现。

我们希望用户遍历有效实体,但不暴露底层实现细节(如std::optional或std::vector)。直接暴露数据结构不可行。

尝试方案与局限

这里引入 ControlFlow 状态,把 lambda 串起来

enum class ControlFlow  
{  
    Continue,  
    Break  
};  

void EntityStorage::forEntities(auto&& f)  
{  
    for (auto& optEntity : m_entities)  
        if (optEntity.has_value())  
            if (f(*optEntity) == ControlFlow::Break)  
                break;  
}

es.forEntities([&](const Entity& e)  
{  
    if (shouldSkip(e))  
        return ControlFlow::Continue;  
    if (shouldBreak(e))  
        return ControlFlow::Break;  
    // 处理逻辑...  
    return ControlFlow::Continue;  
});

也可以引入 constexpr 裁剪分支,以及扩展 ControlFlow 枚举,不展开了
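constexpr 裁剪分支大致可以这样做(假设性示意,把 Entity 简化成 int):

```cpp
#include <optional>
#include <vector>

enum class ControlFlow { Continue, Break };

struct EntityStorage {
    std::vector<std::optional<int>> m_entities;

    // CanBreak=false 时,if constexpr 在编译期裁掉 break 检查
    template <bool CanBreak = true>
    void forEntities(auto&& f) {
        for (auto& opt : m_entities) {
            if (!opt.has_value()) continue;
            if constexpr (CanBreak) {
                if (f(*opt) == ControlFlow::Break) break;
            } else {
                f(*opt);  // 返回值被忽略,循环必然跑完
            }
        }
    }
};
```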

C++简单类型的拥有式类型擦除技术

简单类型定义

核心是利用 C++20 中标准化的"隐式生命周期"(implicit lifetime)类型概念(P0593)

例子

class shape {
    alignas(size_alignment) unsigned char data[size_alignment]{};
    typedef void draw_fun(const unsigned char* buffer);
    draw_fun* ptr;

public:
    template <typename T>
    shape(const T& t) noexcept :
        ptr(+[](const unsigned char* buffer){
            auto obj = std::launder(reinterpret_cast<const T*>(buffer));
            obj->draw();
        })
    {
        // 一系列static_assert确保T是简单类型
        std::memcpy(data, &t, sizeof(T));
    }

    void draw() const {
        this->ptr(data);
    }
};

更独立分离的实现

template <int N>
class trivial_storage {
    alignas(N) unsigned char buffer[N] = {};

public:
    template <typename T>
    trivial_storage(const T& t) noexcept {
        // static_assert检查
        std::memcpy(buffer, &t, sizeof(T));
    }

    template <typename T>
    T* data() noexcept {
        return std::launder(reinterpret_cast<T*>(buffer));
    }
    // ...
};

class shape {
    using storage = trivial_storage<4 * alignof(float)>;
    storage st;
    using draw_fun = void(const storage&);
    draw_fun* ptr;

public:
    template <typename T>
    shape(const T& t) noexcept
        : st(t), ptr(+[](const storage& st) { st.data<T>()->draw(); }) {}

    void draw() const { this->ptr(st); }
};
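按这个思路可以拼一个能跑的小例子(假设性示意:把 draw 换成能断言结果的 area,并补上 const 版 data()):

```cpp
#include <cstring>
#include <new>
#include <type_traits>

// 拥有式存储:只接受足够小、trivially copyable 的类型
template <int N>
class trivial_storage {
    alignas(N) unsigned char buffer[N] = {};

public:
    template <typename T>
    trivial_storage(const T& t) noexcept {
        static_assert(std::is_trivially_copyable_v<T> && sizeof(T) <= N);
        std::memcpy(buffer, &t, sizeof(T));
    }

    template <typename T>
    const T* data() const noexcept {
        return std::launder(reinterpret_cast<const T*>(buffer));
    }
};

// 类型擦除的"求面积"包装,结构与正文的 shape 相同
class area_of {
    using storage = trivial_storage<16>;
    storage st;
    using area_fun = double(const storage&);
    area_fun* ptr;

public:
    template <typename T>
    area_of(const T& t) noexcept
        : st(t), ptr(+[](const storage& s) { return s.data<T>()->area(); }) {}

    double operator()() const { return ptr(st); }
};

struct square_t {
    double side;
    double area() const { return side * side; }
};
```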

simd perlin noise part1

part2

优化代码

f32 PerlinNoise(f32 x, f32 y, f32 z)
{
    // 找到包含该点的单位立方体
    if (x < 0) x -= 1.f;
    if (y < 0) y -= 1.f;
    if (z < 0) z -= 1.f;

    u32 Xi = u32(x) & 255;
    u32 Yi = u32(y) & 255;
    u32 Zi = u32(z) & 255;

    // 计算点在立方体内的相对坐标
    x -= Floorf(x);
    y -= Floorf(y);
    z -= Floorf(z);

    // 计算 x, y, z 的淡入曲线(fade curves)
    f32 u = fade(x);
    f32 v = fade(y);
    f32 w = fade(z);

    // 对立方体 8 个角的坐标进行哈希
    u32 A  = (u32)Global_PerlinHashValues[Xi]  + Yi;
    u32 AA = (u32)Global_PerlinHashValues[A]   + Zi;
    u32 AB = (u32)Global_PerlinHashValues[A+1] + Zi;
    u32 B  = (u32)Global_PerlinHashValues[Xi+1]+ Yi;
    u32 BA = (u32)Global_PerlinHashValues[B]   + Zi;
    u32 BB = (u32)Global_PerlinHashValues[B+1] + Zi;

    // 哇哦……
    f32 res = lerp(w,
             lerp(v,
               lerp(u, grad(Global_PerlinHashValues[AA  ], x  , y  , z   ),
                       grad(Global_PerlinHashValues[BA  ], x-1, y  , z   )),
               lerp(u, grad(Global_PerlinHashValues[AB  ], x  , y-1, z   ),
                       grad(Global_PerlinHashValues[BB  ], x-1, y-1, z   ))),
             lerp(v,
               lerp(u, grad(Global_PerlinHashValues[AA+1], x  , y  , z-1 ),
                       grad(Global_PerlinHashValues[BA+1], x-1, y  , z-1 )),
               lerp(u, grad(Global_PerlinHashValues[AB+1], x  , y-1, z-1 ),
                       grad(Global_PerlinHashValues[BB+1], x-1, y-1, z-1 ))));

    res = (res + 1.0f)/2.0f;

    return res;
}

最简单优化,批量

void PerlinNoise_8x(f32 *x, f32 y, f32 z, f32 *Result)
{
    for (u32 Index = 0; Index < 8; ++Index)
    {
        Result[Index] = PerlinNoise(x[Index], y, z);
    }
}

展开lerp

// 呼……终于松了口气

u32 H0 = Global_PerlinHashValues[AA  ];
u32 H1 = Global_PerlinHashValues[BA  ];
u32 H2 = Global_PerlinHashValues[AB  ];
u32 H3 = Global_PerlinHashValues[BB  ];
u32 H4 = Global_PerlinHashValues[AA+1];
u32 H5 = Global_PerlinHashValues[BA+1];
u32 H6 = Global_PerlinHashValues[AB+1];
u32 H7 = Global_PerlinHashValues[BB+1];

f32 G0 = grad(H0, x  , y  , z);
f32 G1 = grad(H1, x-1, y  , z);
f32 G2 = grad(H2, x  , y-1, z);
f32 G3 = grad(H3, x-1, y-1, z);

f32 G4 = grad(H4, x  , y  , z-1);
f32 G5 = grad(H5, x-1, y  , z-1);
f32 G6 = grad(H6, x  , y-1, z-1);
f32 G7 = grad(H7, x-1, y-1, z-1);

f32 L0  = lerp(u, G0, G1);
f32 L1  = lerp(u, G2, G3);
f32 L2  = lerp(u, G4, G5);
f32 L3  = lerp(u, G6, G7);

f32 L4  = lerp(v, L0, L1);
f32 L5  = lerp(v, L2, L3);

f32 res = lerp(w, L4, L5);

lerp批量

lerp实现很简单

f32 lerp(f32 t, f32 a, f32 b)
{
    f32 res = a + t * (b - a);
    return res;
}

考虑4x

union f32_4x
{
    __m128 Sse;
    f32 E[4];
};

inline f32_4x operator+(f32_4x A, f32_4x B)
{
    f32_4x Result = {_mm_add_ps(A.Sse, B.Sse)};
    return Result;
}

inline f32_4x operator-(f32_4x A, f32_4x B)
{
    f32_4x Result = {_mm_sub_ps(A.Sse, B.Sse)};
    return Result;
}

inline f32_4x operator*(f32_4x A, f32_4x B)
{
    f32_4x Result = {_mm_mul_ps(A.Sse, B.Sse)};
    return Result;
}

// 看,新类型!

f32_4x Lerp4x(f32_4x t, f32_4x a, f32_4x b)
{
    f32_4x res = a + t * (b - a); // 公式完全一样!
    return res;
}
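可以用一个小测试确认 4 通道版和标量版逐通道一致(假设 x86-64,SSE 默认可用;f32_4x 的运算符用 _mm_*_ps 实现):

```cpp
#include <immintrin.h>

using f32 = float;

union f32_4x {
    __m128 Sse;
    f32 E[4];
};

inline f32_4x operator+(f32_4x A, f32_4x B) { return {_mm_add_ps(A.Sse, B.Sse)}; }
inline f32_4x operator-(f32_4x A, f32_4x B) { return {_mm_sub_ps(A.Sse, B.Sse)}; }
inline f32_4x operator*(f32_4x A, f32_4x B) { return {_mm_mul_ps(A.Sse, B.Sse)}; }

f32 lerp(f32 t, f32 a, f32 b) { return a + t * (b - a); }

// 公式与标量版完全一样,只是类型换成了 4 通道
f32_4x Lerp4x(f32_4x t, f32_4x a, f32_4x b) { return a + t * (b - a); }
```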

grad

f32 grad(u32 hash, f32 x, f32 y, f32 z)
{
    int h = hash & 15;

    // 将 hash 的低 4 位映射为 12 个梯度方向
    f32 u = h < 8 ? x : y;
    f32 v = h < 4 ? y : (h == 12 || h == 14) ? x : z;

    f32 res = ((h & 1) == 0 ? u : -u) + ((h & 2) == 0 ? v : -v);
    return res;
}

展开

link_internal f32
grad(u32 hash, f32 x, f32 y, f32 z)
{
    auto h = hash & 15;

    auto uSel = h < 8;
    auto vSel = h < 4;
    auto xSel = (h == 12 | h == 14);

    auto u  = uSel ? x : y;
    auto xz = xSel ? x : z;
    auto v  = vSel ? y : xz;

    auto uFlip = (h & 1) == 1;
    auto vFlip = (h & 2) == 2;

    auto R0 = uFlip ? u*-1 : u;
    auto R1 = vFlip ? v*-1 : v;
    auto Result = R0 + R1;

    return Result;
}

每一种操作都实现x4,类似f32_x4,然后拼起来, 另外涉及分支

f32_4x Select(u32_4x Mask, f32_4x A, f32_4x B)
{
    // Mask 置位的通道取 A,否则取 B(假设 u32_4x 按 f32_4x 同样的 union 方式封装 __m128i)
    f32_4x Result = {_mm_blendv_ps(B.Sse, A.Sse, _mm_castsi128_ps(Mask.Sse))};
    return Result;
}
f32_4x Grad4x(u32_4x hash, f32_4x x, f32_4x y, f32_4x z)
{
    auto _15 = U32_4X(15);
    auto _14 = U32_4X(14);
    auto _12 = U32_4X(12);
    auto _8  = U32_4X(8);
    auto _4  = U32_4X(4);
    auto _2  = U32_4X(2);
    auto _1  = U32_4X(1);
    auto _n1 = F32_4X(-1);

    auto h = hash & _15;

    auto uSel = h < _8;
    auto vSel = h < _4;
    auto xSel = (h == _12 | h == _14);

    auto u  = Select(uSel, x, y);
    auto xz = Select(xSel, x, z);
    auto v  = Select(vSel, y, xz);

    auto uFlip = (h & _1) == _1;
    auto vFlip = (h & _2) == _2;

    auto R0 = Select(uFlip, u * _n1, u);
    auto R1 = Select(vFlip, v * _n1, v);
    auto Result = R0 + R1;

    return Result;
}

现在PerlinNoise_8x这样

// 每次循环迭代计算 4 个噪声格子!

for (u32 VoxelIndex = 0; VoxelIndex < 8; VoxelIndex += 4)
{
    // 哈希部分省略,详见 GitHub 完整函数

    // 这看起来和原始实现惊人地相似。
    // 如果你重载了函数名,几乎看不出这是 SIMD 代码。

    auto G0 = Grad4x(H0,  x,  y, z);
    auto G1 = Grad4x(H1, nx,  y, z);
    auto G2 = Grad4x(H2,  x, ny, z);
    auto G3 = Grad4x(H3, nx, ny, z);

    auto G4 = Grad4x(H4,  x,  y, nz);
    auto G5 = Grad4x(H5, nx,  y, nz);
    auto G6 = Grad4x(H6,  x, ny, nz);
    auto G7 = Grad4x(H7, nx, ny, nz);

    auto L0 = Lerp4x(u4, G0, G1);
    auto L1 = Lerp4x(u4, G2, G3);
    auto L2 = Lerp4x(u4, G4, G5);
    auto L3 = Lerp4x(u4, G6, G7);
    auto L4 = Lerp4x(v4, L0, L1);
    auto L5 = Lerp4x(v4, L2, L3);

    auto Res = Lerp4x(w4, L4, L5);
    Res = (Res + F32_4X(1.f)) / F32_4X(2.f);

    Result[VoxelIndex+0] = Res.E[0];
    Result[VoxelIndex+1] = Res.E[1];
    Result[VoxelIndex+2] = Res.E[2];
    Result[VoxelIndex+3] = Res.E[3];
}

性能提升70%

继续,考虑 AVX2(256 位),直接替换成 _mm256 接口,性能翻倍

借鉴fastnoise2 使用更快的hash

template<typename... P>
FS_INLINE static int32v HashPrimes(int32v seed, P... primedPos)
{
    int32v hash = seed;
    hash ^= (primedPos ^ ...);  // C++17 折叠表达式,对所有 primedPos 进行异或

    hash *= int32v(0x27d4eb2d);
    return (hash >> 15) ^ hash;
}
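把向量逻辑退化成标量更直观(假设性草拟,HashPrimesScalar 是虚构名字;乘法放在无符号域,避免有符号溢出 UB):

```cpp
#include <cstdint>
#include <initializer_list>

// 与上面的向量版逐元素逻辑一致:先把所有素数坐标异或进 seed,
// 再乘以常数搅拌,最后右移异或输出
int32_t HashPrimesScalar(int32_t seed, std::initializer_list<int32_t> primedPos) {
    uint32_t h = static_cast<uint32_t>(seed);
    for (int32_t p : primedPos) h ^= static_cast<uint32_t>(p);  // 对应折叠表达式 (primedPos ^ ...)
    h *= 0x27d4eb2du;
    int32_t s = static_cast<int32_t>(h);
    return (s >> 15) ^ s;  // 算术右移,对应向量版 int32v 的 >>
}
```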

继续优化,消除数据依赖

auto L0 = Lerp8x(xs, G0, G1);
auto L1 = Lerp8x(xs, G2, G3);
auto L2 = Lerp8x(xs, G4, G5);
auto L3 = Lerp8x(xs, G6, G7);

auto L4 = Lerp8x(ys, L0, L1); // 依赖 L0, L1
auto L5 = Lerp8x(ys, L2, L3); // 依赖 L2, L3

auto Res = Lerp8x(zs, L4, L5); // 依赖 L4, L5  :'(

考虑#pragma unroll(2)降低依赖影响,另外提前计算部分值

分析依赖 llvm-mca


        40.   vblendvps	%ymm5, %ymm6, %ymm3, %ymm1
        41.   vmovups	%ymm1, 736(%rsp)
 +----< 42.   vmovdqa	64(%rdx), %ymm1
 |      43.   vmovdqu	%ymm1, 160(%rsp)
 +----> 44.   vpxor	%ymm0, %ymm1, %ymm3          ## REGISTER dependency:  %ymm1
 |      45.   vmovdqa	%ymm11, %ymm7
 |      46.   vmovdqu	%ymm11, 64(%rsp)
 +----> 47.   vpxor	%ymm3, %ymm11, %ymm1         ## REGISTER dependency:  %ymm3
 |      48.   vmovdqu	%ymm1, 672(%rsp)
 +----> 49.   vpand	%ymm1, %ymm12, %ymm0         ## REGISTER dependency:  %ymm1
 +----> 50.   vpcmpeqd	%ymm0, %ymm8, %ymm0      ## REGISTER dependency:  %ymm0
 +----> 51.   vblendvps	%ymm0, %ymm14, %ymm4, %ymm10  ## REGISTER dependency:  %ymm0
 |      52.   vmovaps	128(%rdx), %ymm0
 |      53.   vpsrld	$15, %ymm1, %ymm1
 |      54.   vmovdqu	%ymm1, 608(%rsp)
 +----> 55.   vpand	%ymm1, %ymm13, %ymm1         ## RESOURCE interference:  HWPort5 [ probability: 99% ]
 |      56.   vmovdqu	%ymm1, (%rsp)
 +----> 57.   vpcmpgtd	%ymm1, %ymm9, %ymm11     ## REGISTER dependency:  %ymm1
 +----> 58.   vblendvps	%ymm11, %ymm0, %ymm10, %ymm1 ## REGISTER dependency:  %ymm11
 |      59.   vmovups	%ymm1, 640(%rsp)
 |      60.   vmovdqa	%ymm2, %ymm1
 |      61.   vmovdqu	%ymm2, 352(%rsp)
 |      62.   vpxor	%ymm2, %ymm3, %ymm2

注意到大量寄存器依赖("REGISTER dependency")和端口 5 的资源争用(Port5 contention)。同时有很多 vblendvps 指令,它们来自我们的 Select 函数。这表明 Grad8x 例程目前是性能瓶颈,是进一步优化的重点。

回顾代码

link_internal f32_8x
Grad8x(u32_8x hash, f32_8x x, f32_8x y, f32_8x z)
{
    auto h = hash & 15;

    u32_8x uSel = h < 8;
    u32_8x vSel = h < 4;
    u32_8x xSel = (h == 12 | h == 14);

    f32_8x u  = Select(uSel, x, y);
    f32_8x xz = Select(xSel, x, z);
    f32_8x v  = Select(vSel, y, xz);

    auto h1 = hash << 31;
    auto h2 = (hash & U32_8X(2)) << 30;
    f32_8x Result = (u ^ Cast_f32_8x(h1)) + (v ^ Cast_f32_8x(h2));

    return Result;
}

u、v 最终会相加,三个 Select 可以消掉一个

完整代码在这里 https://github.com/scallyw4g/bonsai_stdlib/blob/770baa72a8243ed6c48cc20464d44158feb43d0b/src/perlin.cpp

From Boolean logic to bitmath and SIMD: transitive closure of tiny graphs

从布尔逻辑到位运算与 SIMD:小型图的传递闭包计算

假设我们有一个最多包含 8 个节点的图(也可以理解为 8 个元素之间的关系),用一个 8×8 的布尔矩阵表示。我们的目标是计算这个图的传递闭包(Transitive Closure)——即,如果从节点 A 可以到达 B,从 B 可以到达 C,则闭包中应包含 A 可以到达 C。

这个问题的经典解法是 Warshall 算法(Floyd-Warshall 算法的布尔版本),它通过逐步“扩展”路径来构建闭包

第一版

std::bitset<64> closure8x8(std::bitset<64> g)
{
    for (size_t k = 0; k < 8; k++) {
        for (size_t i = 0; i < 8; i++) {
            if (g[i * 8 + k]) {
                for (size_t j = 0; j < 8; j++) {
                    if (g[k * 8 + j]) {
                        g.set(i * 8 + j);
                    }
                }
            }
        }
    }
    return g;
}

位压缩

uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t kth_row = (x >> (k * 8)) & 0xFF;  // 提取第 k 行
        for (size_t i = 0; i < 8; i++) {
            if (x & (UINT64_C(1) << (i * 8 + k))) {  // 若 i 可达 k
                x |= kth_row << (i * 8);              // 将 k 行或入 i 行
            }
        }
    }
    return x;
}

消除循环携带的依赖:观察到 x 在内部被改动,剥离出 x_old

uint64_t closure8x8(uint64_t x)
{
    for (size_t k = 0; k < 8; k++) {
        uint64_t x_old = x;
        uint64_t kth_row = (x >> (k * 8)) & 0xFF;
        for (size_t i = 0; i < 8; i++) {
            if (x_old & (UINT64_C(1) << (i * 8 + k))) {
                x |= kth_row << (i * 8);
            }
        }
    }
    return x;
}

考虑向量化

uint64_t closure8x8(uint64_t x)
{
    const uint64_t lsb_of_byte = 0x0101010101010101;
    for (size_t k = 0; k < 8; k++) {
        uint64_t kth_row = (x >> (k * 8)) & 0xFF;
        uint64_t mask = (x >> k) & lsb_of_byte;
        x |= kth_row * mask;
    }
    return x;
}
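位并行版的正确性可以拿 bitset 朴素版交叉验证(示意测试,两个函数与上文一致):

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// 朴素 Warshall(bitset 版)
std::bitset<64> closure_naive(std::bitset<64> g) {
    for (std::size_t k = 0; k < 8; k++)
        for (std::size_t i = 0; i < 8; i++)
            if (g[i * 8 + k])
                for (std::size_t j = 0; j < 8; j++)
                    if (g[k * 8 + j]) g.set(i * 8 + j);
    return g;
}

// 位并行版:乘法把第 k 行广播进所有 mask 置位的行,各字节互不进位
uint64_t closure_bits(uint64_t x) {
    const uint64_t lsb_of_byte = 0x0101010101010101;
    for (std::size_t k = 0; k < 8; k++) {
        uint64_t kth_row = (x >> (k * 8)) & 0xFF;
        uint64_t mask = (x >> k) & lsb_of_byte;
        x |= kth_row * mask;
    }
    return x;
}
```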

这个操作和MOR非常像

uint64_t MOR(uint64_t a, uint64_t b)
{
    const uint64_t lsb_of_byte = 0x0101010101010101;
    uint64_t result = 0;
    for (size_t i = 0; i < 8; i++) {
        uint64_t row_i = (a >> (i * 8)) & 0xFF;
        uint64_t mask = (b >> i) & lsb_of_byte;
        result |= row_i * mask;
    }
    return result;
}

上面可以改成

uint64_t closure8x8(uint64_t x)
{
    x |= MOR(x, x);   // A → A ∪ A²
    x |= MOR(x, x);   // → A ∪ A² ∪ A³ ∪ A⁴
    x |= MOR(x, x);   // → A ∪ ... ∪ A⁸
    return x;
}

没啥收益

考虑 AVX-512(借助 GFNI 的位矩阵指令)

uint64_t MOR(uint64_t a, uint64_t b)
{
    __m512i va = _mm512_set1_epi64(a);
    __m512i z = _mm512_setzero_si512();
    __m512i m_id = _mm512_set1_epi64(0x8040201008040201);  // 用于转置的常数
    __m512i masked = _mm512_mask_blend_epi8(b, z, va);      // 选择性广播
    __m512i maskedt = _mm512_gf2p8affine_epi64_epi8(m_id, masked, 0);
    return _mm512_cmpneq_epi8_mask(maskedt, z);  // 提取结果
}

这条路线吞吐低、占用带宽多,单位时间实际处理得少

考虑 AVX2(256 位,一次处理 4 个图)

__m256i closure8x8_avx2(__m256i g)
{
    const uint64_t ones = 0x0101010101010101;
    __m256i z = _mm256_setzero_si256();

    // 同时处理 4 个 8×8 图(每个 64-bit 一个),对 k = 0..7 迭代
    for (int k = 0; k < 8; k++) {
        __m256i shuf = _mm256_broadcastsi128_si256(
            _mm_setr_epi64x(ones * k, ones * (k + 8))
        );
        __m256i kth_rows = _mm256_shuffle_epi8(g, shuf);           // 提取每图的第 k 行
        __m256i mask_bits = _mm256_slli_epi64(g, 7 - k);           // 将第 k 位列移到字节最高位
        __m256i mask = _mm256_blendv_epi8(z, kth_rows, mask_bits); // 条件 OR
        g = _mm256_or_si256(g, mask);
    }

    return g;
}

或者手写循环展开

__m256i closure8x8_avx2(__m256i g)
{
    const uint64_t ones = 0x0101010101010101;
    __m256i z = _mm256_setzero_si256();
    __m256i m, shuf;

#define STEP(k) \
    shuf = _mm256_broadcastsi128_si256(_mm_setr_epi64x(ones * (k), ones * (k + 8))); \
    m = _mm256_blendv_epi8(z, _mm256_shuffle_epi8(g, shuf), _mm256_slli_epi64(g, 7 - (k))); \
    g = _mm256_or_si256(g, m);

    STEP(0)
    STEP(1)
    STEP(2)
    STEP(3)
    STEP(4)
    STEP(5)
    STEP(6)
    STEP(7)

#undef STEP
    return g;
}

Destructive in-order tree traversal

考虑中序遍历

template<std::movable T>
struct node
{
    T value;
    node* parent;
    node* left;
    node* right;
};
template<std::movable T, std::output_iterator<T> Iterator>
void move_to(node<T>* node, Iterator& out)
{
    if (node == nullptr) return;
    move_to(node->left, out);
    *out++ = std::move(node->value);
    move_to(node->right, out);
}

这是递归版本,写个迭代版本

template<std::movable T, std::output_iterator<T> Iterator>
void move_to_1(node<T>* root, Iterator out)
{
    node<T>* current = root;
    node<T>* previous = nullptr;

    while (current) {
        if (previous == current->parent) {
            goto from_parent;
        } else if (previous == current->left) {
            goto from_left;
        } else if (previous == current->right) {
            goto from_right;
        }
        
        from_parent:
            if (current->left) {
                previous = std::exchange(current, current->left);
                continue;
            }
        from_left:
            *out++ = std::move(current->value);
            if (current->right) {
                previous = std::exchange(current, current->right);
                continue;
            }
        from_right:
            previous = std::exchange(current, current->parent);
    }
}

这段代码由 StackOverflow 用户 @OmarOthman 提供,基于 @svick 的早期回答。其核心思想是:知道“从哪来”,就能决定“往哪去”。

这里用 goto 反而更清晰,就不要再说什么 goto 不好了。

既然都用 goto 了,能不能继续优化?第一个分支其实多余:三个条件都不匹配时本来就会落入第一个 label。去掉它,进入循环前先 goto from_parent,并把向左深入改写成紧凑的内层循环即可

template<std::movable T, std::output_iterator<T> Iterator>
void move_to_2(node<T>* root, Iterator out)
{
    node<T>* current = root;
    node<T>* previous = nullptr;

    if (current == nullptr) return;  // 空树直接返回,避免下面解引用空指针
    goto from_parent;
    while (current) {
        // 移除了 (previous == current->parent) 的检查
        if (previous == current->left) {
            goto from_left;
        } else if (previous == current->right) {
            goto from_right;
        }
        
        from_parent:
            while (current->left) {  // 紧凑循环,直接向左深入
                previous = std::exchange(current, current->left);
            }
        from_left:
            *out++ = std::move(current->value);
            if (current->right) {
                previous = std::exchange(current, current->right);
                goto from_parent;  // 直接跳回 from_parent
            }
        from_right:
            previous = std::exchange(current, current->parent);
    }
}

更极端一点,如果这棵树用完就丢,我们可以改写父指针

当从某个节点移动到其右子节点时,可以直接把右子节点的父指针改写为当前节点的父节点:当前节点已经访问完,不需要再回去了

每当从子节点返回,我们都“假装”是从左子节点回来的:访问当前节点 → 检查右子节点 → 检查父节点。

这完全抹平了 from_left 和 from_right 的区别!因此可以合并为一个 from_child 标签

template<std::movable T, std::output_iterator<T> Iterator>
void move_to_3(node<T>* root, Iterator out)
{
    node<T>* current = root;
    if (current == nullptr) return;  // 空树直接返回

    from_parent:
        while (current->left) {
            current = current->left;
        }
    from_child:
        *out++ = std::move(current->value);
        if (current->right) {
            auto previous = std::exchange(current, current->right);
            current->parent = previous->parent;  // 重写父指针
            goto from_parent;
        }
        current = current->parent;
        if (current) {
            goto from_child;
        }
}

当然,遍历完也可以把父指针恢复回去,所以并不算真正的“破坏”

Stepanov’s biggest blunder

adjacent_difference接口设计失误

// 原始序列:12586 13012 13560 13670 14236
// 输出结果:12586 426 548 110 566
std::adjacent_difference(postings.begin(), postings.end(), compressed.begin());

STL 作者(Stepanov)的设计思路是让 adjacent_difference 成为 partial_sum 的逆运算

问题1 差分会改变值的“单位”:比如时间戳相减得到的是时间间隔,而 adjacent_difference 默认要求输入输出是同一类型

问题2 输出的第一个元素是原样拷贝的首元素,比真正的差分多出一个元素,冗余

替代品:pairwise_transform(C++23 中的 std::views::pairwise_transform,即 adjacent_transform<2>)

开源项目介绍


上一期

本期

下一期

看到这里或许你有建议或者疑问或者指出错误,请留言评论! 多谢! 你的评论非常重要!也可以帮忙点赞收藏转发!多谢支持! 觉得写的不错那就给点吧, 在线乞讨 微信转账