TL,DR: 短代码cmov快 长代码jmp快 cmov如果依赖特别严重会有性能衰退

原文

加了一些自己的理解

所谓的branch-free编程 或者说 branch-less编程,就是一种优化if分支降低branch miss的手段

比如借助冒号表达式,借助+=等等

而这种优化通常都会优化成cmov

大概十几年前 linus对于cmov不屑一顾,性能表现不行

测试代码在这里

#include <stdio.h>

/* How many iterations? */
#define ITERATIONS (100000000)

/* Which bit of the counter to test? */
#define BIT 1

#ifdef CMOV

#define choose(i, a, b) ({			\
	unsigned long result;			\
	asm("testl %1,%2 ; cmovne %3,%0"	\
		:"=r" (result)			\
		:"i" (BIT),			\
		 "g" (i),			\
		 "rm" (a),			\
		 "0" (b));			\
	result; })

#else

#define choose(i, a, b) ({			\
	unsigned long result;			\
	asm("testl %1,%2 ; je 1f ; mov %3,%0\n1:"	\
		:"=r" (result)			\
		:"i" (BIT),			\
		 "g" (i),			\
		 "g" (a),			\
		 "0" (b));			\
	result; })

#endif

int main(int argc, char **argv)
{
	int i;
	unsigned long sum = 0;

	for (i = 0; i < ITERATIONS; i++) {
		unsigned long a = 5, b = 7;
		sum += choose(i, a, b);
	}
	printf("%lu\n", sum);
	return 0;
}

我测试了一下

gcc -Wall -O2 cmov.c

time ./a.out
600000000

real 0m0.540s
user 0m0.065s
sys 0m0.001s

gcc -Wall -O2 -DCMOV cmov.c
time ./a.out

600000000

real 0m0.522s
user 0m0.055s
sys 0m0.000s

显然cmov更快 linus这个结论过时了,毕竟07年。

新机器cmov是快的。这个测试比较简单没有引入cacheline的影响

这里有个测试显然cmov也是快的

Agner Fog的文档里也说了

(for “a = b > c ? d : e;”): As a rule of thumb, we can say that a conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%. A conditional jump is also preferred if we can avoid a lengthy calculation of d or e when the other operand is chosen.

简单说就是依赖性弱cmov快否则jmp快 (你问chatgpt也是这个回答)

那么 __builtin_expect 会生成cmov吗?

答案是可能,但不一定,简单代码还是可以生成的不能,但是__builtin_expect_with_probability 一定不行

godbolt

#define LIKELY(x) (__builtin_expect(!!(x), 1))
#define VERY_LIKELY(x) (__builtin_expect_with_probability(!!(x), 1, 0.999))

int foo1(int a, int b, int c)
{
    if (a == b)
        a &= c;
    return a;    
}

/*
foo1:
        and     edx, edi
        mov     eax, edi
        cmp     edi, esi
        cmove   eax, edx
        ret
*/

int foo2(int a, int b, int c)
{
    if (LIKELY(a == b))
        a &= c;
    return a;    
}
/*
foo2:
        and     edx, edi
        mov     eax, edi
        cmp     edi, esi
        cmove   eax, edx
        ret
*/
int foo3(int a, int b, int c)
{
    if (VERY_LIKELY(a == b))
        a &= c;
    return a;    
}


/*
foo3:
        mov     eax, edi
        cmp     edi, esi
        jne     .L7
        and     eax, edx
.L7:
        ret
*/

冒号表达式一定能cmov吗?

这个例子来自lemire

godbolt

while ((pos1 < size1) & (pos2 < size2)) {
  v1 = input1[pos1];
  v2 = input2[pos2];
  output_buffer[pos++] = (v1 <= v2) ? v1 : v2;
  pos1 = (v1 <= v2) ? pos1 + 1 : pos1;
  pos2 = (v1 >= v2) ? pos2 + 1 : pos2;
}

这个代码无法预测,没法优化成cmov,一会大于一会小于,怎么假设啊

但是我们可以利用交换 +=

while ((pos1 < size1) & (pos2 < size2)) {
  v1 = input1[pos1];
  v2 = input2[pos2];
  output_buffer[pos++] = (v1 <= v2) ? v1 : v2;
  pos1 += (v1 <= v2);
  pos2 += (v1 >= v2);
}

LLVM 后端引入分支避免生成cmov

godbolt


#include <cstddef>
#include <cstdint>
#include <utility>

void downHeap(uint64_t* top, size_t size, size_t pos) {
  size_t parent = pos;
  size_t child;
  while ((child = 2 * parent + 2) < size) {
    auto left = child - 1;
    child = top[left] < top[child] ? child : left;  // <<-- Unpredictable, should be CMOV
    if (top[parent] < top[child]) {
      std::swap(top[parent], top[child]);
      parent = child;
    } else {
      return;
    }
  }
  if (--child < size && top[parent] < top[child]) {
    std::swap(top[parent], top[child]);
  }
}


用了cmov 但是性能反倒是更差 为了避免这个bug

引入 __builtin_unpredictable 的出发点是可以屏蔽掉cmov生成,但是只能影响middle-end

后段select会不会优化成cmov不能通过__builtin_unpredictable修正

无分支代码一定会生成cmov吗

不一定,可能会插入分支

比如 本应该生成cmov llvm给加分支了

godbolt


#include <cstddef>
#include <cstdint>

// Select an element from an array in constant time.
uint64_t constant_time_lookup(const size_t secret_idx,
                              const uint64_t table[8]) {
  uint64_t result = 0;
  for (size_t i = 0; i < 8; i++) {
    const bool cond = i == secret_idx;
    const uint64_t mask = (-(int64_t)cond);
    result |= table[i] & mask;
  }
  return result;
}

这个例子我不是很理解,貌似是密码学防止攻击,故意不优化成cmov

gcc一个例子 godbolt

unsigned foo(unsigned a, unsigned b)
{
  unsigned t = ((a > 2) != 0) << 1;
  t |= ((a < 10) != 0) << 2;
  return b >> t;
}

再比如

unsigned r = ((a & 0xffff0000) != 0) << 4;

会改成

unsigned r;
if (a > 65535)
  r = 16;
else
  r = 0;

结论

  • cmov不是万能药,短代码cmov快 长代码jmp快 cmov如果依赖特别严重会有性能衰退
  • __builtin_expect是优化预测分支的,不是优化无分支的,__builtin_expect_with_probability可以优化无分支
  • __builtin_unpredictable 想帮忙屏蔽cmov,但没啥用
  • 编译器可能会给简单的无分支代码加上分支

其他