german string
17 Oct 2024
|
|
prefix计算,空间换时间
https://cedardb.com/blog/german_strings/
https://cedardb.com/blog/strings_deep_dive/
这里有来源介绍
umbra论文的来的,主要是为了小字符串 12B 做优化
arrow
arrow
* Short strings, length <= 12
| Bytes 0-3 | Bytes 4-15 |
|------------|---------------------------------------|
| length | data (padded with 0) |
* Long strings, length > 12
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 |
|------------|------------|------------|-------------|
| length | prefix | buf. index | offset |
buf index表示第几个buffer buf内的[offset,offset+length) 表示字符串
https://pola.rs/posts/polars-string-type/
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
常规
还有一种常规设计
* Short strings, length <= 12
| Bytes 0-3 | Bytes 4-15 |
|------------|---------------------------------------|
| length | data (padded with 0) |
* Long strings, length > 12
| Bytes 0-3 | Bytes 4-7 | Bytes 8-15 |
|------------|------------|------------|
| length | prefix | ptr |
比较直观
bool isEqual(data128_t a, data128_t b) {
if (a.v[0] != b.v[0]) return false;
auto len = (uint32_t) a.v[0];
if (len <= 12) return a.v[1] == b.v[1];
return memcmp((char*) a.v[1], (char*) b.v[1], len) == 0;
}
对于小字符串 能省非常多
不过std::string也有sso,对于小字符串,但是没有prefix优势/16B传参优势
duckdb实现 duckdb/blob/main/src/include/duckdb/common/types/string_type.hpp