blog review 第十八期 · 王很水的笔记

What is Backoff For?

在乐观事务冲突中，作者将了个点子，指数退避 Exponential backoff，降低冲突。也就是推迟一会。推迟多久合适呢，这很影响延迟。按照TCP窗口那种思路。可能

sleep = min(cap, base * 2 ** attemp)事件慢慢变长。这个思路

这个的问题在哪里？首先作为事务，你延迟肯定爆炸了。不在乎那点重试事件，但是，这个时间可能完全不需要这个时间。可能重试马上就成功，可能刚好用完这个时间窗

这个指数级别的时间还是太粗糙了

所以加上个随机就解决了sleep = random_between(0, min(cap, base * 2 ** attemp))

这个英文来说是引入Jitter 字面意思就是抖动一下，让这个时间模糊一下，而不是一直指数级别递增。让客户端们错开

这种错开思维有各种各样的展开思路，比如锁分片，比如redis过期时间别相同，差个几秒。

也可以调整边界。不过效果没有这个好。这里有模拟代码 https://github.com/aws-samples/aws-arch-backoff-simulator

这个场景的前提是，都是短时间大量竞争，乐观重试。这种场景。如果强度一直这么大，这种策略会把整体的延迟都拉低一个档次，造成雪崩。这个策略本质是推迟

推迟的本质是当前忙，过一会不忙，如果一直忙，推迟没有任何意义

The only way to deal with long-term overload is to reduce load, deferring load does not work.

这种策略做重试反而加重负担。还是用常规的重试策略。这里不是说指数退避重试不行。

一个好的重试策略是啥样的？经典方案，重试三次。失败就踢掉复杂方案，重试令牌桶，桶里N个，失败+1成功-1，桶占满就彻底失败，踢掉统计模式，失败率上升到一定程度就失败，踢掉

这三种方案各有不同适应场景。重试三次，简单，但是很武断，统计模式也算简单，负载也低。这两种是类似的，问题在于可能下一次就成功了重拾令牌桶，最悲观场景，和重试三次一样负载高。但成功率更高一些

再结合一下上面的指数退避模式，结合一下。还要考虑业务场景

如果连接固定，那么指数退避规避推迟一下就挺有效，如果连接非常多，压力一直持续，那么指数退避就不太行

不一定是连接数，也可能是任务数，也可能是线程数。大家都是一个展开思路。不要想的太死

作者给了一组模拟实验，和上面说的差不多

我佩服这哥们的一点就是他总是能把他的想法用代码模拟一下画出图来。我脑子就是一团浆糊，没法明确

The future is seamless and collaborative

介绍物化视图以及维护以及引出dataflow概念以及介绍 Materialize这个公司的产品

以及介绍双向的dataflow。以及讨论了一波CDRT。你们概念真特么多

网络通信中收发数据的正确姿势

在 select、poll 和 epoll 的 LT 模式下，可以直接设置检测 fd 的可读事件；
在 select、poll 和 epoll 的 LT 模式下不要直接设置检测 fd 的可写事件，应该先尝试发送数据，因为 TCP 窗口太小发不出去再设置检测 fd 的可写事件，一旦数据发出去应立即取消对可写事件的检测。
在 epoll 的 ET 模式下，需要发送数据后，每次都要设置检测可写事件。

这里有个描述错误，感谢 @山旮旯大侠 指正

Reducing Logging Cost by Two Orders of Magnitude using CLP

整了个log压缩工具，clp，来处理spark平台每天日增200G的日志数据

原理

这玩意是个列存，支持搜索的列存

那为啥不直接用时序数据库呢？效果不是一致的么？

作者只对比了列存压缩和使用gzip zstd压缩率的对比，没有和时序数据库的对比，实际上都是列存带来的优势，用时序不就一劳永逸了？

现成服务我觉得也做好了存储吧，比如fluent-bit 还支持sql

这里也有讨论 https://news.ycombinator.com/item?id=33032996

Index Structures, Access Methods, whatever

当我想到分类的时候，人家已经提前五年分好类在这里等我了

基于树Tree组织数据
- 基于Page
  - 就地更新Page型， UiP 例子：innodb
  - 复制修改回写型， COW-R COW-S R表示随机刷一个回去，S表示顺序刷回。S没有用的 COW-R典型代表， lmdb wiretiger
- LSM 例子leveldb
- index + log 比如rocksdb改良版 wiskey，blobdb，比如tokudb ForestDB （这个有压测，rocksdb吊锤）
基于hash组织数据（牺牲range信息）
- page bdb支持设置hash模式？ https://github.com/hyc/BerkeleyDB/ 代码我没有研究过
- LSM 比如hashkv https://github.com/LSMdb/HashKV SILK 这个我没找着，论文 https://www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf
- index + log 比如 bitcask，实际上这个也是 tree，hash tree，杂交品种，也是对数复杂度，另外微软的FASTER 也是这种模式。快。我之前有博客提到过

从经典的Read Update Memory 三角来分析

b-tree 读不错，牺牲写，最差最差，每次的更新都很小，都发生了回写，对于cache效率，聚集索引好过非聚集索引
- for a clustered index you should cache at least one key/pointer per leaf block but for a non-clustered index you need the entire index in RAM or there will be extra storage reads
- 空间放大也还好，最大的问题是碎片
LSM 写不错，牺牲读，空间利用率也可以，最大的问题是额外的浪费CPU（b-tree也得vacuum，也省不了吧）以及写放大
index + log 读写都可以，主要问题在于index本身，需要内存装一大部分

Deletes are fast and slow in an LSM

single delete API，当且仅当put一次，可以用，不然行为未定义，另外介绍了myrocks的配置 rocksdb_compaction_sequential_deletes 其实就是统计然后compact。一种优化

myrocks使用rocksdb还是比较激进的，一直用最新的

How I do performance tests for RocksDB

几个脚本。拷贝到这里来，主要是我家的网络打不开他的博客。。

[tools/benchmark.sh](https://github.com/facebook/rocksdb/blob/main/tools/benchmark.sh) - runs one benchmark step by invoking db_bench
[tools/benchmark_compare.sh](https://github.com/facebook/rocksdb/blob/main/tools/benchmark_compare.sh) - runs a benchmark by invoking a sequence of benchmark steps
[x.sh](https://github.com/mdcallag/mytools/blob/master/bench/rx2/x.sh) - selects configuration options based on HW size then calls benchmark_compare.sh
[x3.sh](https://github.com/mdcallag/mytools/blob/master/bench/rx2/x3.sh) - selects configuration options based on workload (IO-bound vs cached) then calls x.sh

The x3.sh script is invoked by me. It defines four workloads:

byrx
    byrx is short for cached by RocksDB and the database fits in the RocksDB block cache
byos
    byos is short for cached by OS and the database fits in the OS page cache but is larger than the RocksDB block cache. This simulates fast storage for reads and lets me increase stress on the RocksDB block cache code.
iobuf
    iobuf is short for IO-bound with buffered IO. The database is larger than RAM and RocksDB uses buffered IO.
iodir
    iodir is short for IO-bound with O_DIRECT. The database is larger than RAM and RocksDB uses O_DIRECT for user reads and compaction.

使用

nohup bash x3.sh 22 no 1800 c30r240 40000000 2000000000 iodir &

How I do RocksDB performance tests, part 2

列了一些要点，测什么

This extends on my previous post. This post isn’t specific to RocksDB. It also has more opinions and might serve as speaker notes were I to write slides. I am writing this prior to speaking to peers about my work so it might have an audience of one (me) but that is typical of many of my posts. Regardless, I appreciate that people read and engage with some of my posts.

Points

How did I get here?
    Long ago I worked on DBMS internals full time - I added features, fixed bugs and created bugs. Then I moved to web-scale MySQL at Google and started to spend time in production. Production is a great education but it came at the cost of less time for new features. I still spent much time finding and fixing bugs. After a few years I moved to Facebook and the trend continued. Over time I stopped doing feature development, spent much less time fixing bugs but still spend time reporting bugs. I read a lot of code when trying to explain things, but I don't write much that makes it upstream. I have enjoyed the change. I don't need to write code because I am surrounded by talented developers. I can specialize in my thing, and others specializing in their thing. It is hard to be expert in too many things.
Benchmarks are what you make of them
    They are far from perfect but they are quite useful. Testing by comparing things and explaining the results makes them more value. Benchmarks informed by production are even better.
How does code age?
    Single-thread performance on CPUs isn't improving like it used to. Long-lived code tends to attract CPU regressions. This combination is a problem. Good regression tests help spot the problems and spotting them early is a big deal because removing them long after they arrive is too expensive. This isn't just a technical problem. How do you reject new features that help a fraction of the user base when the cost if more CPU overhead for all users?
Needs improvement
    I hope to get better about using benchmarks that avoid coordinated omission, have more statistical rigor, expand beyond single-node tests, use benchmark workloads that are adaptive and use benchmarks that enforce response time constraints.
Build a network of smart peers
    I have been fortunate to have many talented peers. I engage with Postgres experts on Twitter and have also met smart people who work on upstream projects via bug report discussions. 
Explain things
    Explain your results. But find a balance because explaining everything will slow you down.
Testing an LSM is complicated
    Old posts on this are here and here.
    The shape of an LSM tree has more variance than the shape of a B-tree. This can be a source of variance in benchmarks, especially in read-heavy ones. While this is still a work in progress there are db_bench commands to make the LSM shape more deterministic (flush memtable, compact L0, compact L1, wait-for-compaction).
    Another problem is a test that inherits a non-deterministic amount of compaction debt. If the sequence is: —benchmarks=write-heavy,read-heavy then the read-heavy step might suffer from compaction debt inherited from write-heavy. The impact of reducing this debt during the read-heavy step can vary and produce confusing results for the read-heavy step.
    Try to get the LSM tree into a steady state before read-heavy tests. For example, after fillseq there is no overlap between SSTs. After a full compaction there is only one level (or one SST). These are usually not steady states.
    For a load + query benchmark it is easy for the LSM (or B-Tree) to not be in a steady state after the load and many benchmarks suffer from this. If the load is done in key order then the PK index won’t be fragmented with a B-Tree and the SSTs won’t overlap with an LSM — which hides some of the overhead that occurs during query processing. When storage is a local attached SSD and the workload is heavy on IO then you need to worry about non-determinism from the SSD — you either want no SSD GC or to get SSD GC into a steady state (by running the test for long enough and having database files that are large enough, something between 50% and 90% of the device capacity).
Make the DBMS unhappy
    Find ways to make the DBMS unhappy and see if it falls over. The challenge is that there are more and less realistic ways to make a DBMS fall over. An easy way to make a DBMS unhappy is to provide it with too many concurrent requests, especially a DBMS that doesn’t provide admission control (like RocksDB). But some problems are best fixed elsewhere because fixes have an opportunity cost. It might be nice to have an optional RocksDB API that implements admission control. 
Define your goals
    Do you care about average throughput or outliers (p99, p99.9, p99.99). I have a post on this. Average throughput is easy to measure but p99 and beyond (p99.9, p99.99) matters in production because outliers determine user experience and capacity planning is based on p99. While single-valued metrics like p99 are easy to share, graphs for throughput over time at 1-second intervals make it easier to spot problems like stalls, cyclic behavior or throughput that degrades over time.
Statistical rigor
    Statistical rigor is great but can be expensive. Repeating every benchmark 3 times increases the accuracy of your results at the cost of 3X more HW. I usually get less rigorous statistical rigor because I frequently repeat benchmark runs because I made a mistake or need to measure one more thing. Another way to think of this is: assume B units of HW capacity, each benchmark has a warmup cost of W and runtime of R. Then solve for N in B = N(W+R) where N is the number of times the benchmark is repeated. A larger value for N implies a smaller value for R and the confidence interval is a function of both N and R.
Coordinated omission
    Coordinated omission is a real problem. All of the benchmark clients that I use suffer from it, yet they are still useful. Two things prevent me from doing open-loop benchmarks. First, the benchmark clients I use don’t support it and it takes work to implement a new benchmark client and incorporate it into my workflow. Second, an open-loop benchmark takes more work to setup as I need to discover an arrival rate that the DBMS can handle — or I need a more complicated client that can discover it for me. One day I will use open-loop clients.
Response time constraints
    The db_bench benchmark client for RocksDB doesn't have an option to use response time constraints (ignore responses slower than X ms). Another problem is computing throughput without a response time constraint. More concurrency usually means more throughput, but it also means worse response time and more response time outliers. Those slow responses should not be counted. Most of the benchmark clients that I use don’t enforce a response time SLA. Such an SLA is more work, you need to select a reasonable value, but I hope to improve with this. I hope to add them to db_bench.
Single node
    Most of my testing runs the client and server on the same server. While I prefer to use separate servers for client & server when the DBMS supports it, that introduces the risk of perf variance because I will be sharing the network.
Stable platform
    I use HW at work, in the public cloud and my home test cluster. My work HW has value-added services that consume a variable and occasionally significant amount of compute and storage so I am wary of using it for low-concurrency benchmarks. Public cloud HW means I am using a VM and might be sharing compute and storage with noisy neighbors so I found a way to reduce the CPU variance by using the largest number of CPUs for a given instance type and disabling HT. From quick tests with fio there wasn't much variance in the cloud block storage I chose. My home HW is the most stable after I disabled HT and turbo boost. Alas, it is also the least capable — 4 CPUs, 16G of RAM.
Compare things 
    I rarely test one thing in isolation because the results are hard to interpret. So I do A/B or even A/B/C/D/... testing where these represent different DBMS, different versions of the same DBMS or different configurations for one version of one DBMS.
Measure things
    Start with throughput, then add variance, then add CPU, IO and memory. Foreground CPU and IO can remain constant while background CPU and IO change significantly and most DBMS do much work in the background (excluding SQLite which doesn’t have background threads. Don’t forget to watch VSZ/RSS for the DBMS processes because increases there might lead to OOM. Has disk space usage increases because that can lead to out of space errors. When something is slower search top down. Look at iostat metrics to see if IO/query has changed. Look at vmstat to see if CPU/query has changed. Look at vmstat to see if context switches/query has changed (mutex contention?). Normalize your metrics — IO/query, CPU/query, context switches/query. I frequently have scripts running that scrape output from ps and top. To watch for disk space issues I have a script that runs du and ls in a loop during benchmarks.
Summarize things
    One practice I have it to create one line performance summaries with useful numbers for throughput, HW (CPU/storage/memory/disk space) usage, normalized HW usage (CPU/query, IO/query). One line summaries make it easy to compare performance when A/B or A/B/C/D/... testing is done. They also make it easy to spot regressions that don't directly hurt throughput but are a concern -- larger RSS for the DBMS process, more disk space used, more CPU consumed by background threads. The summaries also provide a starting point when I try to explain a performance change. An example is here.
Name & archive things
    A mistake I have made many times is starting a benchmark, getting interrupted for a week and forgetting an important detail about the benchmark when I return. Naming patterns reduces the change of this. I try to archive the test scripts and command lines via Github. Saving RocksDB LOG files is also important. All of my important scripts are in Github.
Adaptive benchmark clients
    I often have to assume things when configuring a benchmark client. The number of threads that db_bench uses for clients is currently fixed. It would be nice to have some benchmarks that increase the request rate or number of request clients over time or until a response time constraint is violated. I currently do this manually and my solution is sad.
Proactive vs reactive
    Is it a bug when it has yet to happen in production? That is an interesting question. The answer requires nuance. Some bugs do happen but have yet to be noticed, some bugs might happen and are worth avoiding other bugs just aren't worth fixing. It isn't always easy to classify a bug into one of these groups.

Default options in MyRocks

基本配置

    rocksdb_default_cf_options=block_based_table_factory={
        cache_index_and_filter_blocks=1;
        filter_policy=bloomfilter:10:false;
        whole_key_filtering=1};
    level_compaction_dynamic_level_bytes=true;
    optimize_filters_for_hits=true;
    compaction_pri=kMinOverlappingRatio

不同层不同压缩策略

compression_per_level=kNoCompression:kNoCompression:kNoCompression:kCompression:$fast:$fast;bottommost_compression=$slow

没啥说的 fast用lz4/snappy，slow用zstd

另外有个compaction triger，如果tombstone过多

     rocksdb_compaction_sequential_deletes   0
     rocksdb_compaction_sequential_deletes_count_sd  O
     rocksdb_compaction_sequential_deletes_file_size 0
     rocksdb_compaction_sequential_deletes_window    0

这个是myrocks层实现的，统计计数，然后NeedCompact

代码

对自己的业务有需要的可以自己配置

Magma, a new storage engine for Couchbase

论文 https://www.vldb.org/pvldb/vol15/p3496-lakshman.pdf、

也是index log模式

A summary of the implementation: writes are persisted to a WAL and then added to the write cache. I assume each document gets a unique seqno at that time. the write cache (RocksDB memtable) buffers KV pairs with 2 skiplists. The paper first describes them as the active and immutable lists. Later it states one skiplist orders by PK and the other by seqno. I believe the later description. When full the documents are appended to the tail of the open log segment and the (key,seqno, doc size) tuples are flushed to the LSM index. an LSM tree index stores (key, seqno, doc size) tuples to map a document PK to a seqno the Log Structured Object Store is composed of log segments. Per-segment metadata includes the min and max seqno in that segment and the seqno values do not overlap between segments. there is a document cache and index block cache. The index block cache caches blocks from the LSM index and from the per-segment B-tree indexes.

这里也有介绍 https://zhuanlan.zhihu.com/p/572612966

相比于经典的 WiscKey，这篇论文最大的贡献在于 GC 的设计，其不需要像 WiscKey 在 GC 时扫描整个 value log 并进行 point get 查询，而是首先对 key sst 进行 compact 产生 deleted key set，再根据 deleted key set 中记录的 value sequence id 计算得出每段 value log 对应的垃圾数据的比例，只需要对垃圾比例较高的 value log 段与邻接段进行 merge 即可。这样既可以避免 WiscKey GC 时的大量 point get，也可以根据垃圾比例对 value log 进行分段 compact，降低写入放大。

这个问题在于空间放大吧，空洞率不满足条件的占用无法释放。但是GC不扫描log文件还是挺有新意的。fasterkv 的compact就要扫描log文件，这个读放大是非常恐怖的。如果能得知每段文件哪些key被删了，效果确实会好很多

不过fasterkv还有个问题，就是log文件本身也有链表信息，这个信息很难受，不能删掉

Hyping the hyper clock cache in RocksDB

rocksdb 7.7.3 hyper clock cache比LRU cache有巨大性能提升

Small servers for database performance tests

老哥的NUC配置，我的笔记本也打算这么搞一下

mount配置

UUID=...  /data  xfs  noatime,nodiratime,discard,noauto  0 1

这个time啥的我没配置

调试工具，ubuntu默认关闭

echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/yama/ptrace_scope
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"

echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/yama/ptrace_scope
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
echo 1 > /proc/sys/kernel/sysrq
echo x > /proc/sysrq-trigger
mount -o noatime,nodiratime,discard,noauto /dev/nvme0n1 /data
echo '0' > /sys/devices/system/cpu/cpufreq/boost

tsc unstable

    edit /etc/default/grub -> GRUB_CMDLINE_LINUX_DEFAULT="clocksource=tsc tsc=reliable"
    update-grub
    reboot

关闭大页

echo $1 > /sys/kernel/mm/transparent_hugepage/defrag
echo $1 > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/enabled

或者放在grub里

GRUB_CMDLINE_LINUX_DEFAULT="transparent_hugepage=never"
update-grub

网络设置就算了

查看ssd

sudo nvme smart-log /dev/nvme0
smartctl -A /dev/sda

Quantifying storage on Linux

#    /sys/block/$device/queue/*
#    lsblk -t $dev
#    xfs_info

# v3.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none     1023 128    0B

# v4.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none      255 128    0B

# GCP
$ lsblk -t /dev/sdb
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
sdb          0   4096      0    4096     512    0 none     8192 128    4G



#From /sys/block/$device/queue/$name
#v3      v4      GCP     name
#512     512     4096    physical_block_size
#512     512     512     logical_block_size
#512     512     512     hw_sector_size
#512     512     4096    minimum_io_size
#512     512     4096    physical_block_size
#none    none    none    scheduler
#512     512     4096    discard_granularity
#1280    256     246     max_sectors_kb
#nvme0n1 nvme0n1 sdb     $device


$ xfs_info /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=30524162 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=122096646, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


# GCP

$ xfs_info /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=4, agsize=196608000 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=786432000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=384000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Building a high-performance database buffer pool in Zig using io_uring’s new fixed-buffer mode

大部分linux内核都死在5.x了。想用也用不到

Simple, Fast, and Scalable Reverse Image Search Using Perceptual Hashes and DynamoDB

一个图用相似hash算法生成几段hash 比较图片就比较hash的汉明距离，因为图片相似的汉明距离肯定低，通过这两种来实现相似图片查找。拿dynamodb当kv用。

k8s分析流程图

Umbra: A Disk-Based System with In-Memory Performance (Thomas Neumann)

论文在这里 https://www.cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf 他这个ppt我没有。讲的很多是buffer pool的设计，跟这个博客说的差不多 https://nan01ab.github.io/2020/12/Umbra.html

我直接把结论贴过来吧。没开源，看不出什么东西来

有一些设计学的 https://github.com/leanstore/leanstore

Database Implementation For Modern Hardware - Thomas Neumann

这哥们是hyper前开发，创业了，整了个umbra。这个是他讲的课。我简单根据ppt总结一下

有点长，当最后一个了

disk

Techniques to speed up disk access: • do not move the head for every single tuple • instead, load larger chunks • typical granularity: one page • page size varies. traditionally 4KB, nowadays often 16K and more • page size is a trade-off

The page structure is very prominent within the DBMS • granularity of I/O • granularity of buffering/memory management • granularity of recovery Page is still too small to hide random I/O though • sequential page access is important • DBMSs use read-ahead techniques • asynchronous write-back

buffer

Some pages are accessed very frequently • reduce I/O by buffering/caching • buffer manager keeps active pages in memory • limited size, discards/write back when needed • coupled with recovery, in particular logging Basic interface:

FIX(pageNo,shared)
UNFIX(pageNo,dirty) Pages can only be accessed (or modified) when they are fixed.

Buffer Frame

Maintains the state of a certain page within the buffer

pageNo the page number
latch a read/writer lock to protect the page(note: must not block unrelated pages!)
LSN: LSN of the last change, for recovery (buffer manager must force the log before writing)
state clean/dirty/newly created etc.
data the actual data contained on the page

(will usually contain extra information for buffer replacement)

Usually kept in a hash table

Buffer Replacement

FIFO, LRU, 2Q, LFU

Buffer Replacement second chance

LRU is nice, but the LRU list is a hot spot. Idea: use a simpler mechanism to simulate LRU • one bit per page • bit is set when page is unfixed • when replacing, replace pages with unset bit • set bits are cleared during the process • strategy is called “second chance” or “clock

Buffer Replacement 2Q

Maintain not one queue but two • many pages are referenced only once • some pages are hot and reference frequently • maintain a separate list for those

maintain all pages in FIFO queue
when a page is references again that is currently in FIFO, move it into an LRU queue
prefer evicting from FIFO Hot pages are in LRU, read-once pages in FIFO. Good strategy for common DBMS operations.

Buffer Replacement Hints

Application knowledge can help buffer replacement • 2Q tries to recognize read-once pages • these occur when scanning over data • but the DBMS knows this anyway! • it could therefore give hints when unfixing • e.g., will-need, or will-not-need (changes placement in queue)

Segments

While page granularity is fine for I/O, it is somewhat unwieldy • most structures within a DBMS span multiple pages • relations, indexes, free space management, etc. • convenient to treat these as one entity • all DBMS pages are partitioned into sets of pages Such a set of pages is called a segment. Conceptually similar to a file or virtual memory.

Shadow Paging

uses a page table, dirty pages are stored in a shadow copy.

Advantages: • the clean data is always available on disk • greatly simplified recovery • can be used for transaction isolation, too Disadvantages: • complicates the page access logic • destroys data locality Nowadays rarely used in disk-based systems

Delta Files

Similar idea to shadow paging: • on change pages are copied to a separate file • a copied page can be changed in-place • on commit discard the file, on abort copy back Can be implemented in two flavors: • store a clean copy in the delta • store the dirty data in the delta Both have pros and cons.

Delta files have some advantages over shadow paging: • preserve data locality • no mixture of clean and dirty pages Disadvantages: • cause more I/O • abort (or commit) becomes expensive • keeping track of delta pages is non-trivial Still, often preferable over shadow paging.

数据结构

空间管理
- Free Space Inventory
- slot page管理
  - 多个事务操作同一个page引发冲突 -> TID直接放到slot里
数据落盘，数据怎么存？TLV？硬编码？padding？
- 非常规数据
  - 如何表达NULL？
- 压缩问题
  - 单条压缩，访问性能差，有依赖问题，dict 压缩可行
- 长数据问题，改成指针指向文件
  - BLOB类型的优化，用户可能会乱用，所以你得优化BLOB短的场景
索引落地
- b tree -> b+tree
  - 查找二分优势

slot page定义

Header:

LSN for recovery
slotCount number of used slots
firstFreeSlot to speed up locating free slots
dataStart lower end of the data
freeSpace space that would be available after compactification

Note: a slotted page can contain hundreds of entries! Requires some care to get good performance.

特殊slot free slot offset = 0 len=0 空slot offset > 0 len=0

record格式 int int len str int len str， TLV