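RocksDB DelayWrite deadlock
Scenario: mongorocks running on top of RocksDB. The version is 5.1.2 on an internal branch that has merged some upstream changes plus private in-house changes. Keep that in mind for everything below.
We hit a DelayWrite hang: every thread was parked in blockawait, waiting for the write leader. The candidate explanations are that the leader became invalid, was reset, or stopped; that DelayWrite was genuinely required, for example because compaction or flush was stuck; that there was no writer thread; and so on. Below I walk through the deadlocks that have been reported in the RocksDB community.
These are the causes RocksDB lists for write stalls: https://github.com/facebook/rocksdb/wiki/Write-Stalls
This is the DelayWrite part of the code: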
if (UNLIKELY(status.ok() && (write_controller_.IsStopped() ||
                             write_controller_.NeedsDelay()))) {
  PERF_TIMER_STOP(write_pre_and_post_process_time);
  PERF_TIMER_GUARD(write_delay_time);
  // We don't know size of curent batch so that we always use the size
  // for previous one. It might create a fairness issue that expiration
  // might happen for smaller writes but larger writes can go through.
  // Can optimize it if it is an issue.
  status = DelayWrite(last_batch_group_size_, write_options);
  PERF_TIMER_START(write_pre_and_post_process_time);
}
...
// REQUIRES: mutex_ is held
// REQUIRES: this thread is currently at the front of the writer queue
Status DBImpl::DelayWrite(uint64_t num_bytes,
                          const WriteOptions& write_options) {
  uint64_t time_delayed = 0;
  bool delayed = false;
  {
    StopWatch sw(env_, stats_, WRITE_STALL, &time_delayed);
    uint64_t delay = write_controller_.GetDelay(env_, num_bytes);
    if (delay > 0) {
      if (write_options.no_slowdown) {
        return Status::Incomplete();
      }
      TEST_SYNC_POINT("DBImpl::DelayWrite:Sleep");
      mutex_.Unlock();
      // We will delay the write until we have slept for delay ms or
      // we don't need a delay anymore
      const uint64_t kDelayInterval = 1000;
      uint64_t stall_end = sw.start_time() + delay;
      while (write_controller_.NeedsDelay()) {
        if (env_->NowMicros() >= stall_end) {
          // We already delayed this write `delay` microseconds
          break;
        }
        delayed = true;
        // Sleep for 0.001 seconds
        env_->SleepForMicroseconds(kDelayInterval);
      }
      mutex_.Lock();
    }
    while (bg_error_.ok() && write_controller_.IsStopped()) {
      if (write_options.no_slowdown) {
        return Status::Incomplete();
      }
      delayed = true;
      TEST_SYNC_POINT("DBImpl::DelayWrite:Wait");
      bg_cv_.Wait();
    }
  }
  assert(!delayed || !write_options.no_slowdown);
  if (delayed) {
    default_cf_internal_stats_->AddDBStats(InternalStats::WRITE_STALL_MICROS,
                                           time_delayed);
    RecordTick(stats_, STALL_MICROS, time_delayed);
  }
  return bg_error_;
}
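To make the second loop (the `IsStopped()` wait) easier to reason about, here is a minimal standalone model of it. This is not RocksDB code: `StallState`, `WaitWhileStopped` and `ClearStop` are hypothetical names, and the timeout is my own addition (the real `bg_cv_.Wait()` has none) so that a missing wake-up is reported instead of hanging forever.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the state DelayWrite looks at; not RocksDB code.
struct StallState {
  std::mutex mu;
  std::condition_variable bg_cv;  // plays the role of bg_cv_
  bool stopped = true;            // plays the role of write_controller_.IsStopped()
  bool bg_error = false;          // plays the role of !bg_error_.ok()
};

enum class WaitResult { kResumed, kBgError, kTimedOut };

// Waits like `while (bg_error_.ok() && IsStopped()) bg_cv_.Wait();`, but with
// a deadline (an assumption of this sketch) so a lost wake-up is visible.
WaitResult WaitWhileStopped(StallState& s, std::chrono::milliseconds deadline) {
  assert(deadline.count() >= 0);
  std::unique_lock<std::mutex> lock(s.mu);
  const bool woke = s.bg_cv.wait_for(
      lock, deadline, [&s] { return s.bg_error || !s.stopped; });
  if (!woke) {
    return WaitResult::kTimedOut;  // nobody cleared the stop: the hang case
  }
  return s.bg_error ? WaitResult::kBgError : WaitResult::kResumed;
}

// What a background flush or compaction is expected to do once it has made
// progress: clear the stop condition and wake every waiter.
void ClearStop(StallState& s) {
  {
    std::lock_guard<std::mutex> lock(s.mu);
    s.stopped = false;
  }
  s.bg_cv.notify_all();
}
```

If no background thread ever calls `ClearStop` and no error is set, `WaitWhileStopped` ends in `kTimedOut`. In the real code there is no deadline, so the same situation is an indefinite wait by the write leader, with every other writer queued behind it.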
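Analysis of community issues and merge log
#pr 4475 (reference 1): looks somewhat like our blocked write leader
I went through this PR. It targets writers whose write options set no_slowdown: they are handled separately so that an ongoing write stall does not block them, and they get Status::Incomplete back immediately. In the PR's own words: "Fix corner case where a write group leader blocked due to write stall blocks other writers in queue with WriteOptions::no_slowdown set." It still does not tell us whether our write leader was stalled.
The change itself is quite interesting, see reference link 2.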
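The behaviour described for this PR can be modelled with a toy writer queue. This is a simplified sketch under my own assumptions, not the upstream implementation: `WriterQueue`, `Writer`, `JoinResult` and all method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

struct Writer {
  std::string name;
  bool no_slowdown;  // mirrors WriteOptions::no_slowdown
};

enum class JoinResult { kQueued, kIncomplete };

// Toy model: while a stall is active, no_slowdown writers fail fast instead
// of waiting behind the stalled leader.
class WriterQueue {
 public:
  // Called when the leader is about to block on a stall. Writers already
  // queued with no_slowdown are removed and returned as rejected.
  std::vector<std::string> BeginStall() {
    stalled_ = true;
    std::vector<std::string> rejected;
    std::deque<Writer> kept;
    for (const Writer& w : queue_) {
      if (w.no_slowdown) {
        rejected.push_back(w.name);
      } else {
        kept.push_back(w);
      }
    }
    queue_.swap(kept);
    return rejected;
  }

  void EndStall() { stalled_ = false; }

  // New arrivals during a stall: no_slowdown writers get kIncomplete.
  JoinResult Join(const Writer& w) {
    assert(!w.name.empty());
    if (stalled_ && w.no_slowdown) {
      return JoinResult::kIncomplete;
    }
    queue_.push_back(w);
    return JoinResult::kQueued;
  }

  std::size_t size() const { return queue_.size(); }

 private:
  std::deque<Writer> queue_;
  bool stalled_ = false;
};
```

The point of the model is that the fix only helps writers that opted into `no_slowdown`. Ordinary writers still wait behind the leader, so it does not explain a leader that never wakes up.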
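#issue 1297 (reference 3): DelayWrite deadlock
This issue was reported by CockroachDB (a database whose code is worth reading when I get the chance).
It affects versions before 4.9: using an EventListener can deadlock. The root cause is that something is not unlocked internally; the fix moves the release ahead of the listener notification: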
+ c->ReleaseCompactionFiles(status);
+ *made_progress = true;
NotifyOnCompactionCompleted(
c->column_family_data(), c.get(), status,
compaction_job_stats, job_context->job_id);
- c->ReleaseCompactionFiles(status);
- *made_progress = true;
Further down the thread there is also a wait caused by a bad setting, max_background_compactions=0. I will not go into it here.
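#pr 1884 (reference 4): shorter sleep when there is no bg work
This change tunes the DelayWrite logic. When there is no background work, that is, bg_compaction_scheduled_, bg_flush_scheduled_ and friends are all 0, the old code slept for a while and then went into the wait. If write_controller became able to make progress during that time, continuing to sleep was wasted. The change sleeps one unit interval at a time and re-checks the condition after each one. It is unrelated to the deadlock.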
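The sleep-in-slices idea is the same one visible in the `kDelayInterval` loop of DelayWrite above. A minimal sketch, with the clock and the condition injected so it can be exercised without real sleeping; `SleepInSlices` and its parameters are hypothetical names, not RocksDB APIs:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

// Sleeps in `interval_us` slices until `needs_delay()` turns false or
// `delay_us` has elapsed in total. Returns the microseconds actually slept.
uint64_t SleepInSlices(uint64_t delay_us, uint64_t interval_us,
                       const std::function<bool()>& needs_delay,
                       const std::function<void(uint64_t)>& sleep_us) {
  assert(interval_us > 0);
  uint64_t slept = 0;
  while (slept < delay_us && needs_delay()) {
    // Never sleep past the requested total delay.
    const uint64_t step = std::min(interval_us, delay_us - slept);
    sleep_us(step);
    slept += step;
  }
  return slept;
}
```

With a 10 ms requested delay, 1 ms slices, and a condition that clears after 3 ms, the function returns after 3 ms of sleeping. A single `sleep(10 ms)` would have wasted the remaining 7 ms.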
Also note how the PR measures the write stall rate. It is a handy technique:
# fillrandom
# memtable size = 10MB
# value size = 1 MB
# num = 1000
# use /dev/shm
./db_bench --benchmarks="fillrandom,stats" --value_size=1048576 --write_buffer_size=10485760 --num=1000 --delayed_write_rate=XXXXX --db="/dev/shm/new_stall" | grep "Cumulative stall"
#issue 1235 (reference 5): hit write stall
The symptom described in this issue matches ours exactly, but the cause could not be pinned down.
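#pr 4615 (reference 6): avoid a wait hang caused by an unexpected error during manual flush
This optimization builds on WaitUntilFlushWouldNotStallWrites. If background work is stopped because of an error (for example the DB is read-only, which always produces one), the call now returns instead of waiting on the stall. Otherwise the background work would never be triggered because of the error, and the caller would hang there forever. The code our project uses does not include this optimization (nor, presumably, the WaitUntilFlushWouldNotStallWrites logic it builds on), so in theory it should not be affected.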
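The early-return rule can be written as a small decision function that a wait loop would evaluate on every wake-up. This is only a sketch of the idea as described above, not the upstream code: `BgSnapshot`, `FlushWait`, `Decide` and `ReplayWait` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// One observation of the background state, taken on each wake-up.
struct BgSnapshot {
  bool flush_would_stall;  // flushing now would push writes into a stall
  bool bg_error;           // a background error has been recorded
  bool bg_work_stopped;    // background work will not run any more
};

enum class FlushWait { kProceed, kAbort, kKeepWaiting };

// Error and stopped states are checked first: if background work can never
// run, waiting for it to clear the stall would never end.
FlushWait Decide(const BgSnapshot& s) {
  if (s.bg_error || s.bg_work_stopped) {
    return FlushWait::kAbort;
  }
  return s.flush_would_stall ? FlushWait::kKeepWaiting : FlushWait::kProceed;
}

// Replays a sequence of wake-ups and reports how the wait ends. Ending in
// kKeepWaiting means the sequence ran out while still blocked.
FlushWait ReplayWait(const std::vector<BgSnapshot>& wakeups) {
  assert(!wakeups.empty());
  for (const BgSnapshot& s : wakeups) {
    const FlushWait d = Decide(s);
    if (d != FlushWait::kKeepWaiting) {
      return d;
    }
  }
  return FlushWait::kKeepWaiting;
}
```

Without the first `if` in `Decide`, a read-only DB would keep producing `kKeepWaiting` on every wake-up, which is the hang the PR removes.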
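#pr 4611 (reference 7): avoid a hang caused by manual compaction in read-only mode
I did not understand how this one ends up hanging. It looks like a shared_ptr that is not released? Worth studying later.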
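#pr 3923 (reference 8): possible deadlock with enable_pipelined_write=true, fixed in 15.0
This one is also a case of locking twice; unlocking on the outside fixes it. db_bench can hit this problem too.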
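#pr 4751 (reference 9): file ingestion triggering a flush causes a deadlock, fixed in 17.2
This is another deadlock introduced by WaitUntilFlushWouldNotStallWrites: the flush enters the write stall wait while WaitForIngestFile is involved. There is a related #issue 5007 (reference 10). The fix is to let file ingestion skip the write stall.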
#pr 1480 (reference 11): IngestExternalFile causes a deadlock
I did not understand the mechanism here. When there is time, analyze the test code: https://github.com/facebook/rocksdb/pull/1480/files
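#commit 6ea41f852708cf09d861894d33e1b65cd1d81c45 (reference 12): Fix deadlock when trying update options when write stalls
This prevents option changes made during a write stall from confusing the triggers. It adds a NeedFlushOrCompaction check.
The problem we hit is still under analysis, and it may turn out not to be caused by RocksDB at all.
References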
- https://github.com/facebook/rocksdb/pull/4475
- https://github.com/facebook/rocksdb/blob/master/db/db_impl_write.cc#L1242
- https://github.com/facebook/rocksdb/issues/1297
- https://github.com/facebook/rocksdb/pull/1884
- https://github.com/facebook/rocksdb/issues/1235
- https://github.com/facebook/rocksdb/pull/4615
- https://github.com/facebook/rocksdb/pull/4611/
- https://github.com/facebook/rocksdb/pull/3923
- https://github.com/facebook/rocksdb/pull/4751
- https://github.com/facebook/rocksdb/issues/5007
- https://github.com/facebook/rocksdb/pull/1480
- https://github.com/facebook/rocksdb/commit/6ea41f852708cf09d861894d33e1b65cd1d81c45