How Rockset uses RocksDB


2021-12-10: their RocksDB fork is now open source. Worth a look when I have time: https://github.com/rockset/rocksdb-cloud

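Rockset is a database service provider. They use RocksDB to implement "converged indexing". I don't fully understand the term; reference 2 explains it. Roughly, every document is stored three ways, as rows, as columns, and as an index, and all of it lives in RocksDB.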

The architecture is as follows:

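A table created by a user is split into N shards, and each shard has two or more replicas. Each shard replica is placed on a RocksDB leaf node, and each leaf node holds many replicas of many tables (over a hundred in their production environment). Each shard replica on a leaf node has its own RocksDB instance. See references 3 and 4 for more details.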

Their optimizations are listed below.

rocksdb-cloud

RocksDB is an embedded data store and is not highly available by itself. Rockset built rocksdb-cloud, which uses S3 to provide high availability.

Disabling the WAL

The architecture already has a distributed log store that maintains the log, so RocksDB's own WAL is not needed.

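Write Rate Limit

Leaf nodes serve both queries and writes. Rockset can tolerate heavy writes causing higher query latency, but they still want query performance to be as steady as possible. So they cap the write rate of each RocksDB instance and limit the number of concurrent writer threads, which reduces the query latency caused by writes.

While limiting writes, they also have to keep the LSM tree balanced and proactively trigger RocksDB's stall mechanism (? apparently stock RocksDB does not offer this, so Rockset implemented it themselves). Rockset also has to implement throttling all the way from the application down to RocksDB.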
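As a rough illustration of application-side write throttling, here is a minimal token-bucket sketch. It is not Rockset's implementation and it does not use any RocksDB API; the class name WriteRateLimiter, its parameters, and the explicit clock argument are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical application-side limiter (illustrative only).
// Time is passed in explicitly so the behaviour is deterministic.
class WriteRateLimiter {
 public:
  WriteRateLimiter(std::int64_t bytes_per_sec, std::int64_t burst_bytes)
      : rate_(bytes_per_sec), burst_(burst_bytes), tokens_(burst_bytes) {}

  // Returns true if a write of `bytes` may proceed at time `now_us`
  // (microseconds on any monotonic clock). False means "back off and retry".
  bool try_acquire(std::int64_t bytes, std::int64_t now_us) {
    refill(now_us);
    if (bytes > tokens_) return false;
    tokens_ -= bytes;
    return true;
  }

 private:
  void refill(std::int64_t now_us) {
    if (now_us <= last_us_) return;
    const std::int64_t earned = (now_us - last_us_) * rate_ / 1000000;
    if (earned == 0) return;  // keep last_us_ so tiny intervals accumulate
    tokens_ = std::min(burst_, tokens_ + earned);  // never exceed the burst size
    last_us_ = now_us;
  }

  std::int64_t rate_;
  std::int64_t burst_;
  std::int64_t tokens_;
  std::int64_t last_us_ = 0;
};

// Usage: 1000 bytes/s with a 500-byte burst.
inline void rate_limit_example() {
  WriteRateLimiter limiter(1000, 500);
  assert(limiter.try_acquire(400, 0));    // fits in the initial burst
  assert(!limiter.try_acquire(200, 0));   // only 100 tokens left
  assert(limiter.try_acquire(200, 100000));  // 0.1 s later: 100 more tokens
}
```

A writer thread that gets false would sleep or queue the batch instead of calling into the storage engine, which is one way to keep write bursts from starving queries.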

Sorted Write Batch

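If the keys in a write batch are already sorted, concurrent writes are faster, so the application layer sorts each batch itself before writing.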
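A minimal sketch of sorting a batch before handing it to the storage engine. The KeyValue alias and sort_batch function are hypothetical names; the sketch assumes plain byte-wise key order, which is what std::string comparison gives, and it uses a stable sort so that repeated updates to the same key keep their original order.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, std::string>;  // hypothetical batch entry

// Sort by key only. stable_sort keeps later updates to a key after earlier ones,
// so "last write wins" is preserved when the batch is applied in order.
inline void sort_batch(std::vector<KeyValue>& batch) {
  std::stable_sort(batch.begin(), batch.end(),
                   [](const KeyValue& a, const KeyValue& b) {
                     return a.first < b.first;
                   });
}

inline void sort_batch_example() {
  std::vector<KeyValue> batch = {{"b", "1"}, {"a", "2"}, {"b", "3"}};
  sort_batch(batch);
  assert(batch[0].first == "a");
  assert(batch[1].second == "1");  // first update to "b"
  assert(batch[2].second == "3");  // second update to "b" stays last
}
```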

Dynamic Level Target Sizes

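This concerns RocksDB's compaction strategy. With level compaction, a level is not compacted until its total file size reaches its target, and each level's target is ten times that of the level above it. The resulting space amplification can be severe (see reference 5). To avoid this, the level targets are made dynamic, which keeps space amplification under control.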

AdvancedColumnFamilyOptions::level_compaction_dynamic_level_bytes = true
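To make the idea concrete, here is a simplified sketch of how dynamic targets can be derived from the size of the last level, following the description in reference 5. The function name and parameters are hypothetical, and RocksDB's real calculation handles more cases than this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns a target size per level (index = level). Level 0 is governed by file
// count, so index 0 stays 0. A target of 0 means the level is left empty.
// Simplified: the last level's target is its actual size, and each level above
// it gets 1/multiplier of the level below, stopping at the first level whose
// target is no larger than max_bytes_for_level_base.
inline std::vector<std::uint64_t> dynamic_level_targets(
    std::uint64_t last_level_bytes, std::uint64_t max_bytes_for_level_base,
    std::uint64_t multiplier, int num_levels) {
  std::vector<std::uint64_t> targets(static_cast<std::size_t>(num_levels), 0);
  std::uint64_t size = last_level_bytes;
  for (int level = num_levels - 1; level >= 1; --level) {
    targets[static_cast<std::size_t>(level)] = size;
    if (size <= max_bytes_for_level_base) break;  // this becomes the base level
    size /= multiplier;
  }
  return targets;
}

inline void dynamic_level_example() {
  // 7 levels, last level holds 200000 units, base target 1000, multiplier 10.
  const auto t = dynamic_level_targets(200000, 1000, 10, 7);
  assert(t[6] == 200000);
  assert(t[5] == 20000);
  assert(t[4] == 2000);
  assert(t[3] == 200);          // base level
  assert(t[2] == 0 && t[1] == 0);  // unused levels
}
```

Because every target is anchored to the real size of the last level, most of the data sits in the last level and the levels above it stay proportionally small.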

Shared Block Cache

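This one is a rule of thumb: within a single application, share one block cache across all RocksDB instances so that memory is used more effectively.

Rockset gives 25% of memory to the block cache and deliberately leaves a portion for the OS page cache. The page cache holds compressed blocks while the block cache holds uncompressed blocks, so the page cache takes some read pressure off the system.

The paper in reference 6 also discusses this.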

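No compression for L0 and L1

Compressing L0 and L1 files brings little benefit. In addition, compacting L0 into L1 has to access the L1 files, and range scans cannot use the bloom filter on L0, so compression there just wastes CPU.

The RocksDB team also recommends leaving L0 and L1 uncompressed and using LZ4 for the remaining levels.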

bloom filter on key prefix

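This follows from Rockset's design. Every field of every document is stored three ways (row, column, index), which means three key ranges, so queries also take three forms. They use prefix range scans rather than point lookups, so BlockBasedTableOptions::whole_key_filtering is set to false. Whole-key bloom filters are not useful in that case, so they set a custom ColumnFamilyOptions::prefix_extractor and build bloom filters on specific key prefixes.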
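The idea of filtering on a key prefix instead of the whole key can be sketched with a toy bloom filter. Everything here is an assumption for illustration: the fixed 4-byte prefix, the key layout, and the PrefixBloom class are hypothetical and unrelated to RocksDB's actual filter implementation.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

constexpr std::size_t kPrefixLen = 4;  // hypothetical fixed prefix length

// Stand-in for a prefix extractor: the first kPrefixLen bytes of the key.
inline std::string extract_prefix(const std::string& key) {
  return key.substr(0, kPrefixLen);
}

// Toy bloom filter that stores key prefixes rather than whole keys.
class PrefixBloom {
 public:
  void add_key(const std::string& key) {
    const std::string p = extract_prefix(key);
    bits_.set(slot(p, 'a'));
    bits_.set(slot(p, 'b'));
  }

  // false: no stored key has this prefix. true: some key may have it.
  bool may_contain_prefix(const std::string& key_or_prefix) const {
    const std::string p = extract_prefix(key_or_prefix);
    return bits_.test(slot(p, 'a')) && bits_.test(slot(p, 'b'));
  }

 private:
  static constexpr std::size_t kBits = 4096;

  // Two hash functions derived from std::hash by appending a salt byte.
  static std::size_t slot(const std::string& prefix, char salt) {
    return std::hash<std::string>{}(prefix + salt) % kBits;
  }

  std::bitset<kBits> bits_;
};

inline void prefix_bloom_example() {
  PrefixBloom filter;
  filter.add_key("col.price.doc1");
  // A range scan over "col." can consult the filter even though it never
  // supplies a full key; a bloom filter has no false negatives.
  assert(filter.may_contain_prefix("col."));
  assert(filter.may_contain_prefix("col.price.doc9"));
}
```

A scan whose prefix is reported absent can skip the file entirely, which is the benefit a whole-key filter cannot give to range queries.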

Iterator freepool

Heavy range-query traffic creates a large number of iterators, which is expensive, so they keep a pool of iterators and reuse them as much as possible.
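A free pool can be sketched as a stack of idle objects that are handed out again instead of being rebuilt. FakeIterator and FreePool below are hypothetical stand-ins, not Rockset's or RocksDB's types, and the sketch is single-threaded; a real pool would need locking and would have to make sure a reused iterator does not serve a stale view of the data.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive-to-create iterator.
struct FakeIterator {
  int seeks = 0;
  void Seek(int /*key*/) { ++seeks; }
};

template <typename T>
class FreePool {
 public:
  // Hands out an idle object if one exists, otherwise creates a new one.
  std::unique_ptr<T> acquire() {
    if (free_.empty()) {
      ++created_;
      return std::make_unique<T>();
    }
    std::unique_ptr<T> obj = std::move(free_.back());
    free_.pop_back();
    return obj;
  }

  // Returns an object to the pool for later reuse.
  void release(std::unique_ptr<T> obj) { free_.push_back(std::move(obj)); }

  std::size_t created() const { return created_; }

 private:
  std::vector<std::unique_ptr<T>> free_;
  std::size_t created_ = 0;
};

inline void free_pool_example() {
  FreePool<FakeIterator> pool;
  auto it = pool.acquire();
  it->Seek(1);
  pool.release(std::move(it));
  auto again = pool.acquire();    // reuses the first iterator
  assert(pool.created() == 1);
  assert(again->seeks == 1);      // same object as before
}
```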

Putting it all together, the configuration is:

Options.max_background_flushes: 2
Options.max_background_compactions: 8
Options.avoid_flush_during_shutdown: 1
Options.compaction_readahead_size: 16384
ColumnFamilyOptions.comparator: leveldb.BytewiseComparator
ColumnFamilyOptions.table_factory: BlockBasedTable
BlockBasedTableOptions.checksum: kxxHash
BlockBasedTableOptions.block_size: 16384
BlockBasedTableOptions.filter_policy: rocksdb.BuiltinBloomFilter
BlockBasedTableOptions.whole_key_filtering: 0
BlockBasedTableOptions.format_version: 4
LRUCacheOptions.capacity: 8589934592
ColumnFamilyOptions.write_buffer_size: 134217728
ColumnFamilyOptions.compression[0]: NoCompression
ColumnFamilyOptions.compression[1]: NoCompression
ColumnFamilyOptions.compression[2]: LZ4
ColumnFamilyOptions.prefix_extractor: CustomPrefixExtractor
ColumnFamilyOptions.compression_opts.max_dict_bytes: 32768

ref

  1. https://rockset.com/blog/how-we-use-rocksdb-at-rockset/
  2. https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/
  3. https://rockset.com/blog/aggregator-leaf-tailer-an-architecture-for-live-analytics-on-event-streams/
  4. https://www.rockset.com/Rockset_Concepts_Design_Architecture.pdf
  5. https://rocksdb.org/blog/2015/07/23/dynamic-level.html
  6. http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf

Read More

pure virtual method called


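In a C++ program, "pure virtual method called" is occasionally printed.

Reference 2 has a scenario that reproduces it.

Look for that kind of scenario in your own code.

Surprising. I would have expected the compiler to catch this.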
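One common way to hit this message is an indirect virtual call made while the base class is still being constructed or is already being destroyed. Whether that matches the exact scenario in reference 2 is an assumption. The sketch below uses a non-pure virtual function so that it runs safely: it shows that a call made from the Base constructor dispatches to Base, not Derived. If name() were pure virtual in Base, the same call would have no implementation to land on and the runtime would abort with "pure virtual method called". The compiler cannot flag it because the call goes through a helper function.

```cpp
#include <cassert>
#include <string>

struct Base {
  // describe() calls the virtual name(). During Base's constructor the
  // dynamic type is still Base, so Base::name() runs.
  Base() { seen_in_ctor = describe(); }
  virtual ~Base() = default;

  virtual std::string name() const { return "Base"; }
  std::string describe() const { return name(); }

  std::string seen_in_ctor;
};

struct Derived : Base {
  std::string name() const override { return "Derived"; }
};

inline void virtual_in_ctor_example() {
  Derived d;
  assert(d.seen_in_ctor == "Base");    // Derived::name() was not reachable yet
  assert(d.describe() == "Derived");   // after construction, dispatch is normal
}
```

The same window exists in the destructor, and a second thread calling a virtual function on an object that is being destroyed can hit it as well, which would explain why the message only shows up occasionally.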

ref

  1. https://stackoverflow.com/questions/10707286/how-to-resolve-pure-virtual-method-called
  2. https://devblogs.microsoft.com/oldnewthing/20131011-00/?p=2953

Read More

Notes on Amazon EBS


ref

  1. https://blogs.nearsyh.me/2020/03/08/2020-03-08-Physalia/ (a good blog)
  2. https://zhuanlan.zhihu.com/p/109891109
  3. https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf
  4. https://cloud.tencent.com/developer/news/594892
  5. https://blog.acolyer.org/2020/03/04/millions-of-tiny-databases/

Read More

Notes on Calvin


ref

  1. https://nan01ab.github.io/2019/01/Calvin.html (this blog is all paper write-ups, and a good one)
  2. https://nan01ab.github.io/2020/01/SLOG.html
  3. http://zenlife.tk/calvin.md
  4. https://www.jianshu.com/p/43909447728f?hmsr=toutiao.io&utm_medium=toutiao.io&utm_source=toutiao.io
  5. https://www.jdon.com/48815
  6. http://itindex.net/detail/52968-calvin
  7. https://www.cnblogs.com/sunyongyue/p/yale_google_calvin_brief.html
  8. https://blog.acolyer.org/2019/03/29/calvin-fast-distributed-transactions-for-partitioned-database-systems/
  9. https://blog.acolyer.org/2019/09/04/slog/

Read More

Notes on Aurora


ref

  1. https://nan01ab.github.io/2017/06/Amazon-Aurora.html (this blog is all paper write-ups, and a good one)
  2. http://liuyangming.tech/05-2019/aurora.html (a good blog; see also http://liuyangming.tech/02-2020/myrocks.html)
  3. https://www.cnblogs.com/cchust/p/7476876.html
  4. http://mysql.taobao.org/monthly/2015/10/07/
  5. https://zhuanlan.zhihu.com/p/27872160
  6. https://blog.acolyer.org/2019/03/27/amazon-aurora-on-avoiding-distributed-consensus-for-i-os-commits-and-membership-changes/

Read More

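(Translation) Folly's static injection technique: how to legally access a private member function without changing the interface

Link to the original article

This code was found while studying folly; the source code is here

Premise: a private member function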

class Widget {
private:
  void forbidden();
};

Attempted access

void hijack(Widget& w) {
  w.forbidden();  // ERROR!
}
  In function 'void hijack(Widget&)':
  error: 'void Widget::forbidden()' is private
  within this context
        |     w.forbidden();
        |   

Approach

Member functions can be called through pointers!

For example:

class Calculator {
  float current_val = 0.f;
 public:
   void clear_value() { current_val = 0.f; };
   float value() const {
     return current_val;
   };

   void add(float x) { current_val += x; };
   void multiply(float x) { current_val *= x; };
};

using Operation = void (Calculator::*)(float);
Operation op1 = &Calculator::add;
Operation op2 = &Calculator::multiply;
Calculator calc{};
(calc.*op1)(123.0f); // Calls add
(calc.*op2)(10.0f);  // Calls multiply

A public function can hand out a pointer to the private function, which gets around the access check:

class Widget {
 public:
  static auto forbidden_fun() {
    return &Widget::forbidden;
  }
 private:
  void forbidden();
};

void hijack(Widget& w) {
  using ForbiddenFun = void (Widget::*)();
  ForbiddenFun const forbidden_fun = Widget::forbidden_fun();

  // Calls a private member function on the Widget
  // instance passed in to the function.
  (w.*forbidden_fun)();
}

But nobody designs an API like this in practice; it would be silly. So what can we do?

Get around it with template instantiation!

The C++17 standard contains the following paragraph (with the parts of interest to us marked in bold):

17.7.2 (item 12)

The usual access checking rules do not apply to names used to specify explicit instantiations. [Note: In particular, the template arguments and names used in the function declarator (including parameter types, return types and exception specifications) may be private types or objects which would normally not be accessible and the template may be a member template or member function which would not normally be accessible.]

The key point: explicit instantiation

The plan: pass the pointer to the private member function as a non-type template parameter (NTTP)

// The first template parameter is the type
// signature of the pointer-to-member-function.
// The second template parameter is the pointer
// itself.
template <
  typename ForbiddenFun,
  ForbiddenFun forbidden_fun
>
struct HijackImpl {
  static void apply(Widget& w) {
    // Calls a private method of Widget
    (w.*forbidden_fun)();
  }
};

// Explicit instantiation is allowed to refer to
// `Widget::forbidden` in a scope where it's not
// normally permissible.
template struct HijackImpl<
  decltype(&Widget::forbidden),
  &Widget::forbidden
>;

void hijack(Widget& w) {
  HijackImpl<decltype(&Widget::forbidden), &Widget::forbidden>::apply(w);
}

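This still fails to compile. It works in theory, but the compiler still reports that the member is private, because the use of HijackImpl inside hijack is not an explicit instantiation, so the normal access rules apply there.

Wrap the call in a friend function + explicit instantiation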

// HijackImpl is the mechanism for injecting the
// private member function pointer into the
// hijack function.
template <
  typename ForbiddenFun,
  ForbiddenFun forbidden_fun
>
class HijackImpl {
  // Definition of free function inside the class
  // template to give it access to the
  // forbidden_fun template argument.
  // Marking hijack as a friend prevents it from
  // becoming a member function.
  friend void hijack(Widget& w) {
    (w.*forbidden_fun)();
  }
};
// Declaration in the enclosing namespace to make
// hijack available for name lookup.
void hijack(Widget& w);

// Explicit instantiation of HijackImpl template
// bypasses access controls in the Widget class.
template class
HijackImpl<
  decltype(&Widget::forbidden),
  &Widget::forbidden
>;

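To summarize:

  • Explicit template instantiation exposes the private member function
  • The address of the member function is passed to HijackImpl as a template argument
  • hijack is defined inside HijackImpl and calls through the private member function pointer directly
  • hijack is declared as a friend, which makes it a free function rather than a member, so code outside HijackImpl can call it
  • Once the explicit instantiation exists, calling hijack just works

One last problem remains. If the implementation and the explicit instantiation both live in a header, then across all translation units (TUs) there may be only one explicit instantiation, otherwise the linker reports multiple-definition errors. How can that be guaranteed?

Folly's answer is to add a tag type in an anonymous namespace, so the symbol name differs in every TU. The final solution looks like this: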

namespace {
// This is a *different* type in every translation
// unit because of the anonymous namespace.
struct TranslationUnitTag {};
}

void hijack(Widget& w);

template <
  typename Tag,
  typename ForbiddenFun,
  ForbiddenFun forbidden_fun
>
class HijackImpl {
  friend void hijack(Widget& w) {
    (w.*forbidden_fun)();
  }
};

// Every translation unit gets its own unique
// explicit instantiation because of the
// guaranteed-unique tag parameter.
template class HijackImpl<
  TranslationUnitTag,
  decltype(&Widget::forbidden),
  &Widget::forbidden
>;
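For completeness, here is the final version assembled into a single compilable translation unit. The Widget body (a call counter and a public calls() accessor) is a hypothetical addition so that the effect can be observed; the rest follows the code above.

```cpp
#include <cassert>

// Hypothetical Widget: forbidden() is private and bumps a counter.
class Widget {
 public:
  int calls() const { return calls_; }

 private:
  void forbidden() { ++calls_; }
  int calls_ = 0;
};

namespace {
// A different type in every translation unit.
struct TranslationUnitTag {};
}  // namespace

// Namespace-scope declaration so ordinary name lookup can find hijack.
void hijack(Widget& w);

template <typename Tag, typename ForbiddenFun, ForbiddenFun forbidden_fun>
class HijackImpl {
  // Defined inside the class template so it can use forbidden_fun.
  friend void hijack(Widget& w) { (w.*forbidden_fun)(); }
};

// Access checks do not apply to names used in an explicit instantiation.
template class HijackImpl<TranslationUnitTag,
                          decltype(&Widget::forbidden),
                          &Widget::forbidden>;

inline void hijack_example() {
  Widget w;
  hijack(w);  // calls the private Widget::forbidden()
  assert(w.calls() == 1);
}
```

This sketch only exercises a single translation unit; it does not by itself demonstrate the multi-TU behaviour discussed above.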

References

  • 'The Power of Hidden Friends in C++', posted 25 June 2019: https://www.justsoftwaresolutions.co.uk/cplusplus/hidden-friends.html
  • Dan Saks, 'Making New Friends': https://www.youtube.com/watch?v=POa_V15je8Y
  • Johannes Schaub, 'Access to private members. That's easy!': http://bloglitb.blogspot.com/2011/12/access-to-private-members-safer.html
  • Johannes Schaub, 'Access to private members: Safer nastiness', posted 30 December 2011: http://bloglitb.blogspot.com/2011/12/access-to-private-members-safer.html
  • https://dfrib.github.io/a-foliage-of-folly/ (this article goes a step further; I plan to translate it next)
Read More

(Repost) Correctly implementing a spinlock in C++


https://rigtorp.se/spinlock/

Without further ado, here is the code:

#include <atomic>

struct alignas(64) spinlock {
  std::atomic<bool> lock_ = {false};

  void lock() noexcept {
    for (;;) {
      // Optimistically assume the lock is free on the first try
      if (!lock_.exchange(true, std::memory_order_acquire)) {
        return;
      }
      // Wait for lock to be released without generating cache misses
      while (lock_.load(std::memory_order_relaxed)) {
        // Issue X86 PAUSE or ARM YIELD instruction to reduce contention between
        // hyper-threads
        __builtin_ia32_pause();
      }
    }
  }

  bool try_lock() noexcept {
    // First do a relaxed load to check if lock is free in order to prevent
    // unnecessary cache misses if someone does while(!try_lock())
    return !lock_.load(std::memory_order_relaxed) &&
           !lock_.exchange(true, std::memory_order_acquire);
  }

  void unlock() noexcept {
    lock_.store(false, std::memory_order_release);
  }
};
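Because the type exposes lock() and unlock(), it can be used with std::lock_guard. The sketch below shows that usage with a compact stand-in class (SpinLock, a hypothetical name) that has the same interface but yields instead of issuing PAUSE, so it builds on any architecture. The helper count_with_lock is also an illustration, not part of the original article.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Compact stand-in with the same lock()/unlock() interface as the spinlock above.
class SpinLock {
 public:
  void lock() noexcept {
    for (;;) {
      // Try to take the lock; on failure spin on a plain load.
      if (!locked_.exchange(true, std::memory_order_acquire)) return;
      while (locked_.load(std::memory_order_relaxed)) std::this_thread::yield();
    }
  }
  void unlock() noexcept { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

// Runs `threads` workers that each increment a plain (non-atomic) counter
// `per_thread` times while holding the lock.
inline long count_with_lock(int threads, int per_thread) {
  SpinLock mu;
  long counter = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&] {
      for (int i = 0; i < per_thread; ++i) {
        std::lock_guard<SpinLock> guard(mu);  // lock() here, unlock() at scope exit
        ++counter;
      }
    });
  }
  for (auto& w : workers) w.join();
  return counter;
}

inline void spinlock_example() {
  // With correct mutual exclusion no increment is lost.
  assert(count_with_lock(4, 10000) == 40000);
}
```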

Ticket spinlocks

https://mfukar.github.io/2017/09/08/ticketspinlock.html

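#include <atomic>
#include <cstddef>

// Assumed values; tune them for the target platform.
constexpr std::size_t CACHELINE_SIZE = 64;
constexpr std::size_t BACKOFF_MIN = 4;

struct TicketSpinLock {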
    /**
     * Attempt to grab the lock:
     * 1. Get a ticket number
     * 2. Wait for it
     */
    void enter() {
        const auto ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);

        while (true) {
            const auto currently_serving = now_serving.load(std::memory_order_acquire);
            if (currently_serving == ticket) {
                break;
            }

            // Back off in proportion to our distance from the head of the queue.
            const size_t previous_ticket = ticket - currently_serving;
            size_t delay_slots = BACKOFF_MIN * previous_ticket;

            while (delay_slots--) {
                spin_wait();
            }
        }
    }
    static inline void spin_wait(void) {
    #if defined(__GNUC__) || defined(__clang__)
        /* volatile here prevents the asm block from being moved by the optimiser: */
        asm volatile("pause" ::: "memory");
    #elif defined(_MSC_VER)
        _mm_pause();  /* declared in <intrin.h> */
    #endif
    }

    /**
     * Since we're in the critical section, no one can modify `now_serving`
     * but this thread. We just want the update to be atomic. Therefore we can use
     * a simple store instead of `now_serving.fetch_add()`:
     */
    void leave() {
        const auto successor = now_serving.load(std::memory_order_relaxed) + 1;
        now_serving.store(successor, std::memory_order_release);
    }

    /* These are aligned on a cache line boundary in order to avoid false sharing: */
    alignas(CACHELINE_SIZE) std::atomic_size_t now_serving = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t next_ticket = {0};
};

static_assert(sizeof(TicketSpinLock) == 2*CACHELINE_SIZE,
    "TicketSpinLock members may not be aligned on a cache-line boundary");


Read More

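Two Jenkins problems I ran into


Damn Jenkins

I don't know what the platform team did to Jenkins; maybe they upgraded it. If a built-in CI is available, don't bother with third-party components. What a headache.

  • Garbled output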

image-20200422170106071

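It wasn't just this one command; even git rm produced garbled output. I assumed the script contained hidden invisible characters and spent ages editing it, to no avail.

Then I guessed the Chinese comments were the cause and removed them. Still no luck.

Finally I found reference 1: add this line at the top of the script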

export LANG="en_US.UTF-8"

  • Command not found

image-20200422170524986

PATH had been cleared. Defining PATH at the top of the script fixes it

export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"

ref

  1. https://blog.csdn.net/qq_35732831/article/details/85236562
  2. https://www.cnblogs.com/weifeng1463/p/9419358.html
  3. https://testerhome.com/topics/15136

Read More


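Common ASan error reports



Common ASan error reports. Compile with -fsanitize=address and link with -lasan (passing -fsanitize=address at the link step also works).

global-buffer-overflow: the length passed to memcmp may run past the buffer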

ERROR: AddressSanitizer: global-buffer-overflow on address 0x000000a8f8ff at pc 0x7ff6eafde870 bp 0x7ffc75471220 sp 0x7ffc754709d0
READ of size 49 at 0x000000a8f8ff thread T0
    #0 0x7ff6eafde86f in __interceptor_memcmp ../../../../gcc-5.4.0/libsanitizer/asan/asan_interceptors.cc:333

Watch the third argument to memcmp: use the smaller of the two buffer lengths.

Related concept: out-of-bounds (OOB) memory access
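A minimal sketch of a comparison that never reads past either buffer. The function name bounded_compare is hypothetical; it compares only the common length and then breaks ties on length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Three-way compare of two byte ranges. memcmp only sees min(a_len, b_len)
// bytes, so neither buffer is read out of bounds.
inline int bounded_compare(const char* a, std::size_t a_len,
                           const char* b, std::size_t b_len) {
  const std::size_t n = std::min(a_len, b_len);
  const int r = std::memcmp(a, b, n);
  if (r != 0) return r;
  if (a_len == b_len) return 0;
  return a_len < b_len ? -1 : 1;  // the shorter range sorts first
}

inline void bounded_compare_example() {
  const char shorter[] = "abc";
  const char longer[] = "abcdef";
  // Passing 6 as the length for both would read 3 bytes past `shorter`.
  assert(bounded_compare(shorter, 3, longer, 6) < 0);
  assert(bounded_compare(longer, 6, shorter, 3) > 0);
  assert(bounded_compare(shorter, 3, shorter, 3) == 0);
}
```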

heap-buffer-overflow: strlen reads past the end of the buffer

assert(n == strlen(val)); AddressSanitizer: heap-buffer-overflow

The string was probably allocated without room for the terminating '\0', so strlen runs past the end of the heap buffer.
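A minimal sketch of the fix, assuming the buffer is filled by copying n bytes from somewhere else: allocate n + 1 bytes and write the terminator explicitly. The helper name copy_as_c_string is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Copies `n` bytes from `src` into a fresh heap buffer and terminates it,
// so that strlen() on the result stays inside the allocation.
inline std::unique_ptr<char[]> copy_as_c_string(const char* src, std::size_t n) {
  std::unique_ptr<char[]> buf(new char[n + 1]);  // +1 for the terminator
  std::memcpy(buf.get(), src, n);
  buf[n] = '\0';
  return buf;
}

inline void c_string_example() {
  const auto val = copy_as_c_string("hello world", 5);
  assert(std::strlen(val.get()) == 5);
  assert(std::strcmp(val.get(), "hello") == 0);
}
```

Allocating exactly n bytes and skipping the terminator is the pattern that produces the heap-buffer-overflow report above.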

AddressSanitizer: attempting to call malloc_usable_size

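This error came from RocksDB. After some searching: the binary was built with jemalloc, which conflicts with ASan and RocksDB and triggers this report. Temporarily disable the check: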

ASAN_OPTIONS=check_malloc_usable_size=0

Rebuilding the binary without jemalloc fixed it.

AddressSanitizer: attempting to call malloc_usable_size() for pointer which is not owned: 0x7f121aed6000
    #0 0x7f121f506990 in __interceptor_malloc_usable_size ../../../../gcc-5.4.0/libsanitizer/asan/asan_malloc_linux.cc:104
    #1 0x8c7929 in rocksdb::Arena::AllocateNewBlock(unsigned long) util/arena.cc:221
    #2 0x8c79c4 in rocksdb::Arena::AllocateFallback(unsigned long, bool) util/arena.cc:114
    #3 0x8df67a in rocksdb::LogBuffer::AddLogToBuffer(unsigned long, char const*, __va_list_tag*) util/log_buffer.cc:24
    #4 0x8df8c8 in rocksdb::LogToBuffer(rocksdb::LogBuffer*, char const*, ...) util/log_buffer.cc:88
    #5 0x749300 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, rocksdb::LogBuffer*) db/db_impl_compaction_flush.cc:183
    #6 0x74c1f4 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*) db/db_impl_compaction_flush.cc:229
    #7 0x74d3b0 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*) db/db_impl_compaction_flush.cc:2025
    #8 0x74da4f in rocksdb::DBImpl::BackgroundCallFlush() db/db_impl_compaction_flush.cc:2059
    #9 0x8e8a27 in std::function<void ()>::operator()() const /usr/local/include/c++/5.4.0/functional:2267
    #10 0x8e8a27 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) util/threadpool_imp.cc:265
    #11 0x8e8c0e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) util/threadpool_imp.cc:303
    #12 0x7f121e1fb8ef in execute_native_thread_routine ../../../../../gcc-5.4.0/libstdc++-v3/src/c++11/thread.cc:84
    #13 0x7f121dd19dc4 in start_thread (/lib64/libpthread.so.0+0x7dc4)
    #14 0x7f121da477fc in __clone (/lib64/libc.so.6+0xf67fc)

AddressSanitizer can not describe address in more detail (wild memory access suspected).
SUMMARY: AddressSanitizer: bad-malloc_usable_size ../../../../gcc-5.4.0/libsanitizer/asan/asan_malloc_linux.cc:104 __interceptor_malloc_usable_size
Thread T2 created by T0 here:
    #0 0x7f121f4a80d4 in __interceptor_pthread_create ../../../../gcc-5.4.0/libsanitizer/asan/asan_interceptors.cc:179
    #1 0x7f121e1fba32 in __gthread_create /home/vdb/gcc-5.4-build/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:662
    #2 0x7f121e1fba32 in std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) ../../../../../gcc-5.4.0/libstdc++-v3/src/c++11/thread.cc:149

ref

  • A discussion recommending against memcmp, again out of concern about out-of-bounds reads: https://github.com/cesanta/mongoose/issues/564
  • https://github.com/pcrain/slippc/issues/16 (a global-buffer-overflow case)

Read More
