整体分类

  • Hash
    • Linear Hashing 线性哈希。可以动态扩容
    • 多维哈希 Multi-dimensional Hashing (mah)。可以做任意字段组合的查询
    • 一致性hash
  • 有序索引
    • B+Tree
      • B+Tree的一个节点就是一个Page, 一个Page内可以存多达500个值
      • 索引和数据的读取都以Page为单位
    • LSM-tree
    • 跳表
  • 基于签名的索引 Signature-based Indexing
    • 因为签名很小,所以可以放到单页里,根据签名过滤出数据可能出现在哪些页里进而减少需要检索的页
    • Superimposed Codewords (SIMC)
    • Concatenated Codewords (CATC)
  • 多维索引
    • kd-trees
    • Quad-trees
    • R-trees
  • 全文检索

Hash Indexes

  • 采用类似LSM-tree的内存存储最新的数据,超过阈值写入磁盘的设计
  • 磁盘存segments
  • 内存存每个segment对应的hash map, value是offset
  • 异步合并segments

LSM-tree (Log-Structured Merge-Tree)

  • used by LevelDB, RocksDB, DocDB
  • LSM tree has 2 primary components - Memtable and SSTs
  • this principle of merging and compacting sorted files are often called LSM storage engines.
  • Sorted String Table (SSTables)
    • 磁盘存segments, segment内key有序,内存存对应的sparse index
    • 最新的数据写入内存的memtable(可以是红黑树),超过阈值则写入到磁盘
    • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
    • WAL to restore the memtable after a crash.
    • Google’s Bigtable paper [9] (which introduced the terms SSTable and memtable).
  • Performance optimizations
    • bloom filters
    • compaction
      • size-tiered
        • write optimized
        • In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables.
        • 需要预留50%的磁盘空间,读取可能需要访问所有的segments
        • 实现:HBase, Cassandra
      • leveled
        • read optimized
        • In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,”
        • 每一层都是分片的,越往下的层分片越多。每一层的数据整体有序,所以读取单key的时候每一层最多只需要访问1个sgement。
        • L0 压缩到L1的时候,需要写入L1的所有segments.
        • L1, 往后就可以只访问下一层的部分segments, 从而减少了文件的读取量,并可以并发合并
        • 实现:LevelDB, RocksDB, Cassandra
        • https://cassandra.apache.org/doc/stable/cassandra/operating/compaction/lcs.html
    • 磁盘配置
      • HDD: one for the commit log and the other for everything else
      • SSD: 一个就可以

B-Trees

  • A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB
  • WAL used to restore the B-tree back to a con‐ sistent state
  • careful concurrency control is required
    • 可以用读写锁,写锁需要锁定所有需要修改的page后,再开始修改操作
    • 可以用Multiple-Granularity Locks尽量减少冲突

B-trees VS LSM-trees

  • LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads
  • LSM-trees 的并发控制相对简单得多
  • B-trees attractive in databases that want to offer strong transactional semantics
  • B-tree 性能更稳定
  • B-tree 在数据库使用广泛且成熟

Value的存储

  • clustered index
    • value存到索引里
    • in MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index, and
      • secondary indexes refer to the primary key (rather than a heap file location)
        • 这里location是会变化的,所以只能引用primary key
    • In SQL Server, you can specify one clustered index per table [32].
  • nonclustered index
    • value存到heap里,索引存value的引用
    • 这个方案在数据的存储管理上不灵活,比如压缩数据意味着需要跟新所有索引里的地址信息。而加一个映射层又会带来额外的开销。因此主流应该是clustered index方案?