Getting Started with Ceph, Part 3: An Overview of the CRUSH Algorithm

qiao645 2023-08-23 10:17:44 Category: K8S

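© Copyright belongs to the author. This is an original work by qiao645 on the 51CTO blog; please contact the author for permission to repost, otherwise legal liability will be pursued.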

I. Background

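When a client stores data in a Ceph cluster, the data goes through two mappings: object --> PG and PG --> OSD. The object --> PG step is computed by hashing the object name, while the PG --> OSD step uses the CRUSH algorithm. Ceph provides redundancy and automatic recovery from failures, and both depend on CRUSH. In one sentence, CRUSH takes a set of conditions as input and produces storage locations as output: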

Input: the object identifier (x), the CRUSH map, and the placement rule
Output: a set of OSDs
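The two mappings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib.crc32` and a plain modulo stand in for the hash that Ceph really uses, and the hypothetical `pg_to_osds` function replaces the CRUSH hierarchy walk with a simple hash ranking. None of these names come from Ceph itself. What the sketch does show is that the result depends only on the inputs, so any client can compute it locally.

```python
import zlib


def object_to_pg(object_name, pg_num):
    """Stage 1: hash the object name and reduce it to a PG number.

    Assumption: crc32 plus modulo is a stand-in for Ceph's real hash.
    """
    return zlib.crc32(object_name.encode("utf-8")) % pg_num


def pg_to_osds(pg_id, osds, replicas):
    """Stage 2: placeholder for CRUSH(x, crush map, placement rule).

    Assumption: ranking OSDs by a per-(PG, OSD) hash replaces the real
    hierarchy walk. The output is a deterministic, ordered list of OSDs.
    """
    ranked = sorted(
        osds,
        key=lambda osd: zlib.crc32(f"{pg_id}:{osd}".encode("utf-8")),
        reverse=True,
    )
    return ranked[:replicas]


def locate(object_name, pg_num, osds, replicas):
    """Run both stages and return (PG id, ordered OSD list)."""
    pg_id = object_to_pg(object_name, pg_num)
    return pg_id, pg_to_osds(pg_id, osds, replicas)


if __name__ == "__main__":
    pg_id, acting = locate("my-object", pg_num=128, osds=[0, 1, 2], replicas=3)
    print(f"PG {pg_id} -> OSDs {acting} (primary: osd.{acting[0]})")
```

The first OSD in the returned list plays the role of the primary OSD.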

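To store data, the client supplies the identifier of the object, obtains the CRUSH map and placement rule from a monitor, and runs the calculation locally to find the primary OSD, then connects to it. After receiving the data, the primary OSD replicates it to the replica OSDs itself. Because Ceph is strongly consistent, the primary OSD reports completion to the client only after the replica OSDs have acknowledged the write.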
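The write path just described can be modeled with a toy in-memory example. The `Osd` class and `client_write` function are hypothetical names for illustration, not Ceph APIs, and the sketch ignores networking, journaling, and failure handling. It only shows the ordering: the client gets its acknowledgement after every replica has confirmed.

```python
class Osd:
    """Hypothetical in-memory stand-in for an OSD daemon."""

    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def replica_write(self, name, data):
        """Store the object and acknowledge back to the primary."""
        self.store[name] = data
        return True

    def primary_write(self, name, data, replicas):
        """Store locally, forward to every replica, then acknowledge."""
        self.store[name] = data
        acks = [replica.replica_write(name, data) for replica in replicas]
        # Strong consistency: report success only if all replicas confirmed.
        return all(acks)


def client_write(name, data, acting_set):
    """The client talks to the primary only (first OSD in the set)."""
    primary, *replicas = acting_set
    return primary.primary_write(name, data, replicas)


if __name__ == "__main__":
    osds = [Osd(i) for i in range(3)]
    print("write acknowledged:", client_write("my-object", b"payload", osds))
```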

II. The hierarchical CRUSH map

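The CRUSH map is a subset of the cluster map. The two terms are sometimes used interchangeably, but that is inaccurate: the cluster map is the collection of maps maintained by the monitors, and it comprises five maps (listed below). This is why the monitors hold the metadata for the whole Ceph cluster.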

monitor map
PG map
crush map
OSD map
MDS map

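The CRUSH map is an inverted tree. The top-level node is called root, and below it you can define nodes such as datacenter, room, rack, and host; the default map defines 12 node types in total, counting osd and root. Each node type is called a bucket, and the OSDs at the bottom of the tree are the leaf nodes. A bucket can contain child buckets and leaves. Newer versions drop the term leaf and treat an OSD as a bucket too, except that an OSD bucket cannot contain child buckets, as the example figure shows: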

[Figure: example CRUSH map hierarchy]
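As a rough illustration of this tree, the sketch below models a root bucket with three host buckets and one OSD each. The `Node` class and its methods are hypothetical and exist only for this example; the names and weights (`default`, `ceph1` to `ceph3`, `osd.0` to `osd.2`, 0.049) are taken from the exported map shown later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the inverted tree: a bucket, or an OSD at the bottom."""

    name: str
    type_name: str
    weight: float = 0.0
    children: list = field(default_factory=list)

    def total_weight(self):
        """A bucket's weight is the sum of the weights beneath it."""
        if not self.children:
            return self.weight
        return sum(child.total_weight() for child in self.children)

    def leaves(self):
        """Return the names of all OSDs reachable from this node."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


def build_example():
    """root default -> host ceph1..ceph3 -> osd.0..osd.2."""
    hosts = [
        Node(f"ceph{i + 1}", "host", children=[Node(f"osd.{i}", "osd", 0.049)])
        for i in range(3)
    ]
    return Node("default", "root", children=hosts)


if __name__ == "__main__":
    root = build_example()
    print(root.leaves(), round(root.total_weight(), 3))
```

Picking a different `type_name` as the level at which replicas must differ is what gives different failure domains.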

III. Placement rules

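The previous section defined the hierarchy of a Ceph cluster. Different failure domains are obtained by choosing different bucket types. The hierarchy alone, however, cannot direct where data goes; the rules by which objects are placed are given by the placement rule. As a process, a placement rule has three stages: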

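take: specifies the entry point for the lookup; it does not have to be the root and can be any specific bucket
select: picks OSDs that meet the requirements; replicated pools and erasure-coded pools use different algorithms

Replicated pools: the firstn algorithm
Erasure-coded pools: the indep algorithm

emit: returns the selection result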
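The three stages can be sketched as follows. This is a simplified model, not Ceph's implementation: the `draw` function imitates the idea behind the straw2 bucket algorithm (each item gets a hashed draw scaled by its weight and the largest draw wins), but it uses SHA-256 instead of the rjenkins hash, and `CRUSH_MAP`, `choose`, and `run_rule` are hypothetical names. The retry limit of 50 is borrowed from the `choose_total_tries` tunable in the exported map.

```python
import hashlib
import math

# Hypothetical flattened map: bucket name -> {item name: weight}.
CRUSH_MAP = {
    "default": {"ceph1": 0.049, "ceph2": 0.049, "ceph3": 0.049},
    "ceph1": {"osd.0": 0.049},
    "ceph2": {"osd.1": 0.049},
    "ceph3": {"osd.2": 0.049},
}


def draw(x, item, r, weight):
    """straw2-style draw: ln(u) / weight with u in (0, 1]; largest wins."""
    digest = hashlib.sha256(f"{x}:{item}:{r}".encode("utf-8")).digest()
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
    return math.log(u) / weight


def choose(bucket, x, r):
    """Pick one item from a bucket for input x and attempt number r."""
    items = CRUSH_MAP[bucket]
    return max(items, key=lambda item: draw(x, item, r, items[item]))


def run_rule(x, root, num_rep, max_tries=50):
    """take -> select (one OSD under each of num_rep distinct hosts) -> emit."""
    start = root  # take: the entry point of the walk
    hosts, osds = [], []
    r = 0
    while len(osds) < num_rep and r < max_tries:  # select
        host = choose(start, x, r)
        r += 1
        if host in hosts:
            continue  # collision: retry with the next r
        hosts.append(host)
        osds.append(choose(host, x, 0))
    return osds  # emit: hand back the result


if __name__ == "__main__":
    print(run_rule(x=1234, root="default", num_rep=3))
```

Because every draw is a pure function of the input, the item, and the attempt number, two clients holding the same map arrive at the same OSD list without talking to each other.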

Below is the CRUSH map exported from a Ceph cluster; reading it alongside the description above makes the concepts easier to follow.

[root@ceph1 ceph]# ceph osd getcrushmap -o /tmp/crushmap.bin
[root@ceph1 ceph]# crushtool -d /tmp/crushmap.bin -o crushmap.txt
[root@ceph1 ceph]# cat crushmap.txt

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host ceph1 {
        id -3           # do not change unnecessarily
        id -4 class hdd         # do not change unnecessarily
        # weight 0.049
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.049
}
host ceph2 {
        id -5           # do not change unnecessarily
        id -6 class hdd         # do not change unnecessarily
        # weight 0.049
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 0.049
}
host ceph3 {
        id -7           # do not change unnecessarily
        id -8 class hdd         # do not change unnecessarily
        # weight 0.049
        alg straw2
        hash 0  # rjenkins1
        item osd.2 weight 0.049
}
root default {
        id -1           # do not change unnecessarily
        id -2 class hdd         # do not change unnecessarily
        # weight 0.146
        alg straw2
        hash 0  # rjenkins1
        item ceph1 weight 0.049
        item ceph2 weight 0.049
        item ceph3 weight 0.049
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

rule erasure-code {
        id 1
        type erasure
        min_size 3
        max_size 4
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}

# end crush map
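The two rules above differ mainly in `firstn` versus `indep` (the `0` after either keyword means "as many as the pool needs"). The sketch below illustrates the behavioral difference when an OSD is unavailable. It is a simplification under stated assumptions: real CRUSH repeats hashed draws with different attempt numbers, whereas here a fixed, hypothetical candidate list of five OSDs stands in for those draws, and `select_firstn` and `select_indep` are illustrative names.

```python
def select_firstn(candidates, num, down):
    """firstn: skip unusable items, so later picks shift forward."""
    result = []
    for item in candidates:
        if item in down or item in result:
            continue
        result.append(item)
        if len(result) == num:
            break
    return result


def select_indep(candidates, num, down):
    """indep: each position is handled on its own and keeps its item."""
    result = [None] * num
    for pos in range(num):
        if candidates[pos] not in down:
            result[pos] = candidates[pos]
    # Fill only the failed positions from the remaining candidates.
    spare = [c for c in candidates[num:] if c not in down]
    for pos in range(num):
        if result[pos] is None and spare:
            result[pos] = spare.pop(0)
    return result


if __name__ == "__main__":
    candidates = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
    down = {"osd.1"}
    print(select_firstn(candidates, 3, down))  # ['osd.0', 'osd.2', 'osd.3']
    print(select_indep(candidates, 3, down))   # ['osd.0', 'osd.3', 'osd.2']
```

With `firstn`, `osd.2` moves from the third slot to the second, which is harmless for replicas because every copy is identical. With `indep`, `osd.2` stays in its slot and only the failed position changes, which suits erasure-coded pools where each position holds a different chunk.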