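What cluster management has to handle:
- Adding a node. The new node announces "I join the group" to everyone. This redistributes part of the hash space, so data has to be transferred (bootstrap). When does the new node start responding to requests? Once all group members hold a consistent view. What happens to reads and writes issued while some nodes have updated their member view and others have not?
- Removing a node (crash). In principle every piece of data has N replicas; when one machine goes down, the next machine has to be found to hold the replica.
- Restarting a node. A restart must not cause a rebalancing of the partitions.
- Heartbeats between nodes: detecting the state of each node.
- Data consistency between nodes: how nodes that hold replicas of the same data guarantee eventual consistency.
- Consistency of the node view: how to make sure all nodes hold the same member view, for example how to notify every node of a member join or leave at the lowest network cost.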
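Principles
The first thing to solve is membership maintenance. Without an accurate membership view, nothing else works.
If we set out to solve this problem ourselves, a simple approach is:
- When a node starts up, it sends a multicast or a UDP broadcast. Existing nodes receive it and update their own member views.
- The existing node with the smallest address (the master node) sends the member list to the new node.
- Every node maintains a heartbeat with every other node; if a node dies, it is removed from the list.
If the cluster has several hundred nodes, every node has to maintain several hundred heartbeat connections and periodically send several hundred messages. Each heartbeat round costs O(n^2) messages, which is a huge network overhead.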
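To make the O(n^2) cost concrete, here is a minimal sketch that counts messages per round for the all-to-all heartbeat scheme and for a scheme in which every node contacts only a fixed number of peers. The class and method names are hypothetical and only illustrate the arithmetic.

```java
// Hypothetical helper for illustration; not part of Cassandra.
final class HeartbeatCost {
    /** All-to-all heartbeat: every node pings every other node once per round. */
    static long allToAllMessages(long nodes) {
        return nodes * (nodes - 1);
    }

    /** Gossip-style round: every node contacts a small, fixed number of peers. */
    static long gossipMessages(long nodes, long peersPerRound) {
        return nodes * peersPerRound;
    }

    public static void main(String[] args) {
        long n = 300;
        System.out.println("all-to-all: " + allToAllMessages(n));   // 89700
        System.out.println("one peer per round: " + gossipMessages(n, 1)); // 300
    }
}
```

With 300 nodes, the all-to-all scheme sends 89,700 messages per round, while one peer per node sends 300.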
Gossip Protocols
The gossip protocol used in Cassandra
A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership. Each node contacts a peer chosen at random every second and two nodes efficiently reconcile their persisted membership change histories.
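A student at the University of Illinois prepared a set of slides (PPT) for an Advanced Operating Systems course
An introduction by 若海, a developer at Taobao: Gossip简介 (Introduction to Gossip)
The origin is a very highly cited paper: Epidemic algorithms for replicated database maintenance
Slides introducing the paper, also from the University of Illinois: Epidemics
Theoretical basis: the epidemic model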
Single infected site eventually infects entire population of susceptible sites
In database replication, infected site is the one with the latest update, susceptible sites are those needing the update
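A key point of gossip is that each node only needs to periodically synchronize its member view with one other node in the cluster (chosen at random each time), and the member views of all nodes in the cluster will eventually agree. According to epidemic theory, in which an update spreads the way an infectious disease such as Ebola does, once a single node is infected (the new node exchanges state with that node), all nodes eventually become infected.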
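The epidemic idea can be tried out with a small simulation. The sketch below is a simplified push-only model, not Cassandra's implementation: in every round each informed node tells one randomly chosen peer, and the method counts the rounds until every node is informed. The class name, method name, and seed parameter are assumptions made for this illustration.

```java
import java.util.Random;

// Simplified push-only epidemic model; hypothetical names, not Cassandra code.
final class GossipSpread {
    /**
     * Returns the number of rounds until all n nodes are informed, when in
     * each round every informed node informs one peer chosen uniformly at
     * random among the other nodes.
     */
    static int roundsUntilAllInformed(int n, long seed) {
        Random random = new Random(seed);
        boolean[] informed = new boolean[n];
        informed[0] = true;              // node 0 is the first "infected" site
        int informedCount = 1;
        int rounds = 0;
        while (informedCount < n) {
            // Nodes informed during this round start gossiping next round.
            boolean[] next = informed.clone();
            for (int node = 0; node < n; node++) {
                if (!informed[node]) {
                    continue;
                }
                int peer = random.nextInt(n - 1);
                if (peer >= node) {
                    peer++;              // never pick ourselves
                }
                if (!next[peer]) {
                    next[peer] = true;
                    informedCount++;
                }
            }
            informed = next;
            rounds++;
        }
        return rounds;
    }
}
```

Calling `GossipSpread.roundsUntilAllInformed(1000, 42L)` takes at least 10 rounds, because the informed set can at most double per round, and in practice it stays far below the 1000 rounds a one-at-a-time notification would need.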
Anti-Entropy
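Another important term that comes up when discussing gossip is anti-entropy. A Cassandra article translated by dbthink discusses anti-entropy and gossip separately (which is how Cassandra implements them). In the paper, however, anti-entropy is one form of gossip.
Terminology: in information theory, entropy is a quantitative measure of the amount of information, or in other words a measure of the uncertainty of a random variable. The greater the uncertainty of the variable, the greater the entropy, and the more information is needed to pin it down (Baidu Baike). As I understand it, anti-entropy is the process of turning uncertainty into certainty.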
Every site regularly chooses another site at random and by exchanging database contents with it resolves any differences between the two.
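Here anti-entropy is treated as a method (procedure) in programming terms. Its pseudocode is as follows:
for some (site s in Sites)  // note: "some", not "all"; a node does not sync with every other node, which is the core of gossip
    resolveDifferences(localDB, s);
Depending on how resolveDifferences works, anti-entropy takes one of the following three forms: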
pull
    if (localItem.timeStamp < i.timeStamp)
        localItem.value = i.value;  // pull: fetch the update from the other node, updating the local copy
push
    if (localItem.timeStamp > i.timeStamp)
        i.value = localItem.value;  // push: send the local update to the other node, updating the remote copy
pull-push
    if (localItem.timeStamp < i.timeStamp)
        localItem.value = i.value;  // pull
    else if (localItem.timeStamp > i.timeStamp)
        i.value = localItem.value;  // push
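The three forms can be written as one small Java method that takes the mode as a parameter. This is only a sketch of the pseudocode: the Item and Mode types are hypothetical, and copying the timestamp together with the value is an assumption the pseudocode leaves implicit.

```java
// Sketch of the pull / push / pull-push pseudocode; all names are hypothetical.
final class AntiEntropy {
    /** A replicated value tagged with the time of its last update. */
    static final class Item {
        String value;
        long timeStamp;

        Item(String value, long timeStamp) {
            this.value = value;
            this.timeStamp = timeStamp;
        }

        /** Assumption: the timestamp travels with the value. */
        void copyFrom(Item other) {
            this.value = other.value;
            this.timeStamp = other.timeStamp;
        }
    }

    enum Mode { PULL, PUSH, PULL_PUSH }

    static void resolveDifferences(Item local, Item remote, Mode mode) {
        if (local.timeStamp < remote.timeStamp) {
            if (mode != Mode.PUSH) {
                local.copyFrom(remote);   // pull: remote is newer, update local
            }
        } else if (local.timeStamp > remote.timeStamp) {
            if (mode != Mode.PULL) {
                remote.copyFrom(local);   // push: local is newer, update remote
            }
        }
    }
}
```

For example, with a local item at timestamp 1 and a remote item at timestamp 2, PUSH leaves both untouched, while PULL and PULL_PUSH overwrite the local item.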
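On the surface, both pull and push eventually propagate an update from one node to all nodes, but a probabilistic analysis shows that they spread at different speeds: pull converges faster than push, that is, the update reaches all nodes sooner, especially in the late phase when only a few nodes still lack it (see the Epidemic algorithms paper for the detailed analysis). If propagation takes log(n) rounds, then spreading one update costs about n * log(n) communications, since every node communicates once per round.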
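The difference in convergence speed can be checked numerically. The Epidemic algorithms paper gives recurrences for p, the probability that a site is still susceptible after a round: p' = p^2 for pull and p' = p * (1 - 1/n)^(n * (1 - p)) for push. The sketch below simply iterates both; the class and method names are hypothetical.

```java
// Iterates the pull and push recurrences from the epidemic analysis.
// Hypothetical helper for illustration only.
final class ConvergenceRecurrence {
    /** Pull: a site stays susceptible only if it contacts another susceptible site. */
    static double pullStep(double p) {
        return p * p;
    }

    /** Push: a site stays susceptible only if no infected site picks it. */
    static double pushStep(double p, int n) {
        return p * Math.pow(1.0 - 1.0 / n, n * (1.0 - p));
    }

    /** Probability of still being susceptible after the given number of rounds. */
    static double iterate(boolean pull, double p0, int n, int rounds) {
        double p = p0;
        for (int i = 0; i < rounds; i++) {
            p = pull ? pullStep(p) : pushStep(p, n);
        }
        return p;
    }
}
```

Starting from p = 0.1 with n = 1000, pull squares the probability every round, while push only shrinks it by a factor of roughly 1/e per round once p is small.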
Cassandra Implementation
EndPointState
Node state is held in an EndPointState. There is one EndPointState per (known) node, stored in Gossiper.endPointStateMap_.
EndPointState
- updateTimestamp
- isAlive
- isAGossiper
- hasToken
- HeartBeatState
    - generation
    - version
- ApplicationState
    - version MOVE_STATE
        NORMAL,Token(Serial) // initServer
        BOOT,Token(Serial) // startBootstrap
        NORMAL,Token(Serial) // finishBootStrapping
        LEAVING,Token // startLeaving
        LEFT,left,Token // leaveRing
        LEFT,remove,Token // removeToken
    - version LOAD-INFORMATION
        diskUsage
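Every field obviously has its purpose, but the details are intricate and are not explored here. Only a few fields are explained:
generation: set to the current time (in seconds) when the system starts (StorageService.initServer -> Gossiper.start)
version: incremented by 1 on every application state change and by 1 on every heartbeat message
ApplicationState currently contains two entries. One records whether the system is currently in the normal or the boot state (while booting, a node does not respond to read/write requests; see how TokenMetadata maintains and uses all tokens. The other states, left and leaving, are rarely used, for example when a node is forced out of the ring). The other is the system's current load information (disk usage), used by the load balancer (to be analyzed separately). Likewise, each state has its own version.
generation + version (max) form the ordering key for two states of the same node, where version is the largest of the heartbeat, MOVE_STATE, and LOAD-INFORMATION versions. These two fields are enough to build a GossipDigest object. When nodes synchronize state, they do not send all state information to the peer for comparison; instead each EndPointState is reduced to a GossipDigest, which saves transmitted data (?).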
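The ordering described here can be sketched as a small comparable value class: compare by generation first, and by the maximum version only when the generations are equal. DigestSketch and its factory method are hypothetical names for this illustration, not Cassandra's actual GossipDigest class.

```java
// Illustrative sketch only; hypothetical class, not Cassandra's GossipDigest.
final class DigestSketch implements Comparable<DigestSketch> {
    final String endPoint;
    final int generation;
    final int maxVersion;

    DigestSketch(String endPoint, int generation, int maxVersion) {
        this.endPoint = endPoint;
        this.generation = generation;
        this.maxVersion = maxVersion;
    }

    /** Builds a digest whose version is the largest of the three state versions. */
    static DigestSketch of(String endPoint, int generation, int heartbeatVersion,
                           int moveStateVersion, int loadInfoVersion) {
        int max = Math.max(heartbeatVersion, Math.max(moveStateVersion, loadInfoVersion));
        return new DigestSketch(endPoint, generation, max);
    }

    /** A later generation (restart) always wins; otherwise the higher version wins. */
    @Override
    public int compareTo(DigestSketch other) {
        if (generation != other.generation) {
            return Integer.compare(generation, other.generation);
        }
        return Integer.compare(maxVersion, other.maxVersion);
    }
}
```

A digest is only an endpoint plus two integers, so comparing digests is enough to decide which side holds the newer state without shipping the state itself.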
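Member status sync 1: (A -> B GossipDigestSynMessage)
As in the anti-entropy described above, a node picks a random (live) node and sends it all of its GossipDigests (one GossipDigest per known node). In addition, with a certain probability it picks a random unreachable endpoint and sends it the sync message (if that endpoint responds, it is live again). Finally, if it has not already gossiped to a seed, or if there are fewer live endpoints than seeds, it also syncs with a randomly chosen seed node. The GossipDigests are wrapped in a GossipDigestSynMessage.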
// Gossiper.GossipTimerTask
Message message = makeGossipDigestSynMessage(gDigests);
/* Gossip to some random live member */
boolean gossipedToSeed = doGossipToLiveMember(message);
doGossipToUnreachableMember(message);
if (!gossipedToSeed || liveEndpoints_.size() < seeds_.size())
    doGossipToSeed(message);
doStatusCheck();
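Member status sync 2: (B -> A GossipDigestAckMessage)
When the peer receives the syn message, it handles it in GossipDigestSynVerbHandler. This is the resolveDifferences step: based on the GossipDigests it works out which states to push (local is newer than remote) and which to pull (remote is newer than local), and sends the EndPointStates to push and the GossipDigests to pull back to the initiator in a GossipDigestAckMessage. One trick here is the inclusion of the heartbeat version (why?).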
If the max remote version is greater then we request the remote endpoint send us all the data for this endpoint with version greater than the max version number we have locally for this endpoint. If the max remote version is lesser, then we send all the data we have locally for this endpoint with version greater than the max remote version.
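Suppose the local node holds versions (1,2,10) for some endpoint, where 10 is the heartbeat version, which keeps increasing, and a remote node holds versions (1,2,20) for the same endpoint. The local node then asks the remote node to send the states with version > 10, and the remote node sends only the HeartBeatState, because only the heartbeat version is > 10. The local node then updates its versions for that endpoint to (1,2,20).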
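The (1,2,10) versus (1,2,20) example can be reproduced with a small filter that keeps only the states whose version exceeds the requested one. This is a sketch under simplifying assumptions: states are reduced to a name-to-version map, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "send everything with version greater than X"; hypothetical names.
final class VersionDelta {
    /** Returns the entries whose version is strictly greater than the given version. */
    static Map<String, Integer> newerThan(Map<String, Integer> stateVersions, int version) {
        Map<String, Integer> delta = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : stateVersions.entrySet()) {
            if (entry.getValue() > version) {
                delta.put(entry.getKey(), entry.getValue());
            }
        }
        return delta;
    }

    /** The remote side of the example: versions (1, 2, 20). */
    static Map<String, Integer> exampleRemoteState() {
        Map<String, Integer> remote = new LinkedHashMap<>();
        remote.put("MOVE_STATE", 1);
        remote.put("LOAD-INFORMATION", 2);
        remote.put("HeartBeatState", 20);
        return remote;
    }
}
```

Here `VersionDelta.newerThan(VersionDelta.exampleRemoteState(), 10)` contains only the HeartBeatState entry, so the unchanged application states are not retransmitted.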
// Gossiper.examineGossiper
synchronized void examineGossiper(List<GossipDigest> gDigestList, List<GossipDigest> deltaGossipDigestList, Map<InetAddress, EndPointState> deltaEpStateMap)
{
    for ( GossipDigest gDigest : gDigestList )
    {
        int remoteGeneration = gDigest.getGeneration();
        int maxRemoteVersion = gDigest.getMaxVersion();
        /* Get state associated with the end point in digest */
        EndPointState epStatePtr = endPointStateMap_.get(gDigest.getEndPoint());
        /*
            Here we need to fire a GossipDigestAckMessage. If we have some data associated with this endpoint locally
            then we follow the "if" path of the logic. If we have absolutely nothing for this endpoint we need to
            request all the data for this endpoint.
        */
        if ( epStatePtr != null )
        {
            int localGeneration = epStatePtr.getHeartBeatState().getGeneration();
            /* get the max version of all keys in the state associated with this endpoint */
            int maxLocalVersion = getMaxEndPointStateVersion(epStatePtr);
            if ( remoteGeneration == localGeneration && maxRemoteVersion == maxLocalVersion )
                continue;

            if ( remoteGeneration > localGeneration )
            {
                /* we request everything from the gossiper */
                requestAll(gDigest, deltaGossipDigestList, remoteGeneration);
            }
            if ( remoteGeneration < localGeneration )
            {
                /* send all data with generation = localgeneration and version > 0 */
                sendAll(gDigest, deltaEpStateMap, 0);
            }
            if ( remoteGeneration == localGeneration )
            {
                /*
                    If the max remote version is greater then we request the remote endpoint send us all the data
                    for this endpoint with version greater than the max version number we have locally for this
                    endpoint.
                    If the max remote version is lesser, then we send all the data we have locally for this endpoint
                    with version greater than the max remote version.
                */
                if ( maxRemoteVersion > maxLocalVersion )
                {
                    deltaGossipDigestList.add( new GossipDigest(gDigest.getEndPoint(), remoteGeneration, maxLocalVersion) );
                }
                if ( maxRemoteVersion < maxLocalVersion )
                {
                    /* send all data with generation = localgeneration and version > maxRemoteVersion */
                    sendAll(gDigest, deltaEpStateMap, maxRemoteVersion);
                }
            }
        }
        else
        {
            /* We are here since we have no data for this endpoint locally so request everything. */
            requestAll(gDigest, deltaGossipDigestList, remoteGeneration);
        }
    }
}
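Member status sync 3: (A -> B GossipDigestAck2Message)
When the initiator receives the ack message, it handles it in GossipDigestAckVerbHandler. It applies the pushed states locally via applyStateLocally, and wraps the states the peer asked to pull in a GossipDigestAck2Message that it sends to the peer.
Member status sync 4
When the peer receives the ack2 message, it handles it in GossipDigestAck2VerbHandler, applying the pulled states locally via applyStateLocally.
applyStateLocally applies the HeartBeatState and ApplicationState locally and notifies the IEndPointStateChangeSubscriber.onChange subscribers (StorageService and StorageLoadBalancer; the former updates the node's token, the latter updates the node's load).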
// Gossiper.applyStateLocally (excerpt)
markAlive(ep, localEpStatePtr); // live
applyHeartBeatStateLocally(ep, localEpStatePtr, remoteState);
/* apply ApplicationState */
applyApplicationStateLocally(ep, localEpStatePtr, remoteState);
handleNewJoin(ep, remoteState);
FailureDetector
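The basic principle: if now - last_heart_time is far greater than the average interval between past heartbeats (about 18 times, according to the formula in the code), the node is declared dead and the IFailureDetectionEventListener is notified (Gossiper.convict).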
// FailureDetector.ArrivalWindow
synchronized void add(double value)
{
    double interArrivalTime;
    if ( tLast_ > 0L )
    {
        interArrivalTime = (value - tLast_);
    }
    else
    {
        interArrivalTime = Gossiper.intervalInMillis_ / 2;
    }
    tLast_ = value;
    arrivalIntervals_.add(interArrivalTime);
}

double p(double t)
{
    double mean = mean();
    double exponent = (-1)*(t)/mean;
    /* 1 - (1 - e^exponent) is simply e^(-t/mean) */
    return 1 - ( 1 - Math.pow(Math.E, exponent) );
}

double phi(long tnow)
{
    int size = arrivalIntervals_.size();
    double log = 0d;
    if ( size > 0 )
    {
        double t = tnow - tLast_;
        double probability = p(t);
        log = (-1) * Math.log10( probability );
    }
    return log;
}

// FailureDetector.interpret
if ( phi > phiConvictThreshold_ )
{
    for ( IFailureDetectionEventListener listener : fdEvntListeners_ )
    {
        listener.convict(ep);
    }
}
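The factor of about 18 follows directly from the formula: phi = -log10(e^(-t/mean)) = t / (mean * ln 10), so phi exceeds a threshold T once t > T * ln(10) * mean. The sketch below reproduces that arithmetic. It assumes a phiConvictThreshold_ of 8, which is the value the factor of roughly 18 implies; the class and method names are hypothetical.

```java
// Reproduces the phi arithmetic of the excerpt above; hypothetical helper names.
final class PhiSketch {
    /** phi = -log10(P(t)), with P(t) = e^(-t / mean) as in ArrivalWindow.p. */
    static double phi(double millisSinceLastHeartbeat, double meanIntervalMillis) {
        double probability = Math.exp(-millisSinceLastHeartbeat / meanIntervalMillis);
        return -Math.log10(probability);
    }

    /** Silence, as a multiple of the mean interval, at which phi reaches the threshold. */
    static double convictMultiple(double phiThreshold) {
        return phiThreshold * Math.log(10);
    }

    public static void main(String[] args) {
        // Assumed threshold of 8: prints a multiple of roughly 18.4.
        System.out.println(convictMultiple(8));
    }
}
```

With a mean heartbeat interval of 1000 ms and the assumed threshold of 8, a node would be convicted after roughly 18.4 seconds of silence.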
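Member view updates on node startup / restart / crash
On startup, the node syncs with a seed node. The seed node pushes the member view to the new node and pulls the new node's information into its own view. On the next gossip round, both the seed node and the new node may propagate the new node's information further.
Restart...
Crash...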