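An RAC cluster uses three kinds of heartbeat: network, disk, and local. The disk heartbeat involves two kinds of files: the OCR (Oracle Cluster Registry) and the VF (Voting File). The OCR acts as the cluster's control file and addresses the amnesia problem; the VF addresses the split-brain problem.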
Heartbeats
A heartbeat is a polling mechanism that clustered platforms use to verify that the other servers participating in the cluster are alive. Oracle uses the same mechanism to check the health of the other nodes in the cluster, so that each server knows the state of its peers and can take appropriate action should polling fail. In RAC, CSS (Cluster Synchronization Services) polls in three different ways:
• Network Heartbeat (NHB)
• Disk Heartbeat (DHB)
• Local Heartbeat (LHB)
Network Heartbeat (NHB)
The NHB is sent over the private interconnect. CSS sends an NHB every second from each node to all the other nodes in the cluster, and receives an NHB from each remote node every second in the same way. The NHB carries timestamp information from the local node, which the remote node uses to determine when the NHB was sent. If an acknowledgment is not received from another node in the cluster within 30 seconds (the misscount value), CSS requests a cluster reconfiguration. The reconfiguration is not always carried out, however: CSS verifies the health and state of the node through other methods before deciding to reconfigure.
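The misscount check described above can be sketched in a few lines of Python. This is only an illustration of the timing rule, not how CSS is implemented: the function name `nodes_missing_nhb`, the node names, and the dictionary of arrival times are all hypothetical.

```python
import time

MISSCOUNT_SECONDS = 30  # default misscount quoted in the text


def nodes_missing_nhb(last_nhb_seen, now, misscount=MISSCOUNT_SECONDS):
    """Return the nodes whose most recent NHB is at least `misscount` seconds old.

    last_nhb_seen maps a node name to the time (in seconds) at which its
    last network heartbeat arrived. This is a hypothetical structure.
    """
    return sorted(
        node
        for node, seen_at in last_nhb_seen.items()
        if now - seen_at >= misscount
    )


if __name__ == "__main__":
    now = time.time()
    # Hypothetical node names and arrival times, for illustration only.
    last_nhb_seen = {"node1": now - 1, "node2": now - 31, "node3": now - 12}
    print(nodes_missing_nhb(last_nhb_seen, now))
```

A node returned by this check is only a candidate for eviction. As the text notes, CSS consults other evidence before it actually reconfigures the cluster.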
Disk Heartbeat (DHB)
Apart from the NHB, CSS uses the DHB, which is required for split-brain resolution. With the DHB, each server in the cluster writes a timestamp to the voting disk every second. The DHB contains the local time in Unix epoch seconds as well as a millisecond timer, and it is the definitive mechanism for deciding whether a node is still alive. When the NHB fails, CSS checks the voting disk to see whether the node in question wrote any timestamp during the period in which its NHBs were missed, and uses that to decide whether a cluster reconfiguration is required.
Unlike the NHB, the DHB is driven by two parameters: a "long disk I/O" timeout (LIOT) and a "short disk I/O" timeout (SIOT). When DHB beats are missing for too long, the node is assumed to be dead. When connectivity to the disk is lost for too long, the disk is considered offline. By default, the LIOT (the disktimeout setting) is 200 seconds and the SIOT is 27 seconds. LIOT is used to determine the disk write latency, and SIOT is used by CSS during cluster reconfiguration. SIOT is similar to the misscount value used by the NHB, but it is computed as misscount less reboottime (the default reboottime is 3 seconds). The disktimeout, reboottime, and misscount values can be retrieved with the following crsctl commands:
[root@ssky1l4p1 orarootagent_root]# crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
[root@ssky1l4p1 orarootagent_root]# crsctl get css reboottime
CRS-4678: Successful get reboottime 3 for Cluster Synchronization Services.
[root@ssky1l4p1 orarootagent_root]# crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
Based on the default values shown above, the SIOT is 27 seconds (misscount less reboottime, that is, 30 - 3 = 27).
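The same arithmetic can be reproduced from the command output. The following Python sketch parses the CRS-4678 lines shown above and derives the SIOT from them. The helper names `parse_css_settings` and `short_io_timeout` are hypothetical, and the regular expression assumes the message format matches the lines in this session exactly.

```python
import re

# Output lines copied from the crsctl session above.
CRSCTL_OUTPUT = """\
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
CRS-4678: Successful get reboottime 3 for Cluster Synchronization Services.
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
"""

# Captures the setting name and its integer value from each CRS-4678 line.
_SETTING = re.compile(r"Successful get (\w+) (\d+) for")


def parse_css_settings(text):
    """Return a dict such as {"misscount": 30} built from crsctl output text."""
    return {name: int(value) for name, value in _SETTING.findall(text)}


def short_io_timeout(settings):
    """SIOT as described in the text: misscount less reboottime."""
    return settings["misscount"] - settings["reboottime"]


if __name__ == "__main__":
    settings = parse_css_settings(CRSCTL_OUTPUT)
    print(settings)                    # {'disktimeout': 200, 'reboottime': 3, 'misscount': 30}
    print(short_io_timeout(settings))  # 27
```

Parsing the captured text rather than invoking crsctl keeps the sketch runnable on a machine without Grid Infrastructure installed.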
Local Heartbeat (LHB)
The LHB is an internal heartbeat in which a message is sent to the cssdmonitor and the cssdagent to keep them informed about the health of CSS. LHB notifications are also sent every second, and they share the same thread as the NHB and DHB.