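Preface

To keep a system highly available, critical services are often deployed as a two-node hot-standby pair. This article uses a simple Web service as the example.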

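Preparation

Parameter          Description
192.168.139.87     Primary node (node1)
192.168.139.88     Standby node (node2)
192.168.139.118    Virtual IP (VIP)

The example works as follows: under normal conditions 192.168.139.87 provides the service, and clients reach the cluster's resources through the VIP held by the primary node. When the primary node fails, the standby node automatically takes over the primary node's IP resource, that is, the VIP 192.168.139.118.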

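Basic configuration

1) Install pacemaker with yum
Stop and disable the firewall:

[root@localhost ~]# systemctl stop firewalld
[root@localhost ~]# systemctl disable firewalld

Put SELinux into permissive mode. This setting is lost on reboot:

[root@localhost ~]# setenforce 0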
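
Because setenforce 0 only lasts until the next reboot, a persistent change means editing the SELINUX= line in the SELinux configuration file. The following is a minimal sketch of one way to do that. The function name set_selinux_mode is hypothetical, and the file path is passed as an argument so the function can be tried on a copy before it touches /etc/selinux/config.

```shell
#!/bin/sh
# Rewrite the SELINUX= line of an SELinux config file in place.
# Usage: set_selinux_mode <config-file> <enforcing|permissive|disabled>
# (set_selinux_mode is a hypothetical helper name, not a system command.)
set_selinux_mode() {
    file=$1
    mode=$2
    case $mode in
        enforcing|permissive|disabled) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # Only the SELINUX= line is touched; SELINUXTYPE= and comments are left alone.
    sed -i "s/^SELINUX=.*/SELINUX=$mode/" "$file"
}

# Example (run as root on each node; takes effect at the next boot):
#   set_selinux_mode /etc/selinux/config permissive
```

The change applies from the next boot, so it complements setenforce 0 rather than replacing it.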

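Install pacemaker with yum:

[root@localhost ~]# yum install -y fence-agents-all corosync pacemaker pcs

2) Configure the hostname and /etc/hosts on both nodes
Cluster nodes are normally identified by hostname, so the hostname of each node has to be changed. If the node names are resolved through DNS, resolution delays can slow down the whole cluster, so it is recommended that any cluster resolve node names through the hosts file rather than DNS.
Here 192.168.139.87 is named node1 and 192.168.139.88 is named node2.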

[root@localhost ~]# cat /etc/hostname
node1

Make the same changes to /etc/hosts on both nodes:

[root@node1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.139.87 node1
192.168.139.88 node2
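
Since the same two entries have to be added on both nodes, the edit can be scripted. The sketch below appends an entry only when the name is not already present, so it is safe to run more than once. add_host_entry is a hypothetical helper name, and the whole-word match is a simplification that assumes short names such as node1 and node2.

```shell
#!/bin/sh
# Append "ip name" to a hosts file unless the name is already mapped.
# Usage: add_host_entry <hosts-file> <ip> <name>
# (add_host_entry is a hypothetical helper name.)
add_host_entry() {
    file=$1
    ip=$2
    name=$3
    # Look for the name as a whole word on a non-comment line.
    if grep -v '^[[:space:]]*#' "$file" | grep -qw -- "$name"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example (run as root on both nodes):
#   add_host_entry /etc/hosts 192.168.139.87 node1
#   add_host_entry /etc/hosts 192.168.139.88 node2
```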

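After the changes, restart the network service on both nodes:

[root@localhost ~]# systemctl restart network.service

Once the network service has restarted, log in again to see the new hostname in the command prompt.
3) Configure SSH key access
Working across the two nodes over SSH would otherwise prompt for a password each time, so set up key-based access between them.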

[root@node1 ~]# ssh-keygen -t rsa 
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:AGh0gLjuPHJIjieBJ4HSRJfjtRU3RG/5nECVGoNW2ZM root@node1
The key's randomart image is:
+---[RSA 2048]----+
|.++o+.  .+=oo+.o |
|o +oo.. ..o++.E  |
|.= . o.o .  =+ . |
|= . . ..   ..+ . |
|+.      S     +  |
|+o.              |
|*=               |
|==o              |
|.+.              |
+----[SHA256]-----+

The command above generates a public/private key pair.
Use the following command to send the public key to node2:

[root@node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub root@node2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'node2 (192.168.139.88)' can't be established.
ECDSA key fingerprint is SHA256:RQsTAbyXXNJsWcT0xZBEjL3yLUWDntbDs7uJySaN2k4.
ECDSA key fingerprint is MD5:63:16:f4:1b:0f:ac:ff:53:96:65:80:15:d6:e7:93:0b.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@node2's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@node2'"
and check to make sure that only the key(s) you wanted were added.

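Next, repeat the steps above on node2 so that node1 and node2 can each reach the other with key authentication.
4) Configure the cluster user
pacemaker uses the user hacluster. The account is created when the software is installed but has no password yet, so set one now.

Change the password of user hacluster:
[root@node1 ~]# passwd hacluster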
Changing password for user hacluster.
New password: 
BAD PASSWORD: The password is shorter than 8 characters
Retype new password: 
passwd: all authentication tokens updated successfully.

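After changing the hacluster password on node1, do the same on node2. Note that the hacluster password should be identical on both nodes.
5) Configure authentication between the cluster nodes
Next, start the pcsd service and configure authentication between the nodes so that they can communicate with each other.

Start the pcsd service and enable it at boot.
pcsd must be running on both nodes:
[root@node1 ~]# systemctl start pcsd.service
[root@node1 ~]# systemctl enable pcsd.service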
Created symlink from /etc/systemd/system/multi-user.target.wants/pcsd.service to /usr/lib/systemd/system/pcsd.service.

Configure authentication between the nodes.
The following command only needs to be run on node1:

[root@node1 ~]# pcs cluster auth node1 node2
Username: hacluster
Password: 
node1: Authorized
node2: Authorized

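Pacemaker resource configuration

pacemaker can support many kinds of services, such as Apache, MySQL, and Xen. The available resource types include IP addresses, file systems, services, and fence devices. (A fence device is typically a remote management card; after a node fails, the cluster can use it to reboot the failed server.)

1) Configure Apache
Configure Apache on both node1 and node2.

Install Apache:
[root@node2 ~]# yum install -y httpd
Append the following to the end of the Apache configuration file:
[root@node2 ~]# tail /etc/httpd/conf/httpd.conf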
# Load config files in the "/etc/httpd/conf.d" directory, if any.
IncludeOptional conf.d/*.conf
# Listen address and server name
Listen 0.0.0.0:80
ServerName www.test.com
# Server status page, used by the cluster to check the service
<Location /server-status>
	SetHandler server-status
	Require all granted
</Location>
Create a minimal test page and check it:
[root@node2 ~]# echo "welcome to node2" > /var/www/html/index.html
[root@node2 ~]# systemctl start httpd.service
[root@node2 ~]# curl http://192.168.139.88
welcome to node2

pacemaker starts and stops the httpd service itself, so once Apache has been configured and tested on node1 and node2, stop httpd on both nodes:

[root@node1 ~]# systemctl stop httpd.service

2) Create and start the cluster
With the steps above complete, create a cluster on node1 and start it.

Create a cluster named mycluster
containing the nodes node1 and node2:
[root@node1 ~]# pcs cluster setup --name mycluster node1 node2
Destroying cluster on nodes: node1, node2...
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node1: Successfully destroyed cluster
node2: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node1', 'node2'
node1: successful distribution of the file 'pacemaker_remote authkey'
node2: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node1: Succeeded
node2: Succeeded

Synchronizing pcsd certificates on nodes node1, node2...
node1: Success
node2: Success
Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node2: Success
Start the cluster and enable it at boot:
[root@node1 ~]# pcs cluster start --all
node1: Starting Cluster (corosync)...
node2: Starting Cluster (corosync)...
node2: Starting Cluster (pacemaker)...
node1: Starting Cluster (pacemaker)...
[root@node1 ~]# pcs cluster enable --all
node1: Cluster Enabled
node2: Cluster Enabled
Check the cluster status:
[root@node1 ~]# pcs status
Cluster name: mycluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node1 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 09:41:10 2022
Last change: Fri Apr 15 09:38:17 2022 by hacluster via crmd on node1

2 nodes configured
0 resource instances configured

Online: [ node1 node2 ]

No resources


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

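After the cluster is created on node1, all settings are synchronized to node2. The cluster status shows that node1 and node2 are both online and that the services the cluster depends on are active and enabled.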
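
The "both nodes online" check can also be done in a script instead of by eye. The sketch below reads pcs status text from standard input and looks at the Online: line. The helper names online_nodes and all_online are hypothetical, and the parsing assumes the status layout shown above (the Pacemaker 1.1 format); other versions may print this line differently.

```shell
#!/bin/sh
# Print the node names from the "Online: [ ... ]" line of `pcs status`
# output, one per line. Reads the status text from standard input.
# (online_nodes and all_online are hypothetical helper names.)
online_nodes() {
    # Fields are: "Online:", "[", <node> ..., "]"
    awk '/^Online:/ { for (i = 3; i < NF; i++) print $i }'
}

# Succeed only if every node named on the command line is online.
# Usage: pcs status | all_online node1 node2
all_online() {
    online=$(online_nodes)
    for node in "$@"; do
        echo "$online" | grep -qx -- "$node" || return 1
    done
}

# Example (as root on either node):
#   pcs status | all_online node1 node2 && echo "both nodes online"
```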
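3) Add resources to the cluster
The "No resources" line in the cluster status from step 2) shows that the cluster has no resources yet. Next, add the VIP and the Web service.

Add an IP address resource named VIP.
It uses the IPaddr2 resource agent from the heartbeat provider of the OCF standard,
and the cluster checks the resource every 30s:
[root@node1 ~]# pcs resource create VIP ocf:heartbeat:IPaddr2 ip=192.168.139.118 cidr_netmask=24 nic=ens33 op monitor interval=30s
Add an Apache resource named web.
The resource is checked by requesting http://127.0.0.1/server-status:
[root@node1 ~]# pcs resource create web ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://127.0.0.1/server-status" op monitor interval=30s
Check the cluster status:
[root@node1 ~]# pcs status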
Cluster name: mycluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node1 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 10:08:08 2022
Last change: Fri Apr 15 10:03:42 2022 by root via cibadmin on node1

2 nodes configured
2 resource instances configured

Online: [ node1 node2 ]

Full list of resources:

 VIP	(ocf::heartbeat:IPaddr2):	Stopped
 web	(ocf::heartbeat:apache):	Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
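
To read a single resource's state out of this output, for example to wait until VIP is no longer Stopped, a small filter is enough. This is a sketch under the assumption that resource lines look like the ones above (name, agent in parentheses, state, optional node); resource_state is a hypothetical helper name.

```shell
#!/bin/sh
# Print the state (and node, if any) of one resource from `pcs status`
# output on standard input, e.g. "Started node1" or "Stopped".
# (resource_state is a hypothetical helper name.)
resource_state() {
    awk -v name="$1" '
        $1 == name && $2 ~ /^\(ocf::/ {
            if (NF >= 4) print $3, $4
            else         print $3
        }'
}

# Example (as root on either node):
#   pcs status | resource_state VIP
```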

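4) Adjust the resources
After adding the resources, they need to be adjusted so that VIP and web are "bound" together. Otherwise the VIP could end up on node1 while Apache runs on node2. The cluster might also start Apache first and only then bring up the VIP, which is the wrong order.

Method 1: bind the resources with a group.
Add VIP and web to the group myweb:
[root@node1 ~]# pcs resource group add myweb VIP
[root@node1 ~]# pcs resource group add myweb web
Method 2: use a colocation constraint:
[root@node1 ~]# pcs constraint colocation add web with VIP INFINITY
(Use either method 1 or method 2, not both.)
Set the start and stop order of the resources.
Start VIP first, then web:
[root@node1 ~]# pcs constraint order start VIP then start web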
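
The two binding methods can be laid side by side as follows. A group keeps its members on the same node and starts them in the order listed, so the group form covers both placement and ordering, while the constraint form needs a colocation constraint plus an order constraint. The sketch only prints the commands; bind_commands is a hypothetical helper name, and the resource and group names are the ones used in this article.

```shell
#!/bin/sh
# Print the pcs commands for one of the two binding methods.
# Usage: bind_commands <group|constraint>
# (bind_commands is a hypothetical helper; it prints, it does not execute.)
bind_commands() {
    case $1 in
        group)
            # Group members are colocated and started in the listed order.
            echo "pcs resource group add myweb VIP web"
            ;;
        constraint)
            # Keep web on the node that holds VIP, and start VIP first.
            echo "pcs constraint colocation add web with VIP INFINITY"
            echo "pcs constraint order start VIP then start web"
            ;;
        *)
            echo "usage: bind_commands <group|constraint>" >&2
            return 1
            ;;
    esac
}

# Example: review the commands before running them as root on node1.
#   bind_commands constraint
```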

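5) Priority
If node1 and node2 have different hardware, adjust the node priority so that resources run on the better-equipped server and move to the lower-spec server only after it fails. This is done by configuring a priority (which pacemaker calls a Location).

Adjust the Location.
A larger value means a higher priority:
[root@node1 ~]# pcs constraint location web prefers node1=10
[root@node1 ~]# pcs constraint location web prefers node2=5

Disable STONITH, since this setup has no fence device, and display the allocation scores:
[root@node1 ~]# pcs property set stonith-enabled=false
[root@node1 ~]# crm_simulate -sL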

Current cluster status:
Online: [ node1 node2 ]

 VIP	(ocf::heartbeat:IPaddr2):	Started node1
 web	(ocf::heartbeat:apache):	Starting node1

Allocation scores:
pcmk__native_allocate: VIP allocation score on node1: 10
pcmk__native_allocate: VIP allocation score on node2: 5
pcmk__native_allocate: web allocation score on node1: 10
pcmk__native_allocate: web allocation score on node2: -INFINITY
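
The basic idea behind the location scores, that the node with the highest score is preferred, can be illustrated with a few lines of shell. This is a deliberately simplified model, not Pacemaker's actual algorithm: the real allocation also accounts for colocation (which is why web shows -INFINITY on node2 above), stickiness, and failures. preferred_node is a hypothetical name, and the sketch handles plain integer scores only.

```shell
#!/bin/sh
# Given node=score pairs, print the node with the highest score.
# Simplified model: Pacemaker itself also factors in colocation,
# stickiness and failure counts. Integer scores only (no INFINITY).
# (preferred_node is a hypothetical helper name.)
preferred_node() {
    best_node=
    best_score=
    for pair in "$@"; do
        node=${pair%%=*}
        score=${pair#*=}
        if [ -z "$best_node" ] || [ "$score" -gt "$best_score" ]; then
            best_node=$node
            best_score=$score
        fi
    done
    echo "$best_node"
}

preferred_node node1=10 node2=5   # prints node1
```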

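Tip: no fence device is configured in this walkthrough, so the cluster may report errors at startup and leave the resources stopped. The command pcs property set stonith-enabled=false disables fencing, which is acceptable for a test setup like this one. Fence devices can also be used with virtual machines running on Linux; see the relevant documentation for how to use virtual fence devices.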

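Testing Pacemaker

With the configuration above, the pacemaker cluster is now set up. Restart the cluster so that all settings take effect.

Stop the cluster on all nodes:

[root@node1 ~]# pcs cluster stop --all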
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (corosync)...
node1: Stopping Cluster (corosync)...

Start the cluster on all nodes:

[root@node1 ~]# pcs cluster start --all
node1: Starting Cluster (corosync)...
node2: Starting Cluster (corosync)...
node1: Starting Cluster (pacemaker)...
node2: Starting Cluster (pacemaker)...

Check the cluster status:

[root@node1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 17:28:21 2022
Last change: Fri Apr 15 17:26:54 2022 by root via cibadmin on node1

2 nodes configured
2 resource instances configured

Online: [ node1 node2 ]

Full list of resources:

 VIP	(ocf::heartbeat:IPaddr2):	Started node1
 web	(ocf::heartbeat:apache):	Starting node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Verify that the VIP is active:

[root@node1 ~]# ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  inet 127.0.0.1/8 scope host lo
     valid_lft forever preferred_lft forever
  inet6 ::1/128 scope host 
     valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
  link/ether 00:0c:29:4c:cd:2b brd ff:ff:ff:ff:ff:ff
  inet 192.168.139.87/24 brd 192.168.139.255 scope global noprefixroute ens33
     valid_lft forever preferred_lft forever
  inet 192.168.139.118/24 brd 192.168.139.255 scope global secondary ens33
     valid_lft forever preferred_lft forever
  inet6 fe80::20c:29ff:fe4c:cd2b/64 scope link 
     valid_lft forever preferred_lft forever
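
The same check can be scripted by searching the ip output for the VIP. A minimal sketch follows; has_ipv4 is a hypothetical helper name, and it relies only on the "inet ADDRESS/PREFIX" form visible in the output above.

```shell
#!/bin/sh
# Succeed if the `ip address show` text on stdin lists the given IPv4 address.
# Usage: ip -4 address show | has_ipv4 192.168.139.118
# (has_ipv4 is a hypothetical helper name.)
has_ipv4() {
    # -F treats the dots literally; the trailing "/" stops 192.168.139.11
    # from matching 192.168.139.118.
    grep -qF -- "inet $1/"
}

# Example (as root on node1):
#   ip -4 address show | has_ipv4 192.168.139.118 && echo "VIP is on this node"
```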

Verify that Apache is running:

[root@node1 ~]# ps aux | grep httpd
root      17855  0.0  0.0 112808   968 pts/0    S+   17:31   0:00 grep --color=auto httpd
[root@node1 ~]# curl http://192.168.139.118
welcome to node1

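After startup, the VIP normally sits on the primary node 192.168.139.87. If the primary node fails, node2 takes over the service automatically. To test this, reboot node1 and watch whether the standby node takes over the primary node's resources.

Reboot node1, then check the status on node2:

[root@node1 ~]# reboot
[root@node2 ~]# pcs status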
Cluster name: mycluster
Stack: corosync
Current DC: node2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 17:29:38 2022
Last change: Fri Apr 15 17:26:54 2022 by root via cibadmin on node1

2 nodes configured
2 resource instances configured

Online: [ node1 node2 ]

Full list of resources:

 VIP	(ocf::heartbeat:IPaddr2):	Started node2
 web	(ocf::heartbeat:apache):	Starting node2

Failed Resource Actions:
...

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

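When node1 fails, node2 stops receiving its heartbeat. Once the configured timeout expires, node2 starts taking over the resources, and the output above shows that node2 has taken over VIP and web. If node1 recovers and a priority has been set, node1 takes VIP and web back.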
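
One way to observe the takeover from a client machine is to request the VIP repeatedly while node1 reboots and watch the page change from "welcome to node1" to "welcome to node2". The loop below is a sketch under stated assumptions: watch_vip and the FETCH variable are hypothetical names introduced here, and curl with a two-second timeout is just one reasonable choice of fetch command.

```shell
#!/bin/sh
# Poll a URL a fixed number of times and print what each request returned,
# so a switch between nodes (or a gap of failures in between) is visible.
# FETCH is a hypothetical hook naming the command used to fetch the URL.
FETCH=${FETCH:-curl -s --max-time 2}

# Usage: watch_vip <url> <count> <interval-seconds>
# (watch_vip is a hypothetical helper name.)
watch_vip() {
    url=$1
    count=$2
    interval=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # $FETCH is left unquoted on purpose: it holds a command plus options.
        if body=$($FETCH "$url"); then
            echo "$i: $body"
        else
            echo "$i: no response"
        fi
        i=$((i + 1))
        if [ "$i" -le "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (from a client machine, started just before rebooting node1):
#   watch_vip http://192.168.139.118 30 1
```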