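Preface
To keep a system highly available, critical services are often deployed as a two-node hot-standby pair. This walkthrough does so for a simple Web service.
Preparation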
IP address | Role |
192.168.139.87 | Primary node |
192.168.139.88 | Standby node |
192.168.139.118 | Virtual IP (VIP) |
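This example works as follows: under normal conditions 192.168.139.87 serves requests, and clients reach the cluster's resources through the VIP held by the primary node. When the primary node fails, the standby node automatically takes over the primary node's IP resource, namely the VIP 192.168.139.118.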
Basic configuration
1) Install pacemaker with yum
Stop and disable the firewall
[root@localhost ~] systemctl stop firewalld
[root@localhost ~] systemctl disable firewalld
Switch SELinux to permissive mode (setenforce 0 does not persist across a reboot)
[root@localhost ~] setenforce 0
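Because setenforce 0 only lasts until the next reboot, the permissive setting can also be written to the SELinux configuration file. The following is a minimal sketch, assuming the stock /etc/selinux/config layout with a single SELINUX= line and GNU sed; the function name set_selinux_permissive is made up for illustration.

```shell
# Hypothetical helper: rewrite the SELINUX= line of a config file to "permissive".
# The path defaults to /etc/selinux/config; pass another path to try it on a copy.
set_selinux_permissive() {
    cfg="${1:-/etc/selinux/config}"
    # "^SELINUX=" does not match the SELINUXTYPE= line, which is left untouched.
    sed -i 's/^SELINUX=.*/SELINUX=permissive/' "$cfg"
}
```

Running set_selinux_permissive with no argument as root on each node would keep SELinux permissive after a reboot.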
Install pacemaker with yum
[root@localhost ~] yum install -y fence-agents-all corosync pacemaker pcs
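2) Configure the hostname and /etc/hosts on both nodes
Cluster nodes are usually identified by hostname, so the hostname of each node needs to be set. If node names are resolved through DNS, resolution latency can slow down the whole cluster, so it is advisable for any cluster to resolve node names from the hosts file rather than DNS.
Here 192.168.139.87 is set to node1 and 192.168.139.88 to node2.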
[root@localhost ~] cat /etc/hostname
node1
/etc/hosts is edited identically on both nodes
[root@node1 ~] cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.139.87 node1
192.168.139.88 node2
After the changes, restart the network service on both nodes
[root@localhost ~] systemctl restart network.service
Once the network service has restarted, log in again to see the new hostname in the command prompt.
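To confirm that both node names map to the intended addresses in /etc/hosts, a small lookup helper can be used. This is a sketch only; hosts_lookup is a hypothetical name, and it reads the file directly rather than going through the system resolver.

```shell
# Hypothetical helper: print the address mapped to a host name in a hosts-format file.
# Usage: hosts_lookup NAME [FILE]   (FILE defaults to /etc/hosts)
hosts_lookup() {
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break           # stop at an inline comment
                if ($i == n) { print $1; exit }
            }
        }
    ' "$file"
}
```

On either node, hosts_lookup node1 and hosts_lookup node2 should print the two node addresses listed in the preparation table.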
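3) Configure ssh key-based access
In this setup the nodes reach each other over ssh, and ssh asks for a password by default, so key-based access needs to be configured.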
[root@node1 ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:AGh0gLjuPHJIjieBJ4HSRJfjtRU3RG/5nECVGoNW2ZM root@node1
The key's randomart image is:
+---[RSA 2048]----+
|.++o+. .+=oo+.o |
|o +oo.. ..o++.E |
|.= . o.o . =+ . |
|= . . .. ..+ . |
|+. S + |
|+o. |
|*= |
|==o |
|.+. |
+----[SHA256]-----+
The command above generates a public/private key pair.
Use the following command to send the public key to node2
[root@node1 ~] ssh-copy-id -i /root/.ssh/id_rsa.pub root@node2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
The authenticity of host 'node2 (192.168.139.88)' can't be established.
ECDSA key fingerprint is SHA256:RQsTAbyXXNJsWcT0xZBEjL3yLUWDntbDs7uJySaN2k4.
ECDSA key fingerprint is MD5:63:16:f4:1b:0f:ac:ff:53:96:65:80:15:d6:e7:93:0b.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@node2's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'root@node2'"
and check to make sure that only the key(s) you wanted were added.
Then repeat the steps above on node2 so that node1 and node2 can each reach the other with keys.
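4) Configure the cluster user
pacemaker uses the user hacluster. The account is created when the packages are installed but has no password yet, so one needs to be set.
Set the password for user hacluster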
[root@node1 ~] passwd hacluster
Changing password for user hacluster.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.
After setting the hacluster password on node1, do the same on node2. Note that the hacluster password should be the same on both nodes.
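5) Configure authentication between cluster nodes
Next, start the pcsd service and configure authentication between the nodes so that they can communicate with each other.
Start the pcsd service and enable it at boot
pcsd needs to be started on both nodes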
[root@node1 ~] systemctl start pcsd.service
[root@node1 ~] systemctl enable pcsd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/pcsd.service to /usr/lib/systemd/system/pcsd.service.
Configure authentication between the nodes
The following command only needs to be run on node1
[root@node1 ~] pcs cluster auth node1 node2
Username: hacluster
Password:
node1: Authorized
node2: Authorized
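The pcs cluster auth form shown above is the pcs 0.9 syntax, which matches the pacemaker 1.1 packages used in this walkthrough. From pcs 0.10 onwards the corresponding subcommand is pcs host auth. The helper below is a sketch for picking the right form from a version string such as the one printed by pcs --version; the name auth_command_for is hypothetical and it assumes a plain MAJOR.MINOR[.PATCH] version.

```shell
# Hypothetical helper: choose the node authentication subcommand for a pcs version.
# Usage: auth_command_for 0.9.169
auth_command_for() {
    version="$1"
    major="${version%%.*}"      # text before the first dot
    rest="${version#*.}"        # text after the first dot
    minor="${rest%%.*}"         # second version component
    if [ "$major" -gt 0 ] || [ "$minor" -ge 10 ]; then
        echo "pcs host auth"
    else
        echo "pcs cluster auth"
    fi
}
```

For example, $(auth_command_for "$(pcs --version)") node1 node2 would run the form that fits the installed pcs.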
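Pacemaker resource configuration
pacemaker can manage many kinds of services, such as Apache, MySQL, and Xen. Available resource types include IP addresses, file systems, services, and fence devices (a fence device is typically a remote management card; when a node fails, the cluster can use it to reboot the failed server).
1) Configure Apache
Configure Apache on both node1 and node2
Install Apache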
[root@node2 ~] yum install -y httpd
Append the following to the end of the Apache configuration file
[root@node2 ~] tail /etc/httpd/conf/httpd.conf
# Load config files in the "/etc/httpd/conf.d" directory, if any.
IncludeOptional conf.d/*.conf
# Listen address and server name
Listen 0.0.0.0:80
ServerName www.test.com
# Expose the server status page so the cluster can monitor Apache
<Location /server-status>
SetHandler server-status
Require all granted
</Location>
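The status block above grants access to every client. Since the cluster in this walkthrough polls the status page at 127.0.0.1, a tighter variant could restrict it to local requests. The sketch below appends such a block idempotently; it swaps Require all granted for Require local (an Apache 2.4 directive), which is an assumption about what the reader wants rather than the configuration used in this walkthrough, and the function name append_status_location is hypothetical.

```shell
# Hypothetical helper: append a locally restricted server-status block to a config file.
# Does nothing if the file already contains a /server-status Location block.
append_status_location() {
    conf="$1"
    if grep -q '<Location /server-status>' "$conf"; then
        return 0
    fi
    cat >> "$conf" <<'EOF'
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF
}
```

It would be called as append_status_location /etc/httpd/conf/httpd.conf on each node before httpd is started.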
Create a minimal test page and check it
[root@node2 ~] echo "welcome to node2" > /var/www/html/index.html
[root@node2 ~] systemctl start httpd.service
[root@node2 ~] curl http://192.168.139.88
welcome to node2
pacemaker controls starting and stopping the httpd service, so after configuring and testing Apache on node1 and node2, stop httpd on both nodes
[root@node1 ~] sudo systemctl stop httpd.service
2) Create and start the cluster
With the steps above done, a cluster can be created and started from node1.
Create a cluster named mycluster
The cluster nodes are node1 and node2
[root@node1 ~] pcs cluster setup --name mycluster node1 node2
Destroying cluster on nodes: node1, node2...
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node1: Successfully destroyed cluster
node2: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'node1', 'node2'
node1: successful distribution of the file 'pacemaker_remote authkey'
node2: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node1: Succeeded
node2: Succeeded
Synchronizing pcsd certificates on nodes node1, node2...
node1: Success
node2: Success
Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node2: Success
Start the cluster and enable it at boot
[root@node1 ~] pcs cluster start --all
node1: Starting Cluster (corosync)...
node2: Starting Cluster (corosync)...
node2: Starting Cluster (pacemaker)...
node1: Starting Cluster (pacemaker)...
[root@node1 ~] pcs cluster enable --all
node1: Cluster Enabled
node2: Cluster Enabled
Check the cluster status
[root@node1 ~] pcs status
Cluster name: mycluster
WARNINGS:
No stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: node1 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 09:41:10 2022
Last change: Fri Apr 15 09:38:17 2022 by hacluster via crmd on node1
2 nodes configured
0 resource instances configured
Online: [ node1 node2 ]
No resources
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
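After the cluster is created on node1, all settings are synchronized to node2. The status output shows that node1 and node2 are both online and that the services the cluster depends on are active and enabled.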
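3) Add resources to the cluster
The "No resources" line in the status output from step 2) shows that the cluster has no resources yet. Next, add the VIP and the service.
Add an IP address resource named VIP
It uses the IPaddr2 resource agent from the ocf:heartbeat provider
The cluster checks the resource every 30s
[root@node1 ~] pcs resource create VIP ocf:heartbeat:IPaddr2 ip=192.168.139.118 cidr_netmask=24 nic=ens33 op monitor interval=30s
Add an Apache resource named web
The resource is monitored by requesting http://127.0.0.1/server-status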
[root@node1 ~] pcs resource create web ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://127.0.0.1/server-status" op monitor interval=30s
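IPaddr2 adds the VIP as an extra address on the given interface, so with cidr_netmask=24 the VIP would normally share its first three octets with the node's own address on that interface. A quick sanity check before creating the resource could look like the sketch below; same_slash24 is a hypothetical name, and the check only applies to a /24 netmask.

```shell
# Hypothetical helper: succeed if two dotted-quad IPv4 addresses share the same /24.
# Usage: same_slash24 VIP NODE_IP
same_slash24() {
    # Strip the last octet from each address and compare the remaining prefixes.
    [ "${1%.*}" = "${2%.*}" ]
}
```

For instance, same_slash24 192.168.139.118 192.168.139.87 && echo "VIP is in the node subnet" prints the message for the addresses in the preparation table.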
Check the cluster status
[root@node1 ~] pcs status
Cluster name: mycluster
WARNINGS:
No stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: node1 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 10:08:08 2022
Last change: Fri Apr 15 10:03:42 2022 by root via cibadmin on node1
2 nodes configured
2 resource instances configured
Online: [ node1 node2 ]
Full list of resources:
VIP (ocf::heartbeat:IPaddr2): Stopped
web (ocf::heartbeat:apache): Stopped
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
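4) Adjust the resources
After the resources are added they still need adjusting so that VIP and web are "bound" together; otherwise the VIP could end up on node1 while Apache runs on node2. Another possible problem is that the cluster starts Apache first and the VIP afterwards, which is the wrong order.
Option 1: bind the resources with a group
Add VIP and web to the group myweb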
[root@node1 ~] pcs resource group add myweb VIP
[root@node1 ~] pcs resource group add myweb web
Option 2: use a colocation constraint
[root@node1 ~] pcs constraint colocation add web VIP INFINITY
(Use either option 1 or option 2, not both.)
Set the start and stop order of the resources
Start VIP first, then start web
[root@node1 ~] pcs constraint order start VIP then start web
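5) Priority
If node1 and node2 have different hardware, adjust the node priorities so that resources run on the better-equipped server and move to the lower-spec server only after it fails. This is done with location constraints (pacemaker calls this Location).
Adjust Location
A higher score means a higher priority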
[root@node1 ~] pcs constraint location web prefers node1=10
[root@node1 ~] pcs constraint location web prefers node2=5
[root@node1 ~] pcs property set stonith-enabled=false
[root@node1 ~] crm_simulate -sL
Current cluster status:
Online: [ node1 node2 ]
VIP (ocf::heartbeat:IPaddr2): Started node1
web (ocf::heartbeat:apache): Starting node1
Allocation scores:
pcmk__native_allocate: VIP allocation score on node1: 10
pcmk__native_allocate: VIP allocation score on node2: 5
pcmk__native_allocate: web allocation score on node1: 10
pcmk__native_allocate: web allocation score on node2: -INFINITY
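The allocation scores printed by crm_simulate -sL can be reduced to the node a resource currently prefers. The sketch below reads that output on standard input and prints the highest-scoring node for one resource. It assumes lines shaped like the ones above ("<function>: <resource> allocation score on <node>: <score>"), treats INFINITY as 1000000 (Pacemaker's value for it), and skips -INFINITY; preferred_node is a hypothetical name.

```shell
# Hypothetical helper: print the node with the highest allocation score for a resource.
# Usage: crm_simulate -sL | preferred_node web
preferred_node() {
    awk -v r="$1" '
        $2 == r && $3 == "allocation" && $4 == "score" && $5 == "on" {
            node = $6
            sub(/:$/, "", node)                    # drop the trailing colon
            score = $7
            if (score == "-INFINITY") next         # node is ruled out
            if (score == "INFINITY" || score == "+INFINITY") score = 1000000
            if (!found || score + 0 > best) {
                best = score + 0
                winner = node
                found = 1
            }
        }
        END { if (found) print winner }
    '
}
```

If the output lists a resource more than once (for example from several allocation functions), the highest score seen wins, which may or may not reflect the final placement.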
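Note: no fence device is configured in this walkthrough, so the cluster may report errors at startup; the command pcs property set stonith-enabled=false disables fencing. Fence devices can also be used with virtual machines on Linux; see the relevant documentation for how to use virtual fence devices.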
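Pacemaker testing
With the configuration above, the pacemaker cluster is set up. Restart the cluster so that all settings take effect.
Stop the cluster on all nodes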
[root@node1 ~] pcs cluster stop --all
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (corosync)...
node1: Stopping Cluster (corosync)...
Start the cluster on all nodes
[root@node1 ~] pcs cluster start --all
node1: Starting Cluster (corosync)...
node2: Starting Cluster (corosync)...
node1: Starting Cluster (pacemaker)...
node2: Starting Cluster (pacemaker)...
Check the cluster status
[root@node1 ~] pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 17:28:21 2022
Last change: Fri Apr 15 17:26:54 2022 by root via cibadmin on node1
2 nodes configured
2 resource instances configured
Online: [ node1 node2 ]
Full list of resources:
VIP (ocf::heartbeat:IPaddr2): Started node1
web (ocf::heartbeat:apache): Starting node1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Verify that the VIP is active
[root@node1 ~] ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:4c:cd:2b brd ff:ff:ff:ff:ff:ff
inet 192.168.139.87/24 brd 192.168.139.255 scope global noprefixroute ens33
valid_lft forever preferred_lft forever
inet 192.168.139.118/24 brd 192.168.139.255 scope global secondary ens33
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fe4c:cd2b/64 scope link
valid_lft forever preferred_lft forever
Verify that Apache is running
[root@node1 ~] ps aux | grep httpd
root 17855 0.0 0.0 112808 968 pts/0 S+ 17:31 0:00 grep --color=auto httpd
[root@node1 ~] curl http://192.168.139.118
welcome to node1
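After startup, the VIP normally sits on the primary node 192.168.139.87. If the primary node fails, node2 takes over the service automatically. To test this, reboot node1 and watch whether the standby node takes over the primary node's resources.
Reboot node1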
[root@node1 ~] reboot
[root@node2 ~] pcs status
Cluster name: mycluster
Stack: corosync
Current DC: node2 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr 15 17:29:38 2022
Last change: Fri Apr 15 17:26:54 2022 by root via cibadmin on node1
2 nodes configured
2 resource instances configured
Online: [ node1 node2 ]
Full list of resources:
VIP (ocf::heartbeat:IPaddr2): Started node2
web (ocf::heartbeat:apache): Starting node2
Failed Resource Actions:
...
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
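When node1 fails, node2 stops receiving heartbeats from it; once the configured timeout elapses, node2 begins taking over the resources. The output above shows VIP and web being taken over by node2. If node1 recovers and a priority has been set, VIP and web move back to node1.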
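To watch a failover without reading the full status each time, the node that currently runs a resource can be pulled out of the pcs status output. The sketch below assumes resource lines shaped like the ones above ("<name> (<agent>): Started <node>") and prints nothing while the resource is in any other state; resource_node is a hypothetical name.

```shell
# Hypothetical helper: print the node on which a resource is reported as Started.
# Usage: pcs status | resource_node VIP
resource_node() {
    awk -v r="$1" '$1 == r && $3 == "Started" { print $4; exit }'
}
```

Running pcs status | resource_node VIP on node2 before and after rebooting node1 shows whether the VIP has moved.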