corosync是集群框架引擎程序,pacemaker是高可用集群资源管理器,crmsh是pacemaker的命令行工具。
一、NTP对时,免密钥登陆
[root@node-1 ~]# vim /etc/hosts
192.168.43.128 node-2
192.168.43.129 node-1
[root@node-1 ~]# ssh-keygen
[root@node-1 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@node-2
[root@node-1 corosync]# scp /etc/hosts node-2:/etc/hosts
[root@node-1 ~]# ssh node-2
[root@node-1 ~]#  yum install ntp -y 
[root@node-2 ~]# hwclock -s  //将硬件主板时钟设为系统时钟,比ntpdate和date -s命令强多了[root@node-2 ~]# ssh-keygen
[root@node-2 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@node-1
[root@node-2 ~]# ssh node-1
[root@node-2 ~]# yum install ntp -y 二、安装corosync、pacemaker
[root@node-1 corosync]# yum install corosync pacemaker -y    //centos自带源即可,也可以只安装pcs即可。
[root@node-2 ~]# yum install corosync pacemaker -y
[root@node-1 ~]# vim /etc/yum.repos.d/crm.repo
--------------------------------
[network_ha-clustering_Stable]
name=Stable High Availability/Clustering packages (CentOS_CentOS-7)
type=rpm-md
baseurl=http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/
gpgcheck=1
gpgkey=http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/repodata/repomd.xml.key
enabled=1
-------------------------------
[root@node-1 ~]# yum install crmsh -y[root@node-1 corosync]# cd /etc/corosync
[root@node-1 corosync]# cp corosync.conf.example corosync.conf
[root@node-1 corosync]# vim corosync.conf
		bindnetaddr: 192.168.43.0
service {
        var: 0
        name: pacemaker            #表示启动pacemaker
}
------------------------
corosync的节点直接需要密钥的。
[root@node-1 corosync]# mv /dev/{random,random.bak}
[root@node-1 corosync]# ln -s /dev/urandom /dev/random
[root@node-1 corosync]# corosync-keygen 
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.
[root@node-1 corosync]# scp corosync.conf authkey root@node-2:/etc/corosync/
[root@node-1 corosync]# systemctl start corosync;ssh node-2 systemctl start corosync  //两台机器同时启动corosync服务=====================
马哥运维理论:
资源管理层(pacemaker负责仲裁指定谁是活动节点、IP地址的转移、本地资源管理系统)、消息传递层负责心跳信息(heartbeat、corosync)、Resource Agent(理解为服务脚本)负责服务的启动、停止、查看状态。多个节点上允许多个不同服务,剩下的2个备节点称为故障转移域,主节点所在位置只是相对的,同样,第三方仲裁也是相对的。vote system:少数服从多数。当故障节点修复后,资源返回来称为failback,当故障节点修复后,资源仍在备用节点,称为failover。
CRM:cluster resource manager ===>pacemaker心脏起搏器,每个节点都要一个crmd(5560/tcp)的守护进程,有命令行接口crmsh和pcs(在heartbeat v3,红帽提出的)编辑xml文件,让crmd识别并负责资源服务的处理。也就是说crmsh和pcs等价。
Resource Agent,OCF(open cluster framework)
primtive:主资源,在集群中只运行一个实例。clone:克隆资源,在集群中可运行多个实例。每个资源都有一定的优先级。
无穷大+负无穷大=负无穷大。主机名要和DNS解析的名称相同才行。一、安装pcs管理工具
[root@node-1 ~]# ansible corosync -m service -a "name=pcsd state=started enabled=yes"  //下载ansible,定义主机组为corosync
[root@node-1 ~]# systemctl status pcsd ;ssh node-2 "systemctl status pcsd"
[root@node-1 ~]# ansible corosync -m shell -a "echo "passw0rd"|passwd --stdin hacluster"  ##单独创建用户,并设定密码,让用户名进行认证。
[root@node-1 ~]# pcs cluster auth node-2 node-1 ##本机的pcs客户端向pcsd的守护进程发起请求,如果向远端node-1的pcsd进行认证不通过,可能是firewalld的关系
Username: hacluster
Password: 
node-1: Authorized
node-2: Authorized
[root@node-2 yum.repos.d]# pcs cluster auth node-1 node-2  //最好进行双向认证。
Username: hacluster
Password: 
node-1: Authorized
node-2: Authorized二、建立集群
 [root@node-1 corosync]# pcs cluster setup --name mycluster node-1 node-2 --force
[root@node-2 corosync]# cat corosync.conf  //执行完创建集群的命令后,会在节点之间单独产生一个配置文件
totem {
    version: 2
    secauth: off
    cluster_name: mycluster
    transport: udpu
}nodelist {
    node {
        ring0_addr: node-1
        nodeid: 1
    }    node {
        ring0_addr: node-2
        nodeid: 2
    }
}quorum {
    provider: corosync_votequorum
    two_node: 1
}logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}解释:totem是两个节点进行心跳传播的协议,ring 0代表不需要向任何信息就能到达。
[root@node-1 ~]# pcs cluster start
[root@node-1 ~]# pcs cluster status
Cluster Status:
 Stack: unknown
 Current DC: NONE
 Last updated: Sat Oct 28 20:17:56 2017
 Last change: Sat Oct 28 20:17:52 2017 by hacluster via crmd on node-1
 2 nodes configured
 0 resources configured
PCSD Status:
  node-2: Online
  node-1: Online
[root@node-2 ~]# pcs cluster start  ##每个节点要单独启动pcsd守护进程。
Starting Cluster...
[root@node-2 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
	id	= 192.168.43.128
	status	= ring 0 active with no faults
[root@node-2 ~]# corosync-cmapctl |grep members   ##检查当前的集群成员情况
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.43.129) 
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.43.128) 
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@node-1 ~]# pcs status    ##DC(Designated Coordinator)的意思是说指定的协调员
每个node都有CRM,会有一个被选为DC,是整个Cluster的大脑,这个DC控制的CIB(cluster information base)是master CIB,其他的CIB都是副本
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false   ##stonith没有启用隔离设备,也就是说在抢占资源的时候直接把对方给爆头
Stack: corosync
Current DC: node-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Sat Oct 28 20:28:01 2017
Last change: Sat Oct 28 20:18:13 2017 by hacluster via crmd on node-1
2 nodes configured
0 resources configured
Online: [ node-1 node-2 ]
No resources
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@node-2 ~]# pcs status corosync
Membership information
----------------------
    Nodeid      Votes Name
         2          1 node-2 (local)
         1          1 node-1
[root@node-1 ~]# crm_verify -L -V  ##crm_verify命令用来验证当前的集群配置是否有错误
   error: unpack_resources:	Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources:	Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources:	NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
[root@node-1 ~]# pcs property set stonith-enabled=false
[root@node-1 ~]# pcs property list  ##查看已经更改过的集群属性,如果是全局的,使用pcs property --all
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.16-12.el7_4.4-94ff4df
 have-watchdog: false
 stonith-enabled: false三、安装crmsh命令行集群管理工具
[root@node-1 yum.repos.d]#  wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
crm(live)# configure
crm(live)configure# edit   ##编辑集群属性,类似于vim模式,修改后保存退出。crm部署web service:
	VIP:
	httpd:
两个节点安装httpd,注意,只能停止httpd服务,而不能重启,并且不能设置为开机自启动,因为resource manager会自动管理这些服务的运行或停止。
node-1和node-2均做以下步骤:
[root@node-2 ~]# systemctl start httpd
[root@node-2 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
[root@node-1 ~]# systemctl start httpd   ##httpd不能够设置为enable,得靠crm自己管理
[root@node-1 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
此时,可以从浏览器访问2个节点的web界面
[root@node-2 ~]# crm
crm(live)# status ##必须保证所有节点都上线,才执行那些命令
crm(live)# ra
crm(live)ra# list systemd
httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive1、添加webIP资源
crm(live)ra# classes
crm(live)ra# list ocf  ##ocf是classes
crm(live)ra# info ocf:IPaddr ##IPaddr是provider
crm(live)configure# primitive WebIP ocf:IPaddr params ip=192.168.43.120
crm(live)configure# show
node 1: node-1
node 2: node-2
primitive WebIP IPaddr \
	params ip=192.168.43.120
property cib-bootstrap-options: \
	have-watchdog=false \
	dc-version=1.1.13-10.el7-44eb2dd \
	cluster-infrastructure=corosync \
	cluster-name=mycluster \
	stonith-enabled=false
crm(live)configure# verify
crm(live)configure# commit
crm(live)# status
 WebIP	(ocf::heartbeat:IPaddr):	Stopped
2、添加webservice资源
crm(live)configure# primitive WebServer systemd:httpd   ##systemd是classes命令看到的
crm(live)configure# verify
WARNING: WebServer: default timeout 20s for start is smaller than the advised 100
WARNING: WebServer: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# commit3、webip和webserver绑定组资源
crm(live)configure# help group
crm(live)configure# group WebService WebIP WebServer  ##它们之间是有顺序的,IP在哪儿,webserver就在哪儿
crm(live)configure# verify
WARNING: WebServer: default timeout 20s for start is smaller than the advised 100
WARNING: WebServer: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# commitcrm(live)configure# node standby   ##把当前节点设为备节点
四、如何保证某节点故障而后上线,资源不会从另一个节点转移回来?
学习文档:http://blog.51cto.com/nmshuishui/1399811 
+++++++++++++++++++++++++++++++排错笔记++++++++++++++++++++++++++
1、node-1节点执行crm status发现OFFLINE: [ node-1 node-2 ] ,node-2节点执行crm status发现Online: [ node-2 ],OFFLINE: [ node-1 ] ?
解决:NTP不对时问题
(1)[root@node-2 ~]# systemctl status pcsd;ssh node-1 "systemctl status pcsd"  ##均正常
[root@node-2 ~]# systemctl status corosync;ssh node-1 "systemctl status corosync"  ##均为active
两节点均可以ping通和互相SSH,于是查看corosync和pcsd日志,无明显error
(2)怀疑认证密钥不通过了,结果不是
[root@node-1 ~]# pcs cluster auth node-1 node-2 
node-1: Already authorized
node-2: Already authorized
[root@node-2 ~]# pcs cluster auth node-1 node-2 
node-1: Already authorized
node-2: Already authorized
(3)[root@node-1 ~]# crm status  ##原因是packmaker挂了,[root@node-1 ~]# systemctl status crm_mon 
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
(4)[root@node-1 ~]# systemctl status pacemaker  ##看了博客才发觉NTP又没同步过来
Active: failed (Result: exit-code)
[root@node-1 ~]#  vim /etc/ntp.conf
server 192.168.43.128  burst iburst prefer
[root@node-2 ~]# vim /etc/ntp.conf 
server 127.127.1.0
fudge 127.127.1.0 stratum 10
发现重启NTP还是没有卵用,只能date -s "23:52:10"了
[root@node-1 ~]# date ; ssh node-2 "date"
2017年 12月 01日 星期五 23:57:55 CST
2017年 12月 01日 星期五 23:57:56 CST
(5)最后,两个节点重启systemctl restart pacemaker,运行crm status,卧槽,终于Online: [ node-1 node-2 ]了。
参考文档:http://blog.51cto.com/nmshuishui/13998112、corosync服务起不来,进而导致pacemaker服务无法启动?
报错:[root@node-2 ~]# crm status
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
[root@node-2 ~]# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Dec 04 19:57:28 node-2 systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
Dec 04 19:57:28 node-2 systemd[1]: Job pacemaker.service/start failed with result 'dependency'.解决:节点更换了IP地址,忘了更新hosts文件。注意:是所有节点都要更新Hosts文件
[root@node-2 ~]# tail /var/log/cluster/corosync.log
[4577] node-2 corosyncerror   [MAIN  ] parse error in config: No interfaces defined
[4577] node-2 corosyncerror   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1414.
[root@node-2 ~]# vim /etc/hosts
#添加新的IP地址和主机名即可。
[root@node-2 ~]# systemctl restart corosync
[root@node-2 ~]# systemctl restart pacemaker3、Pacemaker服务起不来?
报错:[root@node-2 ~]# systemctl status pacemaker
  Active: deactivating (stop-sigterm) since Mon 2017-12-04 21:04:44 CST; 54s ago
Dec 04 21:04:44 node-2 pengine[4880]:  warning: Processing failed op stop for WebIP on node-2: not configured (6)
Dec 04 21:04:44 node-2 pengine[4880]:    error: Preventing WebIP from re-starting anywhere: operation stop faile...d' (6
解决:WebIP这个资源进程有问题,用cleanup清理掉进程即可。
[root@node-2 ~]# crm resource cleanup WebIP
crm(live)configure# delete WebIP   ##删除一个组或资源都行
crm(live)configure# commit
[root@node-2 ~]# systemctl status pacemaker4、删掉crm(live)configure# delete WebIP,依然报出WebIP	(ocf::heartbeat:IPaddr):	 ORPHANED FAILED node-2 (unmanaged)
解决:[root@node-2 ~]# crm resource cleaup WebIP5、node-1认为node-2不在线,node-2认为node-1不在线?
报错:[root@node-2 ~]# crm status
Online: [ node-2 ]
OFFLINE: [ node-1 ]
[root@node-1 ~]# crm status
Online: [ node-1 ]
OFFLINE: [ node-2 ]未解决:两节点环境中,无法实现仲裁,那么每个节点都认为他是DC
[root@node-1 ~]# time=`date |awk '{print $5}'`;ssh node-2 date -s "$time"  ##保证远程主机跟本机时间同步
[root@node-1 ~]# date ;ssh node-2 "date"
2017年 12月 04日 星期一 21:37:33 CST
2017年 12月 04日 星期一 21:37:33 CST[root@node-2 ~]# systemctl list-unit-files|grep ntp  ##开机保持NTP服务开启
ntpd.service                                  enabled 
[root@node-2 ~]# hwclock -w  ##将当前系统时间写入BIOS6、pacemaker服务有问题,报出配置文件格式有问题
[root@node-2 ~]# systemctl status pacemaker -l
Dec 04 21:52:35 node-2 cib[6776]:    error: Completed cib_replace operation for section 'all': Update does not conform to the configured schema
解决:corosync.conf配置文件都是一个关键字,然后一个空格,一个花括符,紧接着4个空格,但是在拷贝的时候格式发生了变化,所以最好不要scp,手动改
[root@node-2 corosync]# vim corosync.conf
quorum {
        	    provider: corosync_votequorum
                 two_node: 1
}