简介:

1、MHA简介

MHA介绍

MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本DeNA公司youshimaton(现就职于Facebook公司)开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中,MHA能做到在0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用。

MHA组成

该软件由两部分组成:MHA Manager(管理节点)和MHA Node(数据节点)

MHA Manager可以单独部署在一台独立的机器上管理多个master-slave集群,也可以部署在一台slave节点上。

MHA Node运行在每台MySQL服务器上,MHA Manager会定时探测集群中的master节点,当master出现故障时,它可以自动将最新数据的slave提升为新的master,然后将所有其他的slave重新指向新的master。整个故障转移过程对应用程序完全透明。

在MHA自动故障切换过程中,MHA试图从宕机的主服务器上保存二进制日志,最大程度的保证数据的不丢失,但这并不总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失了最新的数据。使用MySQL 5.5的半同步复制,可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志,MHA可以将最新的二进制日志应用于其他所有的slave服务器上,因此可以保证所有节点的数据一致性。

目前MHA主要支持一主多从的架构,要搭建MHA,要求一个复制集群中必须最少有三台数据库服务器,一主二从,即一台充当master,一台充当备用master,另外一台充当从库,因为至少需要三台服务器。我们自己使用其实也可以使用1主1从,但是master主机宕机后无法切换,以及无法补全binlog。master的mysqld进程crash后,还是可以切换成功,以及补全binlog的。

MHA工作原理

(1)从宕机崩溃的master保存二进制日志事件(binlog events);

(2)识别含有最新更新的slave;

(3)应用差异的中继日志(relay log)到其他的slave;

(4)应用从master保存的二进制日志事件(binlog events);

(5)提升一个slave为新的master;

(6)使其他的slave连接新的master进行复制;


1.4 MHA工具

MHA软件由两部分组成,Manager工具包和Node工具包

Manager 节点工具包

masterha_check_ssh:MHA 依赖的 ssh 环境监测工具

masterha_check_repl:MYSQL 主从复制环境检测工具(slave状态、文件配置)

masterga_manager:MHA 服务主程序

masterha_check_status:MHA 运行状态探测工具

masterha_master_monitor:MYSQL master 节点可用性监测工具

masterha_master_swith:master 节点切换工具(自动或者手动)

masterha_conf_host:添加或删除配置的节点

masterha_stop:关闭 MHA 服务的工具


Node 节点工具包

save_binary_logs:保存和复制 master 的二进制日志

apply_diff_relay_logs:识别差异的中继日志事件并应用于其他slave

purge_relay_logs:清除中继日志(不会阻塞 SQL 线程)


自定义扩展

secondary_check_script:通过多条网络路由检测 master 的可用性;

master_ip_failover_script:更新 application 使用的 masterip;

report_script:发送报告;

init_conf_load_script:加载初始配置参数;

master_ip_online_change_script;更新 master 节点 ip 地址

2、MHA搭建

2.1 环境信息

环境名称

主机名称

Mysql版本

IP

端口

OS 版本

主库

master

Mysql8.0.30

172.16.134.24

3310

Centos 7.2

从库(MHA Manager)

Slave01

Mysql8.0.30

172.16.134.25

3310

Centos 7.2

从库

Slave02

Mysql8.0.30

172.16.134.26

3310

Centos 7.2


2.2 基础环境搭建

MHA 基础环境主从搭建见复制文档

创建监控用户(master)

Createuser 'root'@'172.16.134.%' identified by '123456';

grant all privileges on *.* to 'root'@'172.16.134.%'

flush privileges;


2.3 MHA安装

Manager 安装

yum install perl-DBD-MySQL

yum install perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes

rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm

rpm -ivh mha4mysql-manager-0.58-0.el7.centos.noarch.rpm


Node 安装

yum install perl-DBD-MySQL -y

rpm -ivh mha4mysql-node-0.58-0.el7.centos.noarch.rpm


安装完成后/bin 下会生产对应的工具

-rwxr-xr-x 1 root root 1995 Mar 23 2018 masterha_check_repl

-rwxr-xr-x 1 root root 1779 Mar 23 2018 masterha_check_ssh

-rwxr-xr-x 1 root root 1865 Mar 23 2018 masterha_check_status

-rwxr-xr-x 1 root root 3201 Mar 23 2018 masterha_conf_host

-rwxr-xr-x 1 root root 2517 Mar 23 2018 masterha_manager

-rwxr-xr-x 1 root root 2165 Mar 23 2018 masterha_master_monitor

-rwxr-xr-x 1 root root 2373 Mar 23 2018 masterha_master_switch

-rwxr-xr-x 1 root root 5172 Mar 23 2018 masterha_secondary_check

-rwxr-xr-x 1 root root 1739 Mar 23 2018 masterha_stop


-rwxr-xr-x 1 root root 7525 Mar 23 2018 save_binary_logs

-rwxr-xr-x 1 root root 8337 Mar 23 2018 purge_relay_logs

-rwxr-xr-x 1 root root 4807 Mar 23 2018 filter_mysqlbinlog

-rwxr-xr-x 1 root root 17639 Mar 23 2018 apply_diff_relay_logs


2.4 配置互信

cd ~

ssh-keygen -t rsa #一路回车

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 600 ~/.ssh/authorized_keys

在24机器上执行以下命令

ssh 172.16.134.24 cat ~/.ssh/id_rsa.pub >>~/.ssh/authorized_keys

ssh 172.16.134.25 cat ~/.ssh/id_rsa.pub >>~/.ssh/authorized_keys

ssh 172.16.134.26 cat ~/.ssh/id_rsa.pub >>~/.ssh/authorized_keys

scp ~/.ssh/authorized_keys 172.16.134.25:~/.ssh/

scp ~/.ssh/authorized_keys 172.16.134.26:~/.ssh/


通过sshroot@172.16.134.25命令直接登录,不需要输入密码


2.5 配置MHA

1、创建MHA工作目录,并创建相关配置文件(软件包解压后目录有样例配置文件)


MHA Manage 节点上

#创建MHA配置文件目录

mkdir /etc/mha

# 创建MHA脚本目录

mkdir /etc/mha/scripts

# 创建MHA日志目录

mkdir /var/log/mha/

# 创建日志目录(app1是自定义的,因为manager可以同时管理多个集群,所以这里取了这个名)

mkdir -p /var/log/mha/app1

# 创建日志文件

touch /var/log/mha/app1/manager.log


cp /home/soft/mha4mysql-manager-0.58/samples/conf/* /etc/mha/


配置主从配置文件,candidate_master=1的意思是是否能提升为master,可以不加。

Vi /etc/mha/app1.cnf

[server default]

manager_workdir=/var/log/mha/app1

manager_log=/var/log/mha/app1/manager.log


[server1]

hostname=172.16.134.24

port=3310


[server2]

hostname=172.16.134.25

port=3310

candidate_master=1

check_repl_delay=0


[server3]

hostname=172.16.134.26

port=3310

no_master=1


配置主要的配置文件

[server default]

user=root

password=123456

repl_user=repl

repl_password=rep1123

ssh_user=root

ping_interval=1

master_binlog_dir= /home/mysql3310/binlog/

manager_workdir=/var/log/mha/app1

manager_log=/var/log/mha/app1/manager.log

master_ip_failover_script="/etc/mha/scripts/master_ip_failover"

master_ip_online_change_script="/etc/mha/scripts/master_ip_online_change"

report_script="/etc/mha/scripts/send_report"

remote_workdir=/tmp

secondary_check_script= /usr/local/bin/masterha_secondary_check -s 172.16.134.25 -s 172.16.134.26

shutdown_script=""

[server1]

hostname=172.16.134.24

port=3310


[server2]

hostname=172.16.134.25

port=3310

candidate_master=1

check_repl_delay=0


[server3]

hostname=172.16.134.26

port=3310

no_master=1


脚本配置(vip)

#!/usr/bin/env perl


# Copyright (C) 2011 DeNA Co.,Ltd.

## Note: This is a sample script and is not complete. Modify the script based on your environment.


use strict;

use warnings FATAL => 'all';


use Getopt::Long;

use MHA::DBHelper;


my (

$command, $ssh_user, $orig_master_host,

$orig_master_ip, $orig_master_port, $new_master_host,

$new_master_ip, $new_master_port, $new_master_user,

$new_master_password

);


my $vip = '172.16.134.20/25';

my $key = '0';

my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip netmask 255.255.255.128";

my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down";


GetOptions(

'command=s' => \$command,

'ssh_user=s' => \$ssh_user,

'orig_master_host=s' => \$orig_master_host,

'orig_master_ip=s' => \$orig_master_ip,

'orig_master_port=i' => \$orig_master_port,

'new_master_host=s' => \$new_master_host,

'new_master_ip=s' => \$new_master_ip,

'new_master_port=i' => \$new_master_port,

'new_master_user=s' => \$new_master_user,

'new_master_password=s' => \$new_master_password,

);


exit &main();


sub main {

if ( $command eq "stop" || $command eq "stopssh" ) {


my $exit_code = 1;

eval {

print "Disabling the VIP on old master: $orig_master_host \n";

&stop_vip();

$exit_code = 0;

};

if ($@) {

warn "Got Error: $@\n";

exit $exit_code;

}

exit $exit_code;

}

elsif ( $command eq "start" ) {


my $exit_code = 10;

eval {

print "Enabling the VIP - $vip on the new master - $new_master_host \n";

&start_vip();

$exit_code = 0;

};

if ($@) {

warn $@;

exit $exit_code;

}

exit $exit_code;

}

elsif ( $command eq "status" ) {

print "Checking the Status of the script.. OK \n";

exit 0;

}

else {

&usage();

exit 1;

}

}


sub start_vip() {

`ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;

}

sub stop_vip() {

return 0 unless ($ssh_user);

`ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;

}


sub usage {

print

"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=

port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";

}


2.6 信任检测

root@ebsproddb.ys:/etc/mha/scripts$ masterha_check_ssh --conf=/etc/mha/app1.cnf

Fri Mar 10 18:25:20 2023 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.

Fri Mar 10 18:25:20 2023 - [info] Reading application default configuration from /etc/mha/app1.cnf..

Fri Mar 10 18:25:20 2023 - [info] Reading server configuration from /etc/mha/app1.cnf..

Fri Mar 10 18:25:20 2023 - [info] Starting SSH connection tests..

Fri Mar 10 18:25:21 2023 - [debug]

Fri Mar 10 18:25:20 2023 - [debug] Connecting via SSH from root@172.16.134.24(172.16.134.24:22) to root@172.16.134.25(172.16.134.25:22)..

Fri Mar 10 18:25:20 2023 - [debug] ok.

Fri Mar 10 18:25:20 2023 - [debug] Connecting via SSH from root@172.16.134.24(172.16.134.24:22) to root@172.16.134.26(172.16.134.26:22)..

Fri Mar 10 18:25:21 2023 - [debug] ok.

Fri Mar 10 18:25:22 2023 - [debug]

Fri Mar 10 18:25:21 2023 - [debug] Connecting via SSH from root@172.16.134.25(172.16.134.25:22) to root@172.16.134.24(172.16.134.24:22)..

Fri Mar 10 18:25:21 2023 - [debug] ok.

Fri Mar 10 18:25:21 2023 - [debug] Connecting via SSH from root@172.16.134.25(172.16.134.25:22) to root@172.16.134.26(172.16.134.26:22)..

Fri Mar 10 18:25:21 2023 - [debug] ok.

Fri Mar 10 18:25:23 2023 - [debug]

Fri Mar 10 18:25:21 2023 - [debug] Connecting via SSH from root@172.16.134.26(172.16.134.26:22) to root@172.16.134.24(172.16.134.24:22)..

Fri Mar 10 18:25:21 2023 - [debug] ok.

Fri Mar 10 18:25:21 2023 - [debug] Connecting via SSH from root@172.16.134.26(172.16.134.26:22) to root@172.16.134.25(172.16.134.25:22)..

Fri Mar 10 18:25:22 2023 - [debug] ok.

Fri Mar 10 18:25:23 2023 - [info] All SSH connection tests passed successfully

检查成功

2.7 检查主从复制是否成功

root@ebsproddb.ys:/etc/mha$ masterha_check_repl --conf=/etc/mha/masterha_default.cnf

Fri Mar 10 19:18:34 2023 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.

Fri Mar 10 19:18:34 2023 - [info] Reading application default configuration from /etc/mha/masterha_default.cnf..

Fri Mar 10 19:18:34 2023 - [info] Reading server configuration from /etc/mha/masterha_default.cnf..

Fri Mar 10 19:18:34 2023 - [info] MHA::MasterMonitor version 0.58.

Fri Mar 10 19:18:35 2023 - [info] GTID failover mode = 1

Fri Mar 10 19:18:35 2023 - [info] Dead Servers:

Fri Mar 10 19:18:35 2023 - [info] Alive Servers:

Fri Mar 10 19:18:35 2023 - [info] 172.16.134.25(172.16.134.25:3310)

Fri Mar 10 19:18:35 2023 - [info] 172.16.134.26(172.16.134.26:3310)

Fri Mar 10 19:18:35 2023 - [info] Alive Slaves:

Fri Mar 10 19:18:35 2023 - [info] 172.16.134.26(172.16.134.26:3310) Version=8.0.30 (oldest major version between slaves) log-bin:enabled

Fri Mar 10 19:18:35 2023 - [info] GTID ON

Fri Mar 10 19:18:35 2023 - [info] Replicating from 172.16.134.25(172.16.134.25:3310)

Fri Mar 10 19:18:35 2023 - [info] Current Alive Master: 172.16.134.25(172.16.134.25:3310)

Fri Mar 10 19:18:35 2023 - [info] Checking slave configurations..

Fri Mar 10 19:18:35 2023 - [info] Checking replication filtering settings..

Fri Mar 10 19:18:35 2023 - [info] binlog_do_db= , binlog_ignore_db=

Fri Mar 10 19:18:35 2023 - [info] Replication filtering check ok.

Fri Mar 10 19:18:35 2023 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.

Fri Mar 10 19:18:35 2023 - [info] Checking SSH publickey authentication settings on the current master..

Fri Mar 10 19:18:35 2023 - [info] HealthCheck: SSH to 172.16.134.25 is reachable.

Fri Mar 10 19:18:35 2023 - [info]

172.16.134.25(172.16.134.25:3310) (current master)

+--172.16.134.26(172.16.134.26:3310)


Fri Mar 10 19:18:35 2023 - [info] Checking replication health on 172.16.134.26..

Fri Mar 10 19:18:35 2023 - [info] ok.

Fri Mar 10 19:18:35 2023 - [info] Checking master_ip_failover_script status:

Fri Mar 10 19:18:35 2023 - [info] /etc/mha/scripts/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.134.25 --orig_master_ip=172.16.134.25 --orig_master_port=3310

Fri Mar 10 19:18:35 2023 - [info] OK.

Fri Mar 10 19:18:35 2023 - [warning] shutdown_script is not defined.

Fri Mar 10 19:18:35 2023 - [info] Got exit code 0 (Not master dead).


MySQL Replication Health is OK.


2.8 启动监控进程

nohup masterha_manager --conf=/etc/mha/masterha_default.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/mha/app1/manager.log &


查看状态

root@ebsproddb.ys:/etc/mha$ masterha_check_status --conf=/etc/mha/masterha_default.cnf

masterha_default (pid:14977) is running(0:PING_OK), master:172.16.134.24


参数说明:

--remove_dead_master_conf 该参数代表当发生主从切换后,老的主库的ip将会从配置文件中移除

--manger_log 日志存放位置

--ignore_last_failover 在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件,默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件,为了方便,这里设置为--ignore_last_failover


停止MHA监控

masterha_stop --conf=/etc/mha/masterha_default.cnf


3、故障切换演练

3.1 failover切换演练

场景一、自动failover,主从正常,从IO,SQL进程down掉,主库down掉


  1. slave01停止IO、SQL线程
  2. 模拟主库down机

3.1.1 slave01停止复制

Mysql>stop replica;


3.1.2 主库down机

Pkill-9 mysqld



3.1.3 观察maanger日志

Master 172.16.134.24(172.16.134.24:3310) is down!


Check MHA Manager logs at ebsproddb.ys:/var/log/mha/app1/manager.log for details.


Started automated(non-interactive) failover.

Invalidated master IP address on 172.16.134.24(172.16.134.24:3310)

None of existing slaves matches as a new master. Maybe preferred node is misconfigured or all slaves are too far behind.

Got Error so couldn't continue failover from here.

Tue Mar 14 10:56:53 2023 - [info] Sending mail..

Unknown option: conf

tail: manager.log: file truncated

Thu Mar 16 19:32:14 2023 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.

Thu Mar 16 19:32:14 2023 - [info] Reading application default configuration from /etc/mha/masterha_default.cnf..

Thu Mar 16 19:32:14 2023 - [info] Reading server configuration from /etc/mha/masterha_default.cnf..

Thu Mar 16 19:32:14 2023 - [info] MHA::MasterMonitor version 0.58.

Thu Mar 16 19:32:15 2023 - [info] GTID failover mode = 1

----- Failover Report -----

masterha_default: MySQL Master failover 172.16.134.24(172.16.134.24:3310) to 172.16.134.25(172.16.134.25:3310) succeeded


Master 172.16.134.24(172.16.134.24:3310) is down!


Check MHA Manager logs at ebsproddb.ys:/var/log/mha/app1/manager.log for details.


Started automated(non-interactive) failover.

Invalidated master IP address on 172.16.134.24(172.16.134.24:3310)

Selected 172.16.134.25(172.16.134.25:3310) as a new master.

172.16.134.25(172.16.134.25:3310): OK: Applying all logs succeeded.

172.16.134.25(172.16.134.25:3310): OK: Activated master IP address.

172.16.134.26(172.16.134.26:3310): OK: Slave started, replicating from 172.16.134.25(172.16.134.25:3310)

172.16.134.25(172.16.134.25:3310): Resetting slave info succeeded.

Master failover to 172.16.134.25(172.16.134.25:3310) completed successfully.




场景二、自动failover,备主延迟>100M,从库设置no_maseter


  1. slave01停止IO线程
  2. 主库开启压力测试数据
  3. 开启slave01 IO线程

4、模拟主库down机

3.1.3 slave01停止复制

Mysql>stop slave io_thread;


3.1.2 主库创建测试数据

mysqlslap --user=root --password=123456 -h 172.16.134.24 -P 3310 --auto-generate-sql --auto-generate-sql-load-type=write --concurrency=128 --number-of-queries=1000000 --create-schema=test03


3.1.3 slave01开启复制

Slave01

Mysql>start slave io_thread;


3.1.4 主库down机

Pkill-9 mysqld


3.1.6 观察maanger日志

Tue Mar 14 10:56:48 2023 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)

Tue Mar 14 10:56:48 2023 - [info] Executing secondary network check script: /bin/masterha_secondary_check -s 172.16.134.25 -s 172.16.134.26 --user=root --master_host=172.16.134.24 --master_ip=172.16.134.24 --master_port=3310 --master_user=root --master_password=123456 --ping_type=SELECT

Tue Mar 14 10:56:48 2023 - [info] Executing SSH check script: exit 0

Tue Mar 14 10:56:48 2023 - [info] HealthCheck: SSH to 172.16.134.24 is reachable.

Monitoring server 172.16.134.25 is reachable, Master is not reachable from 172.16.134.25. OK.

Monitoring server 172.16.134.26 is reachable, Master is not reachable from 172.16.134.26. OK.

Tue Mar 14 10:56:48 2023 - [info] Master is not reachable from all other monitoring servers. Failover should start.

----- Failover Report -----

masterha_default: MySQL Master failover 172.16.134.24(172.16.134.24:3310)


Master 172.16.134.24(172.16.134.24:3310) is down!

Check MHA Manager logs at ebsproddb.ys:/var/log/mha/app1/manager.log for details.

Started automated(non-interactive) failover.

Invalidated master IP address on 172.16.134.24(172.16.134.24:3310)

None of existing slaves matches as a new master. Maybe preferred node is misconfigured or all slaves are too far behind.

Got Error so couldn't continue failover from here.

Tue Mar 14 10:56:53 2023 - [info] Sending mail..

Unknown option: conf

结论:自动failover 失败,主从延迟>100M,另外从添加no_master,找不到新maseter,切换失败。



场景三、自动failover,备主延迟<100M


  1. slave01停止IO线程
  2. 主库开启压力测试数据
  3. 开启slave01 IO线程
  4. 模拟主库down机


结论:自动failover成功,备主延迟<100M,切换到备主成功。



场景四、自动failover, 备主延迟>100M,从库未设置no_maseter


  1. slave01停止IO线程
  2. 主库开启压力测试数据
  3. 开启slave01 IO线程

4、 模拟主库down机


结论:自动failover成功,备主延迟>100M,切换到从库成功



3.2 手动切换演练

手动failover,这种场景意味着在业务上没有启用MHA自动切换功能,当主服务器故障时,人工手动调用MHA来进行故障切换操作,具体命令如下:


3.2.1检测未开启MHA

masterha_check_status --conf=/etc/mha/masterha_default.cnf

masterha_default is stopped(2:NOT_RUNNING).


3.2.2 模拟主库down机

pkill -9 mysqld


注意:如果,MHA manager检测到没有dead的server,将报错,并结束failover

Thu Mar 16 20:05:47 2023 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln188] None of server is dead. Stop failover.

Thu Mar 16 20:05:47 2023 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /bin/masterha_master_switch line 53.


3.2.3手工切换MHA

masterha_master_switch --master_state=dead --conf=/etc/mha/masterha_default.cnf --dead_master_host=172.16.134.24 --dead_master_port=3310 --new_master_host=172.16.134.25 --new_master_port=3310 --ignore_last_failover


输出信息询问是否进行切换yes/no

3.2.4 查看切换日志

Thu Mar 16 20:14:48 2023 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.

Thu Mar 16 20:14:48 2023 - [info] Reading application default configuration from /etc/mha/masterha_default.cnf..

Thu Mar 16 20:14:48 2023 - [info] Reading server configuration from /etc/mha/masterha_default.cnf..

Thu Mar 16 20:14:48 2023 - [info] MHA::MasterFailover version 0.58.

Thu Mar 16 20:14:48 2023 - [info] Starting master failover.

Thu Mar 16 20:14:48 2023 - [info]

Thu Mar 16 20:14:48 2023 - [info] * Phase 1: Configuration Check Phase..

Thu Mar 16 20:14:48 2023 - [info]

----- Failover Report -----

masterha_default: MySQL Master failover 172.16.134.24(172.16.134.24:3310) to 172.16.134.25(172.16.134.25:3310) succeeded


Master 172.16.134.24(172.16.134.24:3310) is down!


Check MHA Manager logs at ebsproddb.ys for details.


Started manual(interactive) failover.

Invalidated master IP address on 172.16.134.24(172.16.134.24:3310)

Selected 172.16.134.25(172.16.134.25:3310) as a new master.

172.16.134.25(172.16.134.25:3310): OK: Applying all logs succeeded.

172.16.134.25(172.16.134.25:3310): OK: Activated master IP address.

172.16.134.26(172.16.134.26:3310): OK: Slave started, replicating from 172.16.134.25(172.16.134.25:3310)

172.16.134.25(172.16.134.25:3310): Resetting slave info succeeded.

Master failover to 172.16.134.25(172.16.134.25:3310) completed successfully.



3.3 在线切换演练

许多情况下,需要将现有的主服务器迁移到另外一台服务器上。比如主服务器硬件故障,RAID 控制卡需要重建,将主服务器移到性能更好的服务器上等等。维护主服务器引起性能下降, 导致停机时间至少无法写入数据。 另外, 阻塞或杀掉当前运行的会话会导致主主之间数据不一致的问题发生。 MHA 提供快速切换和优雅的阻塞写入,这个切换过程只需要 0.5-2s 的时间,这段时间内数据是无法写入的。在很多情况下,0.5-2s 的阻塞写入是可以接受的。因此切换主服务器不需要计划分配维护时间窗口。


MHA在线切换的大概过程:

1.检测复制设置和确定当前主服务器

2.确定新的主服务器

3.阻塞写入到当前主服务器

4.等待所有从服务器赶上复制

5.授予写入到新的主服务器

6.重新设置从服务器


注意,在线切换的时候应用架构需要考虑以下两个问题:


1.自动识别master和slave的问题(master的机器可能会切换),如果采用了vip的方式,基本可以解决这个问题。

2.负载均衡的问题(可以定义大概的读写比例,每台机器可承担的负载比例,当有机器离开集群时,需要考虑这个问题)


为了保证数据完全一致性,在最快的时间内完成切换,MHA的在线切换必须满足以下条件才会切换成功,否则会切换失败。

1.所有slave的IO线程都在运行

2.所有slave的SQL线程都在运行

3.所有的show slave status的输出中Seconds_Behind_Master参数小于或者等于running_updates_limit秒,如果在切换过程中不指定running_updates_limit,那么默认情况下running_updates_limit为1秒。

4.在master端,通过show processlist输出,没有一个更新花费的时间大于running_updates_limit秒。


检测MHA状态

masterha_check_status --conf=/etc/mha/masterha_default.cnf

masterha_default is stopped(2:NOT_RUNNING).


在线切换


masterha_master_switch --conf=/etc/mha/masterha_default.cnf --master_state=alive --new_master_host=172.16.134.25 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000


查看切换日志

masterha_master_switch --conf=/etc/mha/masterha_default.cnf --master_state=alive --new_master_host=172.16.134.25 --new_master_port=3310 --orig_master_is_new_slave --running_updates_limit=10000

Mon Mar 20 19:14:23 2023 - [info] MHA::MasterRotate version 0.58.

Mon Mar 20 19:14:23 2023 - [info] Starting online master switch..

Mon Mar 20 19:14:23 2023 - [info]

Mon Mar 20 19:14:23 2023 - [info] * Phase 1: Configuration Check Phase..

Mon Mar 20 19:14:23 2023 - [info]

Mon Mar 20 19:16:37 2023 - [info] -- Slave switch on host 172.16.134.26(172.16.134.26:3310) succeeded.

Mon Mar 20 19:16:37 2023 - [info] Unlocking all tables on the orig master:

Mon Mar 20 19:16:37 2023 - [info] Executing UNLOCK TABLES..

Mon Mar 20 19:16:37 2023 - [info] ok.

Mon Mar 20 19:16:37 2023 - [info] Starting orig master as a new slave..

Mon Mar 20 19:16:37 2023 - [info] Resetting slave 172.16.134.24(172.16.134.24:3310) and starting replication from the new master 172.16.134.25(172.16.134.25:3310)..

Mon Mar 20 19:16:37 2023 - [info] Executed CHANGE MASTER.

Mon Mar 20 19:16:37 2023 - [info] Slave started.

Mon Mar 20 19:16:37 2023 - [info] All new slave servers switched successfully.

Mon Mar 20 19:16:37 2023 - [info]

Mon Mar 20 19:16:37 2023 - [info] * Phase 5: New master cleanup phase..

Mon Mar 20 19:16:37 2023 - [info]

Mon Mar 20 19:16:37 2023 - [info] 172.16.134.25: Resetting slave info succeeded.

Mon Mar 20 19:16:37 2023 - [info] Switching master to 172.16.134.25(172.16.134.25:3310) completed successfully.

You have mail in /var/spool/mail/root

切换成功