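Preface
With DVR (Distributed Virtual Routing) in neutron, east-west traffic and the north-south traffic of VMs bound to a floating IP no longer have to pass through the network node, which greatly reduces the load on it. Two kinds of traffic still go through the network node, however:
- North-south traffic of VMs without a floating IP (fip)
- Traffic of VMs that do have a fip, when the compute node has no external network (agent_mode = dvr_no_external)
Both kinds of traffic affect public network access for VMs, so the network node must be made highly available.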
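How it works
High availability of the network node relies on keepalived and the Virtual Router Redundancy Protocol (VRRP) to switch between master and backup routers. Normally the master router provides the network service, and keepalived on the master periodically sends heartbeat packets (VRRP advertisements) to the other VRRP routers, for example every 2 seconds with the advert_int value in the keepalived.conf shown later. If keepalived on a backup (slave) router receives no heartbeat from the master within a certain time, or the master enters the fault state, the backup router configures the master's IP addresses and related settings on its own interfaces and continues to provide service.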
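Notes
If there are several backup (slave) routers, keepalived chooses among them by the priority configured on each one; if the priorities are equal, the backup router with the highest IP address is chosen.
The network plane that carries the heartbeat packets depends on the configured tenant network type (tenant_network_types), for example vxlan.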
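The election rule in the notes above can be sketched in a few lines. This is only an illustration of "highest priority, then highest IP address"; the class name, the router labels and the sample addresses are assumptions made for the example, not part of keepalived or neutron.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class VrrpRouter:
    name: str                # hypothetical label for the example
    priority: int            # VRRP priority configured for this router
    primary_ip: IPv4Address  # address on the heartbeat (ha-) interface


def elect_master(candidates):
    """Pick the winner: highest priority first, highest IP address as tie-break."""
    if not candidates:
        raise ValueError("no candidate routers")
    return max(candidates, key=lambda r: (r.priority, r.primary_ip))


# Two backups with the same priority: the higher heartbeat address wins.
routers = [
    VrrpRouter("backup-a", 50, IPv4Address("169.254.192.3")),
    VrrpRouter("backup-b", 50, IPv4Address("169.254.192.11")),
]
print(elect_master(routers).name)
```

With equal priorities the tuple comparison falls through to the address, so backup-b is selected here.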
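Architecture diagram
As shown in the figure below, the master and backup routers are placed on two network nodes, and the heartbeat packets travel over a vxlan network (VNI 101).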
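Core configuration
Network node: /etc/neutron/neutron.conf
router_distributed = True
l3_ha = True (enable high availability for L3 routers)
l3_ha_net_cidr = 169.254.192.0/18 (CIDR of the heartbeat network)
max_l3_agents_per_router = 3
min_l3_agents_per_router = 2
Network node: /etc/neutron/l3_agent.ini
agent_mode = dvr_snat (this node provides SNAT for the north-south traffic of distributed routers)
ha_vrrp_auth_password = xxxxxx
ha_vrrp_health_check_interval = 5 (interval in seconds between runs of the keepalived health check script, which pings the external gateway; the VRRP heartbeat interval itself is set by ha_vrrp_advert_int)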
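A small sanity check for the neutron.conf settings above might look like the following sketch. It assumes the options live in the [DEFAULT] section, and the function name, the fallback values and the three checks are illustrative choices rather than anything neutron itself performs.

```python
import configparser
import ipaddress

# Settings copied from the text; the [DEFAULT] header is an assumption.
NEUTRON_CONF = """
[DEFAULT]
router_distributed = True
l3_ha = True
l3_ha_net_cidr = 169.254.192.0/18
max_l3_agents_per_router = 3
min_l3_agents_per_router = 2
"""


def check_ha_settings(conf_text):
    """Return a list of problems found; an empty list means none were found."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    opts = parser["DEFAULT"]
    problems = []

    if not opts.getboolean("l3_ha", fallback=False):
        problems.append("l3_ha is not enabled")

    # Fallbacks are placeholders for the example; special values are not modelled.
    max_agents = opts.getint("max_l3_agents_per_router", fallback=3)
    min_agents = opts.getint("min_l3_agents_per_router", fallback=2)
    if max_agents < min_agents:
        problems.append("max_l3_agents_per_router is below min_l3_agents_per_router")

    try:
        ipaddress.ip_network(opts.get("l3_ha_net_cidr", fallback=""))
    except ValueError:
        problems.append("l3_ha_net_cidr is not a valid CIDR")

    return problems


print(check_ha_settings(NEUTRON_CONF))
```

For the values shown in this article the list of problems is empty.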
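Experimental verification
Access the VM from the public network and observe what happens when the master router fails.
- Environment
Item | Value |
Network nodes | 2 |
network-server1 | Network node 1 |
network-server2 | Network node 2 |
VM fixed IP | 192.168.2.156 |
VM floating IP | 60.191.72.103 |
- Master router node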
[root@network-server1 ~]# ip netns exec snat-df301506-a68e-47b2-be3d-73a34b40b1bf ip a
...
132: ha-cdd81756-b0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
link/ether fa:16:3e:e7:22:20 brd ff:ff:ff:ff:ff:ff
inet 169.254.192.3/18 brd 169.254.255.255 scope global ha-cdd81756-b0
valid_lft forever preferred_lft forever
inet 169.254.0.1/24 scope global ha-cdd81756-b0
133: sg-5579abe9-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
link/ether fa:16:3e:1a:5b:f7 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.159/24 scope global sg-5579abe9-04
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe1a:5bf7/64 scope link nodad
valid_lft forever preferred_lft forever
134: qg-d807dcac-84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
link/ether fa:16:3e:54:be:f4 brd ff:ff:ff:ff:ff:ff
inet 60.191.72.103/32 scope global qg-d807dcac-84 (the VM's fip address)
valid_lft forever preferred_lft forever
inet 60.191.72.109/26 scope global qg-d807dcac-84 (the router's gateway address)
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe54:bef4/64 scope link
valid_lft forever preferred_lft forever
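Analysis:
The snat namespace is the path between the cloud-internal network and the external network, so it needs both an internal and an external interface.
192.168.2.159/24: the internal interface. For traffic to the external network, the internal network gateway (e.g. 192.168.2.1) receives the packet and forwards it to this address on the network node.
60.191.72.103/32: the VM's fip address
60.191.72.109/26: the SNAT address used when a VM has no fip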
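To make the address roles above concrete, the standard ipaddress module can confirm which subnet each address belongs to. This is just a sketch over the addresses quoted in this article; the variable names are made up for the example.

```python
import ipaddress

# Addresses taken from the `ip a` output above.
external_if = ipaddress.ip_interface("60.191.72.109/26")  # router gateway / SNAT address
fip = ipaddress.ip_address("60.191.72.103")               # VM floating IP
internal_if = ipaddress.ip_interface("192.168.2.159/24")  # sg- interface address
vm_fixed_ip = ipaddress.ip_address("192.168.2.156")       # VM fixed IP

print(external_if.network)                  # 60.191.72.64/26
print(fip in external_if.network)           # True
print(vm_fixed_ip in internal_if.network)   # True
```

The fip falls inside the external subnet of the qg- interface, and the VM's fixed IP shares the internal subnet with the sg- interface.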
- Backup (slave) router node
...
119: ha-c74f8508-af: <BROADCAST,MULTICAST> mtu 1450 qdisc noqueue state DOWN qlen 1000
link/ether fa:16:3e:bf:f1:5a brd ff:ff:ff:ff:ff:ff
inet 169.254.192.11/18 brd 169.254.255.255 scope global ha-c74f8508-af
valid_lft forever preferred_lft forever
120: sg-5579abe9-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
link/ether fa:16:3e:1a:5b:f7 brd ff:ff:ff:ff:ff:ff
121: qg-d807dcac-84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
link/ether fa:16:3e:54:be:f4 brd ff:ff:ff:ff:ff:ff
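Analysis:
In the snat namespace on the backup (slave) router node only the heartbeat IP 169.254.192.11 is configured, so this node currently provides no routing service.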
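One way to tell the two roles apart from the outputs above is the VRRP virtual address 169.254.0.1/24, which appears on the master's ha- interface but not on the backup's. The sketch below encodes that observation; the function name and the address lists are illustrative, and it is not how neutron itself reports router state.

```python
import ipaddress


def holds_vip(addresses, vip="169.254.0.1/24"):
    """Return True if the VRRP virtual address is among the interface addresses."""
    wanted = ipaddress.ip_interface(vip)
    return any(ipaddress.ip_interface(addr) == wanted for addr in addresses)


# Addresses of the ha- interfaces as shown in the two outputs above.
master_ha = ["169.254.192.3/18", "169.254.0.1/24"]
backup_ha = ["169.254.192.11/18"]

print(holds_vip(master_ha))
print(holds_vip(backup_ha))
```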
- keepalived processes
953896 ? S 0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=df301506-a68e-47b2-be3d-73a34b40b1bf --namespace=snat-df301506-a68e-47b2-be3d-73a34b40b1bf --conf_dir=/var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf --monitor_interface=ha-cdd81756-b0 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/df301506-a68e-47b2-be3d-73a34b40b1bf.monitor.pid --state_path=/var/lib/neutron --user=992 --group=989
1061032 ? Ss 0:04 keepalived -P -f /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/keepalived.conf -p /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid -r /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid-vrrp -D
1061033 ? S 0:19 \_ keepalived -P -f /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/keepalived.conf -p /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid -r /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid-vrrp -D
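Analysis:
keepalived starts a process that monitors this router. Its configuration file is /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/keepalived.conf, which defines the health check script, the priority, and the IP addresses, routes and other network settings to apply on failover.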
[root@network-server1 df301506-a68e-47b2-be3d-73a34b40b1bf]# cat keepalived.conf
global_defs {
notification_email_from neutron@openstack.local
router_id neutron
}
vrrp_script ha_health_check_1 {
script "/var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh"
interval 5
fall 2
rise 2
}
vrrp_instance VR_1 {
state BACKUP
interface ha-cdd81756-b0
virtual_router_id 1
priority 50
garp_master_delay 60
nopreempt
advert_int 2
authentication {
auth_type PASS
auth_pass xxxxxx
}
track_interface {
ha-cdd81756-b0
}
virtual_ipaddress {
169.254.0.1/24 dev ha-cdd81756-b0
}
virtual_ipaddress_excluded {
192.168.2.159/24 dev sg-5579abe9-04
60.191.72.103/32 dev qg-d807dcac-84
60.191.72.109/26 dev qg-d807dcac-84
fe80::f816:3eff:fe1a:5bf7/64 dev sg-5579abe9-04 scope link
fe80::f816:3eff:fe54:bef4/64 dev qg-d807dcac-84 scope link
}
virtual_routes {
0.0.0.0/0 via 60.191.72.65 dev qg-d807dcac-84
124.160.117.64/26 dev qg-d807dcac-84 scope link
}
track_script {
ha_health_check_1
}
}
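As a rough worked example, VRRPv2 (RFC 3768) defines the time a backup waits without advertisements before taking over as 3 × Advertisement_Interval + Skew_Time, where Skew_Time = (256 - Priority) / 256. The sketch below plugs in advert_int 2 and priority 50 from the configuration above. It assumes VRRPv2 timers apply here, the function name is illustrative, and keepalived's real timing can differ.

```python
def master_down_interval(advert_int, priority):
    """Seconds a backup waits without advertisements before becoming master."""
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


# advert_int 2 and priority 50, as in the keepalived.conf above.
print(master_down_interval(2, 50))  # 6.8046875
```

Under these assumptions a backup that simply stops hearing the master would wait a little under 7 seconds.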
- Health check script
The keepalived configuration file references a health check script:
[root@network-server1]# cat /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh
#!/bin/bash -eu
ip a | grep 192.168.2.159 || exit 0
ping -c 1 -w 1 60.191.72.65 1>/dev/null || exit 1
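Analysis:
ip a | grep 192.168.2.159 || exit 0 (if the address is found the script continues; if not, it runs exit 0)
ping -c 1 -w 1 60.191.72.65 1>/dev/null || exit 1 (if the external gateway cannot be pinged, it runs exit 1)
Taken together: the health check fails (exit 1) only when address 192.168.2.159 is configured on sg-5579abe9-04 in the snat namespace, which is the case on the master, and the external gateway cannot be pinged. On a backup router the address is absent, so the check always succeeds.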
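The decision logic of the script, together with the fall 2 / rise 2 settings of the vrrp_script block, can be modelled as below. This is a simplified sketch: the function and class names are hypothetical, and it assumes the script starts out in the healthy state, which keepalived may handle differently.

```python
def health_check_exit_code(has_internal_ip, gateway_reachable):
    """Mirror the two-line script: only an active router with a dead gateway fails."""
    if not has_internal_ip:
        return 0  # backup router: nothing to check
    return 0 if gateway_reachable else 1


class ScriptTracker:
    """Track consecutive results the way fall/rise thresholds are described."""

    def __init__(self, fall=2, rise=2):
        self.fall = fall
        self.rise = rise
        self.healthy = True  # assumption: start in the healthy state
        self.failures = 0
        self.successes = 0

    def record(self, exit_code):
        if exit_code == 0:
            self.failures = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.rise:
                self.healthy = True
        else:
            self.successes = 0
            self.failures += 1
            if self.healthy and self.failures >= self.fall:
                self.healthy = False
        return self.healthy


tracker = ScriptTracker()
# Master loses the gateway: the first failure is tolerated, the second marks it failed.
print(tracker.record(health_check_exit_code(True, False)))
print(tracker.record(health_check_exit_code(True, False)))
```

With an interval of 5 seconds and fall 2, a failed script is therefore reported roughly 5 seconds after the first failed run.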
- Triggering the fault
Remove the external network NIC from its bridge and check network connectivity:
[root@network-server1 ~]# ovs-vsctl del-port br-bond2 bond2
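Output of pinging the VM; three consecutive requests time out, an interruption of about 3 seconds
Reply from 60.191.72.103: bytes=32 time=12ms TTL=53
Reply from 60.191.72.103: bytes=32 time=5ms TTL=53
Request timed out.
Request timed out.
Request timed out.
Reply from 60.191.72.103: bytes=32 time=6ms TTL=53
Reply from 60.191.72.103: bytes=32 time=5ms TTL=53
- Master and backup routers after the fault
keepalived failover log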
May 25 15:03:55 network-server1 Keepalived_vrrp[982249]: /var/lib/neutron/ha_confs/e762fefa-ddc8-4d03-9f31-9f1c613d85a3/ha_check_script_1.sh exited with status 1
May 25 15:03:56 network-server1 Keepalived_vrrp[987523]: /var/lib/neutron/ha_confs/949a39f7-e5d0-4e65-80d7-8a304969ad7b/ha_check_script_1.sh exited with status 1
May 25 15:03:59 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1
May 25 15:04:00 network-server1 Keepalived_vrrp[982249]: /var/lib/neutron/ha_confs/e762fefa-ddc8-4d03-9f31-9f1c613d85a3/ha_check_script_1.sh exited with status 1
May 25 15:04:00 network-server1 Keepalived_vrrp[982249]: VRRP_Script(ha_health_check_1) failed
May 25 15:04:01 network-server1 Keepalived_vrrp[982249]: VRRP_Instance(VR_1) Entering FAULT STATE
May 25 15:04:01 network-server1 Keepalived_vrrp[982249]: VRRP_Instance(VR_1) Now in FAULT state
May 25 15:04:01 network-server1 Keepalived_vrrp[987523]: /var/lib/neutron/ha_confs/949a39f7-e5d0-4e65-80d7-8a304969ad7b/ha_check_script_1.sh exited with status 1
May 25 15:04:01 network-server1 Keepalived_vrrp[987523]: VRRP_Script(ha_health_check_1) failed
May 25 15:04:03 network-server1 Keepalived_vrrp[987523]: VRRP_Instance(VR_1) Entering FAULT STATE
May 25 15:04:03 network-server1 Keepalived_vrrp[987523]: VRRP_Instance(VR_1) Now in FAULT state
May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1
May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: VRRP_Script(ha_health_check_1) failed
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Entering FAULT STATE
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) removing protocol Virtual Routes
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) removing protocol VIPs.
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) removing protocol E-VIPs.
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Now in FAULT state
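Analysis:
The health check on the master router fails and the router enters the FAULT state; the backup router then takes over from the master and continues to provide service. For router df301506-a68e-47b2-be3d-73a34b40b1bf the script fails at 15:03:59 and again at 15:04:04, two consecutive failures as required by fall 2, and the instance is in the FAULT state at 15:04:05. The two other HA routers on this node, handled by Keepalived_vrrp processes 982249 and 987523, go through the same sequence.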
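The time from the first failed health check to the FAULT state can be read off the log with a short script. The sketch below uses the log lines of process 1061033 quoted above; the regular expression and function name are assumptions for this example, and it assumes all entries fall on the same day.

```python
import re
from datetime import datetime

# Log lines for Keepalived_vrrp[1061033], copied from the failover log above.
LOG = """\
May 25 15:03:59 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1
May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1
May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: VRRP_Script(ha_health_check_1) failed
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Entering FAULT STATE
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Now in FAULT state
"""

LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+Keepalived_vrrp\[(\d+)\]:\s+(.*)$"
)


def seconds_to_fault(log_text, pid):
    """Seconds between the first failed check and 'Now in FAULT state' for one pid."""
    first_failure = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(2) != pid:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(3)
        if first_failure is None and message.endswith("exited with status 1"):
            first_failure = stamp
        elif first_failure is not None and message.endswith("Now in FAULT state"):
            return (stamp - first_failure).total_seconds()
    return None


print(seconds_to_fault(LOG, "1061033"))  # 6.0
```

For this router the master needed about 6 seconds from the first failed check to release its addresses.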