Preface

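With DVR (Distributed Virtual Routing) enabled in Neutron, east-west traffic and the north-south traffic of VMs with a floating IP no longer have to pass through the network node, which greatly reduces its load. Two kinds of traffic still go through the network node, however:

  1. North-south traffic of VMs without a floating IP (FIP)
  2. Traffic of VMs that have a FIP but run on a compute node with no external network (agent_mode = dvr_no_external)

Both kinds affect the VMs' public network access, so the network node must be made highly available.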

How it works

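High availability of the network node is implemented with keepalived and the Virtual Router Redundancy Protocol (VRRP), which switch service between a master router and its backup routers. Normally the master router provides the network service, and keepalived on the master periodically sends heartbeat packets (VRRP advertisements) to the other VRRP routers; the period is advert_int, which is 2 seconds in the keepalived.conf of this environment. When keepalived on a backup router receives no heartbeat from the master within a certain time, or the master enters the fault state, the backup router configures the master's IP addresses and related settings on its own interfaces and continues to provide service.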
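The "certain time" a backup waits can be estimated from the VRRP timers. The sketch below assumes VRRPv2 timers as defined in RFC 3768 (Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time); the function name is illustrative only, and the values 2 and 50 are the advert_int and priority from the keepalived.conf of this environment.

```python
def master_down_interval(advert_int: float, priority: int) -> float:
    """Seconds a backup waits without advertisements before taking over.

    Assumes VRRPv2 (RFC 3768):
      Skew_Time            = (256 - Priority) / 256
      Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    """
    if not 1 <= priority <= 254:
        raise ValueError("a backup router's priority must be in 1..254")
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


if __name__ == "__main__":
    # advert_int 2 and priority 50, as in the keepalived.conf of this article
    print(master_down_interval(2, 50))
```

A higher priority gives a smaller skew time, which is why the highest-priority backup is the first to notice that the master is gone.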

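Notes
If there are several backup routers, keepalived chooses among them by the priority configured on each; if the priorities are equal, the backup router with the highest IP address is chosen.
The network plane that carries the heartbeat packets depends on the configured tenant network type (tenant_network_types), for example vxlan.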
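The selection rule in the notes above can be sketched as a comparison on (priority, IP address). This is a simplified model, not keepalived's implementation: the BackupRouter class and elect_new_master function are hypothetical names, and network-server3 with its address is an invented third node used only to show the tie-break.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class BackupRouter:
    name: str
    priority: int
    ha_ip: IPv4Address  # address on the heartbeat (ha-) interface


def elect_new_master(backups):
    """Pick the backup with the highest priority; break ties by highest IP."""
    return max(backups, key=lambda r: (r.priority, r.ha_ip))


if __name__ == "__main__":
    candidates = [
        BackupRouter("network-server2", 50, IPv4Address("169.254.192.11")),
        # hypothetical extra backup with the same priority but a lower address
        BackupRouter("network-server3", 50, IPv4Address("169.254.192.7")),
    ]
    print(elect_new_master(candidates).name)
```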

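Architecture diagram

As shown in the figure below, the master and backup routers are placed on two network nodes, and the heartbeat packets travel over a vxlan network (VNI 101).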

[Figure: architecture diagram; image not available]

Core configuration

Network node, /etc/neutron/neutron.conf:

router_distributed = True
# enable high availability for L3 routers
l3_ha = True
# CIDR of the heartbeat (HA) network
l3_ha_net_cidr = 169.254.192.0/18
max_l3_agents_per_router = 3
min_l3_agents_per_router = 2

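Network node, /etc/neutron/l3_agent.ini:

# act as the SNAT node for north-south traffic of distributed routers
agent_mode = dvr_snat
ha_vrrp_auth_password = xxxxxx
# interval, in seconds, between runs of the keepalived health-check script
ha_vrrp_health_check_interval = 5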
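To get a feel for the l3_ha_net_cidr value, the following sketch uses Python's standard ipaddress module to show the size of the heartbeat network and to check which of the addresses seen later in this article fall inside it. HA_NET and in_ha_net are illustrative names, not part of Neutron.

```python
import ipaddress

# l3_ha_net_cidr from neutron.conf
HA_NET = ipaddress.ip_network("169.254.192.0/18")


def in_ha_net(address: str) -> bool:
    """Return True if the address belongs to the heartbeat network."""
    return ipaddress.ip_address(address) in HA_NET


if __name__ == "__main__":
    print(HA_NET.num_addresses)      # size of the /18
    print(HA_NET.broadcast_address)  # matches the "brd" value on the ha- interfaces
    for addr in ("169.254.192.3", "169.254.192.11", "169.254.0.1"):
        print(addr, in_ha_net(addr))
```

The two ha- interface addresses are inside the /18, while the VRRP virtual address 169.254.0.1/24 lies outside it.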

Experimental verification

Access the VM from the public network and verify what happens when the master router fails.

  • Environment

Item              Description
Network nodes     2
network-server1   network node 1
network-server2   network node 2
VM fixed IP       192.168.2.156
VM floating IP    60.191.72.103

  • Master router node
[root@network-server1 ~]# ip netns exec snat-df301506-a68e-47b2-be3d-73a34b40b1bf ip a
...
132: ha-cdd81756-b0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
   link/ether fa:16:3e:e7:22:20 brd ff:ff:ff:ff:ff:ff
   inet 169.254.192.3/18 brd 169.254.255.255 scope global ha-cdd81756-b0
      valid_lft forever preferred_lft forever
   inet 169.254.0.1/24 scope global ha-cdd81756-b0
133: sg-5579abe9-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
   link/ether fa:16:3e:1a:5b:f7 brd ff:ff:ff:ff:ff:ff
   inet 192.168.2.159/24 scope global sg-5579abe9-04 
      valid_lft forever preferred_lft forever
   inet6 fe80::f816:3eff:fe1a:5bf7/64 scope link nodad 
      valid_lft forever preferred_lft forever
134: qg-d807dcac-84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
   link/ether fa:16:3e:54:be:f4 brd ff:ff:ff:ff:ff:ff
   inet 60.191.72.103/32 scope global qg-d807dcac-84 (the VM's FIP address)
      valid_lft forever preferred_lft forever
   inet 60.191.72.109/26 scope global qg-d807dcac-84 (the router's gateway address)
      valid_lft forever preferred_lft forever
   inet6 fe80::f816:3eff:fe54:bef4/64 scope link 
      valid_lft forever preferred_lft forever

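Analysis:
The snat namespace is the path between the cloud-internal network and the external network, so it needs both an internal and an external interface.
192.168.2.159/24: the internal interface. For traffic bound for the external network, the internal network's gateway (for example 192.168.2.1) forwards the packets it receives to this address on the network node.
60.191.72.103/32: the VM's FIP address
60.191.72.109/26: the SNAT address used when a VM has no FIP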

  • Backup router node
...
119: ha-c74f8508-af: <BROADCAST,MULTICAST> mtu 1450 qdisc noqueue state DOWN qlen 1000
   link/ether fa:16:3e:bf:f1:5a brd ff:ff:ff:ff:ff:ff
   inet 169.254.192.11/18 brd 169.254.255.255 scope global ha-c74f8508-af
      valid_lft forever preferred_lft forever
120: sg-5579abe9-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
   link/ether fa:16:3e:1a:5b:f7 brd ff:ff:ff:ff:ff:ff
121: qg-d807dcac-84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
   link/ether fa:16:3e:54:be:f4 brd ff:ff:ff:ff:ff:ff

Analysis:
In the backup router node's snat namespace only the heartbeat IP 169.254.192.11 is configured, so this node does not currently provide routing service.
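The difference between the two outputs suggests a simple check for which node is currently master: look for the VRRP virtual address 169.254.0.1/24, which is present on the master's ha- interface and absent on the backup. The sketch below works on captured `ip a` text; holds_vip is a hypothetical helper, and the sample strings are excerpts of the outputs above.

```python
def holds_vip(ip_addr_output: str, vip: str = "169.254.0.1/24") -> bool:
    """Return True if the `ip a` output contains an `inet <vip>` entry."""
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "inet" and fields[1] == vip:
            return True
    return False


MASTER_SAMPLE = """\
132: ha-cdd81756-b0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
   inet 169.254.192.3/18 brd 169.254.255.255 scope global ha-cdd81756-b0
   inet 169.254.0.1/24 scope global ha-cdd81756-b0
"""

BACKUP_SAMPLE = """\
119: ha-c74f8508-af: <BROADCAST,MULTICAST> mtu 1450 qdisc noqueue state DOWN qlen 1000
   inet 169.254.192.11/18 brd 169.254.255.255 scope global ha-c74f8508-af
"""

if __name__ == "__main__":
    print("network-server1 is master:", holds_vip(MASTER_SAMPLE))
    print("network-server2 is master:", holds_vip(BACKUP_SAMPLE))
```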

  • keepalived processes
953896 ?        S      0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=df301506-a68e-47b2-be3d-73a34b40b1bf --namespace=snat-df301506-a68e-47b2-be3d-73a34b40b1bf --conf_dir=/var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf --monitor_interface=ha-cdd81756-b0 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/df301506-a68e-47b2-be3d-73a34b40b1bf.monitor.pid --state_path=/var/lib/neutron --user=992 --group=989
1061032 ?        Ss     0:04 keepalived -P -f /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/keepalived.conf -p /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid -r /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid-vrrp -D
1061033 ?        S      0:19  \_ keepalived -P -f /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/keepalived.conf -p /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid -r /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf.pid-vrrp -D

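Analysis:
keepalived starts a process that monitors this router. Its configuration file is /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/keepalived.conf, which specifies the health-check script, the priority, and the IPs, routes and other network settings to configure on failover.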

[root@network-server1 df301506-a68e-47b2-be3d-73a34b40b1bf]# cat keepalived.conf 
global_defs {
    notification_email_from neutron@openstack.local
    router_id neutron
}

vrrp_script ha_health_check_1 {
    script "/var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh"
    interval 5
    fall 2
    rise 2
}

vrrp_instance VR_1 {
    state BACKUP
    interface ha-cdd81756-b0
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    authentication {
        auth_type PASS
        auth_pass xxxxxx
    }
    track_interface {
        ha-cdd81756-b0
    }
    virtual_ipaddress {
        169.254.0.1/24 dev ha-cdd81756-b0
    }
    virtual_ipaddress_excluded {
        192.168.2.159/24 dev sg-5579abe9-04
        60.191.72.103/32 dev qg-d807dcac-84
        60.191.72.109/26 dev qg-d807dcac-84
        fe80::f816:3eff:fe1a:5bf7/64 dev sg-5579abe9-04 scope link
        fe80::f816:3eff:fe54:bef4/64 dev qg-d807dcac-84 scope link
    }
    virtual_routes {
        0.0.0.0/0 via 60.191.72.65 dev qg-d807dcac-84
        124.160.117.64/26 dev qg-d807dcac-84 scope link
    }
    track_script {
        ha_health_check_1
    }
}
  • Health-check script

The keepalived configuration file references a health-check script:

[root@network-server1]# cat /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh
#!/bin/bash -eu
ip a | grep 192.168.2.159 || exit 0
ping -c 1 -w 1 60.191.72.65 1>/dev/null || exit 1

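Analysis:
ip a | grep 192.168.2.159 || exit 0 (if the address is found, the script continues; if it is not found, the script exits with 0, so a node that does not currently hold the address, such as a backup, is reported as healthy)
ping -c 1 -w 1 60.191.72.65 1>/dev/null || exit 1 (if the external gateway cannot be pinged, the script exits with 1)
Taken together: when the sg-5579abe9-04 interface in the snat namespace holds the address 192.168.2.159 but the external gateway cannot be pinged, the health check reports a failure (exit 1).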
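The decision logic of the script, together with the `fall 2` setting of the vrrp_script block, can be modelled as below. This is only a sketch: the real check is the shell script shown above, both function names are hypothetical, and treating `fall` as a count of consecutive failures is a simplifying assumption about keepalived's behaviour.

```python
def health_check_exit_code(has_internal_ip: bool, gateway_reachable: bool) -> int:
    """Mirror the two lines of ha_check_script_1.sh as a pure function."""
    if not has_internal_ip:
        # `ip a | grep ... || exit 0`: node does not hold the address, report healthy
        return 0
    # `ping ... || exit 1`: fail only if the external gateway is unreachable
    return 0 if gateway_reachable else 1


def first_failed_run(exit_codes, fall: int = 2):
    """Return the 1-based run at which the script is considered failed, or None.

    Simplified model: the script is marked failed after `fall`
    consecutive non-zero exit codes.
    """
    consecutive = 0
    for run, code in enumerate(exit_codes, start=1):
        consecutive = consecutive + 1 if code != 0 else 0
        if consecutive >= fall:
            return run
    return None


if __name__ == "__main__":
    print(health_check_exit_code(True, False))   # master that lost its uplink
    print(health_check_exit_code(False, False))  # backup without the address
    print(first_failed_run([0, 1, 1]))
```

With `interval 5` and `fall 2`, two failing runs five seconds apart are needed before the instance enters the fault state in this model.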

  • Trigger the failure

Remove the external NIC from the bridge and check network connectivity:

[root@network-server1 ~]# ovs-vsctl del-port br-bond2 bond2

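Output of pinging the VM; connectivity is interrupted for about 3 seconds:

Reply from 60.191.72.103: bytes=32 time=12ms TTL=53
Reply from 60.191.72.103: bytes=32 time=5ms TTL=53
Request timed out.
Request timed out.
Request timed out.
Reply from 60.191.72.103: bytes=32 time=6ms TTL=53
Reply from 60.191.72.103: bytes=32 time=5ms TTL=53
  • Master and backup routers after the failure

keepalived failover log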

May 25 15:03:55 network-server1 Keepalived_vrrp[982249]: /var/lib/neutron/ha_confs/e762fefa-ddc8-4d03-9f31-9f1c613d85a3/ha_check_script_1.sh exited with status 1
May 25 15:03:56 network-server1 Keepalived_vrrp[987523]: /var/lib/neutron/ha_confs/949a39f7-e5d0-4e65-80d7-8a304969ad7b/ha_check_script_1.sh exited with status 1
May 25 15:03:59 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1
May 25 15:04:00 network-server1 Keepalived_vrrp[982249]: /var/lib/neutron/ha_confs/e762fefa-ddc8-4d03-9f31-9f1c613d85a3/ha_check_script_1.sh exited with status 1
May 25 15:04:00 network-server1 Keepalived_vrrp[982249]: VRRP_Script(ha_health_check_1) failed
May 25 15:04:01 network-server1 Keepalived_vrrp[982249]: VRRP_Instance(VR_1) Entering FAULT STATE
May 25 15:04:01 network-server1 Keepalived_vrrp[982249]: VRRP_Instance(VR_1) Now in FAULT state
May 25 15:04:01 network-server1 Keepalived_vrrp[987523]: /var/lib/neutron/ha_confs/949a39f7-e5d0-4e65-80d7-8a304969ad7b/ha_check_script_1.sh exited with status 1
May 25 15:04:01 network-server1 Keepalived_vrrp[987523]: VRRP_Script(ha_health_check_1) failed
May 25 15:04:03 network-server1 Keepalived_vrrp[987523]: VRRP_Instance(VR_1) Entering FAULT STATE
May 25 15:04:03 network-server1 Keepalived_vrrp[987523]: VRRP_Instance(VR_1) Now in FAULT state
May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1
May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: VRRP_Script(ha_health_check_1) failed
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Entering FAULT STATE
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) removing protocol Virtual Routes
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) removing protocol VIPs.
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) removing protocol E-VIPs.
May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Now in FAULT state

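Analysis:
The master router's health check fails and the router enters the FAULT state; the backup router then takes over the master's work and continues to provide service.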
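The log mixes entries from several routers on the same node, so it helps to filter by keepalived PID when measuring how long detection took. The sketch below is a hypothetical helper, not a Neutron or keepalived tool; it assumes the syslog line format shown above and that all events fall on the same day. The sample lines are copied from the log above.

```python
import re

LOG_RE = re.compile(
    r"^\w{3} +\d+ (\d{2}:\d{2}:\d{2}) \S+ Keepalived_vrrp\[(\d+)\]: (.*)$"
)


def _seconds(hms: str) -> int:
    hours, minutes, seconds = map(int, hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_fault(lines, pid: str):
    """Seconds from the first failed health check to 'Now in FAULT state'."""
    first_fail = fault = None
    for line in lines:
        match = LOG_RE.match(line)
        if not match or match.group(2) != pid:
            continue
        stamp, message = _seconds(match.group(1)), match.group(3)
        if first_fail is None and message.endswith("exited with status 1"):
            first_fail = stamp
        if message.endswith("Now in FAULT state"):
            fault = stamp
    if first_fail is None or fault is None:
        return None
    return fault - first_fail


SAMPLE_LOG = [
    "May 25 15:03:55 network-server1 Keepalived_vrrp[982249]: /var/lib/neutron/ha_confs/e762fefa-ddc8-4d03-9f31-9f1c613d85a3/ha_check_script_1.sh exited with status 1",
    "May 25 15:03:59 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1",
    "May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: /var/lib/neutron/ha_confs/df301506-a68e-47b2-be3d-73a34b40b1bf/ha_check_script_1.sh exited with status 1",
    "May 25 15:04:04 network-server1 Keepalived_vrrp[1061033]: VRRP_Script(ha_health_check_1) failed",
    "May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Entering FAULT STATE",
    "May 25 15:04:05 network-server1 Keepalived_vrrp[1061033]: VRRP_Instance(VR_1) Now in FAULT state",
]

if __name__ == "__main__":
    # PID 1061033 is the keepalived process of router df301506-...
    print(seconds_to_fault(SAMPLE_LOG, "1061033"))
```

For the router examined in this article, the first failed check is logged at 15:03:59 and the FAULT state at 15:04:05.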