最近项目中有开发人员使用在SpringBoot中使用hazelcast,出现了一些问题,然后分析了一下问题原因,记录一下。

一、背景

开发人员依照经验,在SpringBoot中是这样使用hazelcast的:

1. 在Classpath下创建了hazelcast的配置文件,其中<network>是这么配置的:

<network>
        <port auto-increment="true" port-count="3">5895</port>
    </network>

2.在业务代码中是这样加载hazelcast的配置的:

Config config = new ClasspathXmlConfig("hazelcast.xml");
NetworkConfig network = config.getNetworkConfig();
JoinConfig join = network.getJoin();
join.getMulticastConfig().setEnabled(false);
join.getTcpIpConfig().addMember(sysConfig.getMember()).setEnabled(true);
hazelcastInstance = Hazelcast.newHazelcastInstance(config);

其中的sysConfig.getMerber()最开始配置的是127.0.0.1

之所以这么写,目的是为了使用Tcp-ip组网,不使用multicast组网

二、问题

结果在系统运行过程中,发生了多次服务器CPU 100%的情况,此时通过查看系统日志发现出现了GC overhead limit错误,然后系统伴随着大量如下的重复日志:

2019-07-23 17:09:45.042  INFO 11120 --- [thread-Acceptor] c.h.nio.tcp.SocketAcceptorThread         : [192.168.21.64]:5896 [] [3.8.3] Accepting socket connection from /127.0.0.1:50790
2019-07-23 17:09:45.056  INFO 11120 --- [cached.thread-4] c.h.nio.tcp.TcpIpConnectionManager       : [192.168.21.64]:5896 [] [3.8.3] Established socket connection between /127.0.0.1:5896 and /127.0.0.1:50790
2019-07-23 17:09:45.056  INFO 11120 --- [cached.thread-1] c.h.nio.tcp.TcpIpConnectionManager       : [127.0.0.1]:5895 [] [3.8.3] Established socket connection between /127.0.0.1:50790 and /127.0.0.1:5896
2019-07-23 17:09:45.074  WARN 11120 --- [.IO.thread-in-0] c.h.nio.tcp.TcpIpConnectionManager       : [192.168.21.64]:5896 [] [3.8.3] Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:09:45.074  INFO 11120 --- [.IO.thread-in-0] com.hazelcast.nio.tcp.TcpIpConnection    : [192.168.21.64]:5896 [] [3.8.3] Connection[id=1, /127.0.0.1:5896->/127.0.0.1:50790, endpoint=null, alive=false, type=MEMBER] closed. Reason: Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:09:45.077  INFO 11120 --- [.IO.thread-in-0] com.hazelcast.nio.tcp.TcpIpConnection    : [127.0.0.1]:5895 [sms-token] [3.8.3] Connection[id=1, /127.0.0.1:50790->/127.0.0.1:5896, endpoint=[127.0.0.1]:5896, alive=false, type=NONE] closed. Reason: Connection closed by the other side
2019-07-23 17:10:24.914  INFO 11120 --- [thread-Acceptor] c.h.nio.tcp.SocketAcceptorThread         : [192.168.21.64]:5896 [] [3.8.3] Accepting socket connection from /127.0.0.1:50804
2019-07-23 17:10:24.915  INFO 11120 --- [cached.thread-4] c.h.nio.tcp.TcpIpConnectionManager       : [127.0.0.1]:5895 [] [3.8.3] Established socket connection between /127.0.0.1:50804 and /127.0.0.1:5896
2019-07-23 17:10:24.917  INFO 11120 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionManager       : [192.168.21.64]:5896 [] [3.8.3] Established socket connection between /127.0.0.1:5896 and /127.0.0.1:50804
2019-07-23 17:10:24.919  WARN 11120 --- [.IO.thread-in-1] c.h.nio.tcp.TcpIpConnectionManager       : [192.168.21.64]:5896 [] [3.8.3] Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:10:24.919  INFO 11120 --- [.IO.thread-in-1] com.hazelcast.nio.tcp.TcpIpConnection    : [192.168.21.64]:5896 [] [3.8.3] Connection[id=2, /127.0.0.1:5896->/127.0.0.1:50804, endpoint=null, alive=false, type=MEMBER] closed. Reason: Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:10:24.922  INFO 11120 --- [.IO.thread-in-1] com.hazelcast.nio.tcp.TcpIpConnection    : [127.0.0.1]:5895 [] [3.8.3] Connection[id=2, /127.0.0.1:50804->/127.0.0.1:5896, endpoint=[127.0.0.1]:5896, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side
2019-07-23 17:11:04.861  INFO 11120 --- [cached.thread-4] c.h.nio.tcp.TcpIpConnectionManager       : [127.0.0.1]:5895 [] [3.8.3] Established socket connection between /127.0.0.1:50819 and /127.0.0.1:5896

三、问题分析及解决

1. 通过日志中的WARN信息,找到“Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896”,这句告警是从hazelcast中BindHandler类中抛出的:

private boolean ensureValidBindTarget(TcpIpConnection connection, Address remoteEndPoint, Address localEndpoint,
                                          Address thisAddress) {
        if (ioService.isSocketBindAny() && !connection.isClient() && !thisAddress.equals(localEndpoint)) {
            String msg =
                    "Wrong bind request from " + remoteEndPoint + "! This node is not the requested endpoint: " + localEndpoint;
            logger.warning(msg);
            connection.close(msg, null);
            return false;
        }
        return true;
    }

从日志和源码可以看出,5895端在向5896端发送bind request时,会先验证是否可以绑定。而且5896端是以TcpIpConnnection的方式发送bind request的,而结果是localEndpoint并不是其可以请求的终端。为什么会出现这种情况呢?

然后我们将member从127.0.0.1改为实际的IP192.168.21.64,后产生了另外的重复日志:

2019-07-23 18:39:10.035  INFO 3672 --- [thread-Acceptor] c.h.nio.tcp.SocketAcceptorThread         : [192.168.21.64]:5896 [sms-token] [3.8.3] Accepting socket connection from /172.16.21.64:54418
2019-07-23 18:39:10.051  INFO 3672 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionManager       : [192.168.21.64]:5895 [sms-token] [3.8.3] Established socket connection between /172.16.21.64:54418 and /172.16.21.64:5896
2019-07-23 18:39:10.051  INFO 3672 --- [cached.thread-2] c.h.nio.tcp.TcpIpConnectionManager       : [192.168.21.64]:5896 [sms-token] [3.8.3] Established socket connection between /172.16.21.64:5896 and /172.16.21.64:54418
2019-07-23 18:39:10.083  WARN 3672 --- [ration.thread-0] c.h.i.cluster.impl.ClusterJoinManager    : [192.168.21.64]:5896 [sms-token] [3.8.3] Invalid join request from [172.16.21.64]:5895, cause: Incompatible joiners! expected: multicast, found: tcp-ip
2019-07-23 18:39:50.026  WARN 3672 --- [ration.thread-1] c.h.i.cluster.impl.ClusterJoinManager    : [192.168.21.64]:5896 [sms-token] [3.8.3] Invalid join request from [172.16.21.64]:5895, cause: Incompatible joiners! expected: multicast, found: tcp-ip
2019-07-23 18:40:30.026  WARN 3672 --- [ration.thread-0] c.h.i.cluster.impl.ClusterJoinManager    : [192.168.21.64]:5896 [sms-token] [3.8.3] Invalid join request from [172.16.21.64]:5895, cause: Incompatible joiners! expected: multicast, found: tcp-ip

其中的“ Incompatible joiners! expected: multicast, found: tcp-ip”让人无法理解,明明我们将multicast关闭了,为什么还会期望得到multicast的join?

然后我们查了下hazelcast的源码,这行warn在ConfigCheck类中:

private void verifyJoiner(ConfigCheck found) {
        if (!equals(joinerType, found.joinerType)) {
            throw new ConfigMismatchException("Incompatible joiners! expected: " + joinerType + ", found: " + found.joinerType);
        }
    }

其中的joinerType可以看出是从配置文件中通过流获取的。而配置文件中,我们的<network>确实没有配置multicast为false,没有配置的话,hazelcast默认是为true的。但我们明明在代码中load hazelcast.xml了呀,而且也将multicast设置为false了呀,为什么没有生效呢?而且为什么在同一台机器上127.0.0.1会去join127.0.0.1呢?而且还是以tcp-ip的形式发起join请求被multicast拒?

而且,以java代码的方式加载hazelcast的配置我们之前已经用过很多次了,没出过什么问题。这个时候,我们才开始怀疑,springboot中是不是自己加载了一遍hazelcast的配置?然后我们找到了系统的启动日志,发现hazelcast配置确实被加载了2遍!而且确实启动了2个节点,一个5895,一个5896!

因为系统启动时速度很快,日志一闪而过,而且没有报错抛异常,系统出现问题时我们看日志时都是从异常的地方找起,从没想过从系统启动的日志去看问题,所以这个过程之前都被我们忽略了。

最后,我们将代码加载hazelcast配置的代码去掉,统一从配置文件加载,问题解决。

至于hazelcast的这些重复日志和GC overhead limit 以及和CPU 100%的关系,目前只能从问题发生的现象看是有关联的,具体是否真的是有关系的,原因是什么,待分析。