Recently a developer on one of our projects ran into problems using Hazelcast in Spring Boot. I analyzed the root cause and am writing it down here.
一. Background
Based on prior experience, the developer set up Hazelcast in Spring Boot as follows:
1. A Hazelcast configuration file was created on the classpath, with the <network> section configured like this:
<network>
    <port auto-increment="true" port-count="3">5895</port>
</network>
2. The business code then loaded the Hazelcast configuration like this:
Config config = new ClasspathXmlConfig("hazelcast.xml");
NetworkConfig network = config.getNetworkConfig();
JoinConfig join = network.getJoin();
join.getMulticastConfig().setEnabled(false);                             // disable multicast discovery
join.getTcpIpConfig().addMember(sysConfig.getMember()).setEnabled(true); // join via TCP/IP instead
hazelcastInstance = Hazelcast.newHazelcastInstance(config);
The value returned by sysConfig.getMember() was initially configured as 127.0.0.1.
The intent of this code was to form the cluster over TCP/IP rather than multicast.
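As an aside, the <port> element in the configuration above means the member binds to the first free port among a small range of candidates. A minimal sketch (a hypothetical helper, not Hazelcast's actual code) of the candidate list that auto-increment="true" with port-count="3" and base port 5895 produces:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: with auto-increment="true" and port-count="3", a member tries the
// base port first and, if it is already taken, falls through to the next ports.
// This is why a second instance on the same host ends up on 5896.
public class PortCandidates {
    static List<Integer> candidates(int basePort, int portCount) {
        List<Integer> ports = new ArrayList<>();
        for (int i = 0; i < portCount; i++) {
            ports.add(basePort + i);
        }
        return ports;
    }

    public static void main(String[] args) {
        System.out.println(candidates(5895, 3)); // prints [5895, 5896, 5897]
    }
}
```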
二. The Problem
While the system was running, the server CPU repeatedly spiked to 100%. The system log showed a "GC overhead limit exceeded" error, accompanied by large volumes of repeated log entries like the following:
2019-07-23 17:09:45.042 INFO 11120 --- [thread-Acceptor] c.h.nio.tcp.SocketAcceptorThread : [192.168.21.64]:5896 [] [3.8.3] Accepting socket connection from /127.0.0.1:50790
2019-07-23 17:09:45.056 INFO 11120 --- [cached.thread-4] c.h.nio.tcp.TcpIpConnectionManager : [192.168.21.64]:5896 [] [3.8.3] Established socket connection between /127.0.0.1:5896 and /127.0.0.1:50790
2019-07-23 17:09:45.056 INFO 11120 --- [cached.thread-1] c.h.nio.tcp.TcpIpConnectionManager : [127.0.0.1]:5895 [] [3.8.3] Established socket connection between /127.0.0.1:50790 and /127.0.0.1:5896
2019-07-23 17:09:45.074 WARN 11120 --- [.IO.thread-in-0] c.h.nio.tcp.TcpIpConnectionManager : [192.168.21.64]:5896 [] [3.8.3] Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:09:45.074 INFO 11120 --- [.IO.thread-in-0] com.hazelcast.nio.tcp.TcpIpConnection : [192.168.21.64]:5896 [] [3.8.3] Connection[id=1, /127.0.0.1:5896->/127.0.0.1:50790, endpoint=null, alive=false, type=MEMBER] closed. Reason: Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:09:45.077 INFO 11120 --- [.IO.thread-in-0] com.hazelcast.nio.tcp.TcpIpConnection : [127.0.0.1]:5895 [sms-token] [3.8.3] Connection[id=1, /127.0.0.1:50790->/127.0.0.1:5896, endpoint=[127.0.0.1]:5896, alive=false, type=NONE] closed. Reason: Connection closed by the other side
2019-07-23 17:10:24.914 INFO 11120 --- [thread-Acceptor] c.h.nio.tcp.SocketAcceptorThread : [192.168.21.64]:5896 [] [3.8.3] Accepting socket connection from /127.0.0.1:50804
2019-07-23 17:10:24.915 INFO 11120 --- [cached.thread-4] c.h.nio.tcp.TcpIpConnectionManager : [127.0.0.1]:5895 [] [3.8.3] Established socket connection between /127.0.0.1:50804 and /127.0.0.1:5896
2019-07-23 17:10:24.917 INFO 11120 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionManager : [192.168.21.64]:5896 [] [3.8.3] Established socket connection between /127.0.0.1:5896 and /127.0.0.1:50804
2019-07-23 17:10:24.919 WARN 11120 --- [.IO.thread-in-1] c.h.nio.tcp.TcpIpConnectionManager : [192.168.21.64]:5896 [] [3.8.3] Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:10:24.919 INFO 11120 --- [.IO.thread-in-1] com.hazelcast.nio.tcp.TcpIpConnection : [192.168.21.64]:5896 [] [3.8.3] Connection[id=2, /127.0.0.1:5896->/127.0.0.1:50804, endpoint=null, alive=false, type=MEMBER] closed. Reason: Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896
2019-07-23 17:10:24.922 INFO 11120 --- [.IO.thread-in-1] com.hazelcast.nio.tcp.TcpIpConnection : [127.0.0.1]:5895 [] [3.8.3] Connection[id=2, /127.0.0.1:50804->/127.0.0.1:5896, endpoint=[127.0.0.1]:5896, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side
2019-07-23 17:11:04.861 INFO 11120 --- [cached.thread-4] c.h.nio.tcp.TcpIpConnectionManager : [127.0.0.1]:5895 [] [3.8.3] Established socket connection between /127.0.0.1:50819 and /127.0.0.1:5896
三. Analysis and Solution
1. Starting from the WARN entry "Wrong bind request from [127.0.0.1]:5895! This node is not requested endpoint: [127.0.0.1]:5896", we traced the warning to Hazelcast's BindHandler class:
private boolean ensureValidBindTarget(TcpIpConnection connection, Address remoteEndPoint,
                                      Address localEndpoint, Address thisAddress) {
    if (ioService.isSocketBindAny() && !connection.isClient() && !thisAddress.equals(localEndpoint)) {
        String msg = "Wrong bind request from " + remoteEndPoint
                + "! This node is not the requested endpoint: " + localEndpoint;
        logger.warning(msg);
        connection.close(msg, null);
        return false;
    }
    return true;
}
From the log and the source we can see that when the node on 5895 sends a bind request to the node on 5896, the target first validates the bind. The 5895 side sent the bind request over a TcpIpConnection, but the target decided that localEndpoint was not an endpoint it would accept. Why would this happen?
We then changed the member address from 127.0.0.1 to the host's actual IP, 192.168.21.64, which produced a different set of repeated log entries:
2019-07-23 18:39:10.035 INFO 3672 --- [thread-Acceptor] c.h.nio.tcp.SocketAcceptorThread : [192.168.21.64]:5896 [sms-token] [3.8.3] Accepting socket connection from /172.16.21.64:54418
2019-07-23 18:39:10.051 INFO 3672 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionManager : [192.168.21.64]:5895 [sms-token] [3.8.3] Established socket connection between /172.16.21.64:54418 and /172.16.21.64:5896
2019-07-23 18:39:10.051 INFO 3672 --- [cached.thread-2] c.h.nio.tcp.TcpIpConnectionManager : [192.168.21.64]:5896 [sms-token] [3.8.3] Established socket connection between /172.16.21.64:5896 and /172.16.21.64:54418
2019-07-23 18:39:10.083 WARN 3672 --- [ration.thread-0] c.h.i.cluster.impl.ClusterJoinManager : [192.168.21.64]:5896 [sms-token] [3.8.3] Invalid join request from [172.16.21.64]:5895, cause: Incompatible joiners! expected: multicast, found: tcp-ip
2019-07-23 18:39:50.026 WARN 3672 --- [ration.thread-1] c.h.i.cluster.impl.ClusterJoinManager : [192.168.21.64]:5896 [sms-token] [3.8.3] Invalid join request from [172.16.21.64]:5895, cause: Incompatible joiners! expected: multicast, found: tcp-ip
2019-07-23 18:40:30.026 WARN 3672 --- [ration.thread-0] c.h.i.cluster.impl.ClusterJoinManager : [192.168.21.64]:5896 [sms-token] [3.8.3] Invalid join request from [172.16.21.64]:5895, cause: Incompatible joiners! expected: multicast, found: tcp-ip
The message "Incompatible joiners! expected: multicast, found: tcp-ip" was baffling: we had explicitly disabled multicast, so why would a node still expect a multicast join?
Searching the Hazelcast source, we found this warning raised in the ConfigCheck class:
private void verifyJoiner(ConfigCheck found) {
    if (!equals(joinerType, found.joinerType)) {
        throw new ConfigMismatchException("Incompatible joiners! expected: " + joinerType
                + ", found: " + found.joinerType);
    }
}
The joinerType here is derived from the loaded configuration. In our hazelcast.xml, the <network> section never set multicast to false, and when it is not configured Hazelcast defaults it to true. But we had loaded hazelcast.xml in code and explicitly disabled multicast there — why did that not take effect? And why, on a single machine, was 127.0.0.1 trying to join 127.0.0.1 at all, with a tcp-ip join request being rejected by a node expecting multicast?
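To make the mismatch concrete, here is a minimal, self-contained sketch (hypothetical class and method names, simplified from the real ConfigCheck) of how each node derives its joiner type from its own configuration and rejects a join of a different type:

```java
// Sketch only, not Hazelcast's actual implementation. Each node computes its
// joiner type from its *own* configuration; a node that loaded hazelcast.xml
// as-is still has multicast enabled (the default), so it expects "multicast"
// and rejects our "tcp-ip" join request.
public class JoinerCheck {
    static String joinerType(boolean multicastEnabled, boolean tcpIpEnabled) {
        if (multicastEnabled) return "multicast";
        if (tcpIpEnabled) return "tcp-ip";
        return "none";
    }

    // Returns an error message on mismatch, or null when the joiners agree.
    static String verifyJoiner(String expected, String found) {
        if (!expected.equals(found)) {
            return "Incompatible joiners! expected: " + expected + ", found: " + found;
        }
        return null;
    }

    public static void main(String[] args) {
        // A node that loaded hazelcast.xml untouched: multicast still on by default.
        String expected = joinerType(true, false);
        // Our programmatically configured node: multicast off, tcp-ip on.
        String found = joinerType(false, true);
        System.out.println(verifyJoiner(expected, found));
    }
}
```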
We had loaded Hazelcast configuration from Java code this way many times before without trouble. Only now did we begin to suspect that Spring Boot was loading the Hazelcast configuration a second time on its own: Spring Boot's Hazelcast auto-configuration creates its own HazelcastInstance whenever Hazelcast and a configuration file are found on the classpath and no HazelcastInstance bean is defined. Checking the application startup log confirmed it: the Hazelcast configuration was indeed loaded twice, and two members were started, one on 5895 and one on 5896!
Because the application starts quickly, the startup log scrolls past in a flash, and no error or exception is thrown at that point. Whenever problems occurred we always read the log starting from the exception, never from startup, so this had been overlooked the whole time.
In the end we removed the programmatic configuration code and loaded everything from the configuration file alone, which resolved the problem.
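For reference, a declarative equivalent of the programmatic settings might look like the sketch below (the member address is the actual IP from this deployment; adjust for your own environment). Spring Boot picks up a hazelcast.xml on the classpath automatically, or the file can be pointed to explicitly via the spring.hazelcast.config property:

```xml
<network>
    <port auto-increment="true" port-count="3">5895</port>
    <join>
        <multicast enabled="false"/>
        <tcp-ip enabled="true">
            <member>192.168.21.64</member>
        </tcp-ip>
    </join>
</network>
```

With the join strategy declared in the file, every loader of the configuration (including Spring Boot's auto-configuration) sees multicast disabled and tcp-ip enabled, so no "Incompatible joiners" mismatch can arise.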
As for the relationship between these repeated Hazelcast logs, the GC overhead limit error, and the 100% CPU usage: the symptoms suggest they are related, but whether there is truly a causal link, and what the mechanism would be, remains to be analyzed.