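1 After a node was added to the test environment, the broker log suddenly filled with exceptions, although newly created topics still worked normally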
1.1
[2021-10-29 23:59:59,840] ERROR [ReplicaFetcherThread-0-33], Error for partition [585cd97cab31fb583f7338f2,10] to broker 33:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
The exception above appeared in large numbers, occasionally followed by this one:
1.2 kafka.common.NotAssignedReplicaException: Leader 57 failed to record follower 58's position 7387 since the replica is not recognized to be one of the assigned replicas 57 for partition c13820d7b5ca40eaad8a744531c99519-0.
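The topic named in this exception is an existing topic that had not been deleted.
For the second exception, I checked ZooKeeper and found several ZooKeeper processes running. A colleague had probably noticed the errors and started ZK again without stopping the original process.
After killing all ZK processes and starting ZooKeeper once, the second exception (the one involving live topics) no longer appeared. The exceptions below, however, still did.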
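A quick way to spot leftover ZooKeeper processes like the ones described above is to count server JVMs in the process list. This is a minimal sketch, assuming ZooKeeper was started through its standard scripts so that each server's command line contains the main class org.apache.zookeeper.server.quorum.QuorumPeerMain. The function name count_zk_servers is made up for illustration.

```shell
# Count ZooKeeper server JVMs in a process listing read from stdin.
# The [M] bracket keeps grep from matching its own command line
# when the listing comes from ps.
count_zk_servers() {
    grep -c 'quorum[.]QuorumPeer[M]ain' || true
}

# Usage on a live host (not run here):
#   ps -eo pid,args | count_zk_servers
```

In a setup with one ZooKeeper server per host, a count above 1 suggests an old instance is still running next to the restarted one.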
2 The following exceptions appeared in the broker log
Error for partition [585cd97cab31fb583f7338f2,1] to broker 33:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: Should not set log end offset on partition's local replication 58
From the documentation:
kafka.common.UnknownException(kafka.server.ReplicaFetcherThread)
A common problem is that more than one broker registered the same host/port in Zookeeper. As a result, the replica fetcher is confused when fetching data from the leader. To verify that, you can use a Zookeeper client shell to list the registration info of each broker. The Zookeeper path and the format of the broker registration is described in Kafka data structures in Zookeeper. You want to make sure that all the registered brokers have unique host/port.
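The gist is that two brokers must not register the same host and port in ZooKeeper. At first I assumed two brokers were running on the same machine, but a check showed that was not the case.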
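The uniqueness check that the documentation describes can be sketched as follows. In zkCli.sh, `ls /brokers/ids` lists the registered broker ids and `get /brokers/ids/<id>` shows each registration. The function below assumes the host and port have already been copied out into one "broker_id host port" line per broker. That intermediate format, the function name, and the sample values in the usage comment are illustrative and not taken from the actual cluster.

```shell
# Print every host:port that is registered by more than one broker.
# Input on stdin: one "broker_id host port" triple per line.
find_duplicate_endpoints() {
    awk '{ print $2 ":" $3 }' | sort | uniq -d
}

# Usage with made-up sample registrations:
#   printf '%s\n' '57 192.168.5.57 9092' '58 localhost 9092' '33 localhost 9092' \
#       | find_duplicate_endpoints
```

Empty output means every broker registered a distinct host/port pair.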
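When host.name is not set in server.properties, the broker binds to all interfaces, and the host name it registers falls back to the machine's own host name, which here was localhost.
So with this parameter left unset, each broker was effectively registered as 127.0.0.1, and when the brokers in the cluster replicated from one another they could not reach the right host through 127.0.0.1.
After setting host.name in server.properties to the machine's IP, the exceptions above went away.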
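To confirm what a broker actually registered, the host field of its ZooKeeper registration can be inspected. The sketch below assumes the registration under /brokers/ids/<id> is a single-line JSON object containing a "host":"..." field without spaces, as older Kafka versions write it. Both function names are hypothetical helpers.

```shell
# Extract the value of the "host" field from a broker registration
# (JSON text read from stdin). Lines without the field are ignored.
registered_host() {
    sed -n 's/.*"host":"\([^"]*\)".*/\1/p'
}

# Succeed if the given host is a loopback name or address.
is_loopback_host() {
    case "$1" in
        localhost|127.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live host (not run here):
#   h=$(zkCli.sh -server localhost:2181 get /brokers/ids/57 | registered_host)
#   is_loopback_host "$h" && echo "broker 57 registered a loopback host: $h"
```

A loopback value here means other brokers will try to fetch from themselves instead of from the real leader. On newer Kafka versions the corresponding settings are listeners and advertised.listeners rather than host.name.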
3 Large numbers of the following exception in the broker log
The flood of exceptions now concerned only this already deleted topic: 585cd97cab31fb583f7338f2
kafka.common.NotAssignedReplicaException: Leader 58 failed to record follower 57's position -1 since the replica is not recognized to be one of the assigned replicas for partition 585cd97cab31fb583f7338f2-6.
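The message says that leader 58 failed to record the position of follower 57 because that replica is not in the set of replicas assigned to the partition.
The partition information shows that 58 is neither a leader nor a follower: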
[kafka-01 bin]$ ./kafka-topics.sh --describe --zookeeper localhost:2181 --topic 585cd97cab31fb583f7338f2
Topic:585cd97cab31fb583f7338f2 PartitionCount:12 ReplicationFactor:2 Configs:
Topic: 585cd97cab31fb583f7338f2 Partition: 0 Leader: 57 Replicas: 57,33 Isr: 57,33
Topic: 585cd97cab31fb583f7338f2 Partition: 1 Leader: 33 Replicas: 33,57 Isr: 33,57
Topic: 585cd97cab31fb583f7338f2 Partition: 2 Leader: 57 Replicas: 57,33 Isr: 57,33
Topic: 585cd97cab31fb583f7338f2 Partition: 3 Leader: 33 Replicas: 33,57 Isr: 33,57
Topic: 585cd97cab31fb583f7338f2 Partition: 4 Leader: 57 Replicas: 57,33 Isr: 57
Topic: 585cd97cab31fb583f7338f2 Partition: 5 Leader: 33 Replicas: 33,57 Isr: 33,57
Topic: 585cd97cab31fb583f7338f2 Partition: 6 Leader: 57 Replicas: 57,33 Isr: 57,33
Topic: 585cd97cab31fb583f7338f2 Partition: 7 Leader: 33 Replicas: 33,57 Isr: 33,57
Topic: 585cd97cab31fb583f7338f2 Partition: 8 Leader: 57 Replicas: 57,33 Isr: 57,33
Topic: 585cd97cab31fb583f7338f2 Partition: 9 Leader: 33 Replicas: 33,57 Isr: 33,57
Topic: 585cd97cab31fb583f7338f2 Partition: 10 Leader: 57 Replicas: 57,33 Isr: 57
Topic: 585cd97cab31fb583f7338f2 Partition: 11 Leader: 33 Replicas: 33,57 Isr: 33,57
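Scanning the describe output by eye works for 12 partitions; for larger topics the same check can be scripted. This is a minimal sketch that assumes the "Partition: N ... Replicas: a,b" layout shown above. The function name partitions_with_replica is made up for illustration.

```shell
# Print the partition numbers whose Replicas list contains the given
# broker id. Reads kafka-topics.sh --describe output from stdin.
partitions_with_replica() {
    awk -v id="$1" '
        /Partition:/ {
            part = ""; reps = ""
            # Pick up the values that follow the two labels.
            for (i = 1; i < NF; i++) {
                if ($i == "Partition:") part = $(i + 1)
                if ($i == "Replicas:")  reps = $(i + 1)
            }
            n = split(reps, r, ",")
            for (j = 1; j <= n; j++) {
                if (r[j] == id) { print part; break }
            }
        }'
}

# Usage on a live host (not run here):
#   ./kafka-topics.sh --describe --zookeeper localhost:2181 \
#       --topic 585cd97cab31fb583f7338f2 | partitions_with_replica 58
```

For the output above, broker 58 would produce no lines at all, which matches the observation that it holds no replica of this topic.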
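I deleted the topic manually, and the partitions of 585cd97cab31fb583f7338f2 were removed (cleaned up in both Kafka and ZK), but the topic reappeared in ZK after every Kafka restart.
Deleting the old partitions: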
./kafka-topics.sh --delete --zookeeper 192.168.5.57:2181 --topic 585cd97cab31fb583f7338f2
rmr /brokers/topics/585cd97cab31fb583f7338f2
rmr /config/topics/585cd97cab31fb583f7338f2
rmr /admin/delete_topics/585cd97cab31fb583f7338f2
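After deleting the partitions and restarting Kafka, the problem was the same, and the topic still existed, so the cleanup had not succeeded.
Checking in zkCli.sh confirmed that the topic's nodes were indeed still there: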
/brokers/topics/585cd97cab31fb583f7338f2
/config/topics/585cd97cab31fb583f7338f2
/admin/delete_topics/585cd97cab31fb583f7338f2
Running the command below showed that the partition offsets of topic 585cd97cab31fb583f7338f2 were still growing:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 192.168.5.57:9092 --topic 585cd97cab31fb583f7338f2
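So the deletion had effectively failed. My analysis is that some program was still writing to this topic, which caused the same topic and partitions to be created again automatically.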
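Whether offsets are still growing can be checked by saving the GetOffsetShell output twice, a short time apart, and comparing the two snapshots. The sketch below assumes the tool prints one topic:partition:offset line per partition and that the topic name contains no colon. The function name and the snapshot file names are hypothetical.

```shell
# Compare two offset snapshots and print "partition old new" for every
# partition whose offset increased between the first and second file.
grown_partitions() {
    awk -F: '
        NR == FNR { before[$2] = $3; next }   # first file: remember offsets
        ($2 in before) && ($3 + 0 > before[$2] + 0) {
            print $2, before[$2], $3
        }' "$1" "$2"
}

# Usage on a live host (not run here):
#   ./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 192.168.5.57:9092 \
#       --topic 585cd97cab31fb583f7338f2 > before.txt
#   sleep 60
#   ./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 192.168.5.57:9092 \
#       --topic 585cd97cab31fb583f7338f2 > after.txt
#   grown_partitions before.txt after.txt
```

Any output at all means a producer is still writing to the supposedly deleted topic.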
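According to the material I found, deleting the topic for good requires setting auto.create.topics.enable=false in the server.properties configuration file, which disables automatic topic creation.
Restart Kafka, then run the topic deletion steps above again.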
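Before repeating the deletion it is worth confirming that every broker's configuration really disables auto-creation, since the default for auto.create.topics.enable is true. This is a minimal sketch; the function name and the config path in the usage comment are assumptions.

```shell
# Succeed only if the properties file contains an uncommented
# auto.create.topics.enable=false line.
auto_create_disabled() {
    grep -Eq '^[[:space:]]*auto\.create\.topics\.enable[[:space:]]*=[[:space:]]*false[[:space:]]*$' "$1"
}

# Usage on a live host (not run here; the path is hypothetical):
#   auto_create_disabled ../config/server.properties \
#       || echo "auto topic creation is still enabled"
```

The check has to pass on every broker; a single broker that still auto-creates topics is enough for the deleted topic to come back.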
After these steps, the broker log was finally back to normal.