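I have used Kafka for quite a while, but I have yet to find an open-source tool well suited to monitoring it at a fine granularity. That does not stop us from summarizing how to monitor Kafka. The most official approach is to monitor Kafka through its metrics values, and we currently use jmxtrans to collect them. Kafka monitoring covers three areas: broker monitoring, consumer monitoring, and producer monitoring. The specific MBeans for all three can be inspected with jconsole.
Before looking at the MBeans, let us first go over Metrics (Kafka uses Yammer Metrics). Metrics is a measurement toolkit that offers several metric types for tracking a program's indicators; embedding Metrics calls in Java code makes it easy to monitor each indicator of the business code. Metrics supports JMX and exposes monitoring data through it by default, so developers can easily read the measurements over JMX. It also supports other reporters such as CSV and console, integrates with Ehcache, Apache HttpClient, JDBI, Jersey, Jetty, Log4J, Logback, the JVM, and others, and can send metrics to Ganglia or Graphite for graphical display. Metrics provides five metric types, plus a separate health-check module: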
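- gauge: the simplest measurement, generally used to report instantaneous state, such as the number of jobs in the pending state in a system.
- counter: a special case of gauge that maintains a count, which can be modified through the inc() and dec() methods. It is generally used to record how many times an event has occurred or how many requests have been made.
- meters: measure the average rate of events over a period of time (for example, requests per second). The results include the total number of requests, the mean number of requests per second, and the average TPS over the last 1, 5, and 15 minutes.
- histograms: mainly used to describe the distribution of data: maximum, minimum, mean, median, and percentiles (75%, 90%, 95%, 98%, 99%, and 99.9%). For example, to track the distribution of response times for requests to a page, use this type of metric.
- timers: mainly used to measure the execution time of a block of code and its distribution; they are implemented on top of histograms and meters.
- health checks: used to check whether an application, its submodules, or related modules are running normally. This module is separate from metrics-core; to use it, import the metrics-healthchecks package.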
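To make the difference between the metric types concrete, here is a minimal sketch in Python. It is an analogy only, not the Yammer Metrics API: the `Counter` and `Histogram` classes, the `pending_jobs` list, and the sample latencies are all hypothetical, and the histogram keeps every sample where a real library would use a bounded reservoir.

```python
import math


class Counter:
    """A count that is changed only through inc() and dec()."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Histogram:
    """Records samples and reports their distribution (hypothetical, unbounded)."""

    def __init__(self):
        self.samples = []

    def update(self, value):
        self.samples.append(value)

    def percentile(self, p):
        # Nearest-rank percentile over all recorded samples.
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


# A gauge simply reads the current state when asked.
pending_jobs = ["job-a", "job-b", "job-c"]
gauge = lambda: len(pending_jobs)

requests = Counter()
latencies_ms = Histogram()
for ms in [12, 15, 11, 240, 13, 14, 16, 12, 13, 15]:
    requests.inc()           # counter: how many requests were seen
    latencies_ms.update(ms)  # histogram: how their latencies are distributed

print(gauge(), requests.count, latencies_ms.percentile(50), latencies_ms.percentile(95))
```

The single slow request barely moves the median but dominates the 95th percentile, which is why histograms and timers report percentiles rather than just a mean.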
Broker monitoring
Broker metrics
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Number of under-replicated partitions (|ISR| < |all replicas|). Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=OfflinePartitionsCount Number of partitions that don’t have an active leader and are hence not writable or readable. Alert if value is greater than 0.
kafka.controller:type=KafkaController,name=ActiveControllerCount Number of active controllers in the cluster. Alert if value is anything other than 1.
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec Aggregate incoming message rate.
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec Aggregate incoming byte rate.
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec Aggregate outgoing byte rate.
kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce or FetchConsumer or FetchFollower} Request rate.
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs Log flush rate and time.
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs Leader election rate and latency.
kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec Unclean leader election rate.
kafka.server:type=ReplicaManager,name=PartitionCount Number of partitions on this broker. This should be mostly even across all brokers.
kafka.server:type=ReplicaManager,name=LeaderCount Number of leaders on this broker. This should be mostly even across all brokers. If not, set auto.leader.rebalance.enable to true on all brokers in the cluster.
kafka.server:type=ReplicaManager,name=IsrShrinksPerSec If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.
kafka.server:type=ReplicaManager,name=IsrExpandsPerSec When a broker is brought up after a failure, it starts catching up by reading from the leader. Once it is caught up, it gets added back to the ISR.
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica Maximum lag in messages between the follower and leader replicas. This is controlled by the replica.lag.max.messages config.
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+) Lag in number of messages per follower replica. This is useful to know if the replica is slow or has stopped replicating from the leader.
kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce or FetchConsumer or FetchFollower} Total time in ms to serve the specified request.
kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize Number of requests waiting in the producer purgatory. This should be non-zero if acks=-1 is used on the producer.
kafka.server:type=FetchRequestPurgatory,name=PurgatorySize Number of requests waiting in the fetch purgatory. This is high if consumers use a large value for fetch.wait.max.ms.
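The alert thresholds stated in the list above can be sketched as a small check. This is a minimal illustration in Python, assuming the metric values have already been collected into a dictionary per broker; the `broker_alerts` function, the broker names, and the choice to sum ActiveControllerCount across brokers to get the cluster-wide figure are assumptions, not part of Kafka or jmxtrans.

```python
def broker_alerts(per_broker):
    """Apply the thresholds from the metric list.

    per_broker maps a (hypothetical) broker name to {metric name: value}.
    """
    alerts = []
    for broker, metrics in sorted(per_broker.items()):
        # Alert if either partition-health gauge is greater than 0.
        for name in ("UnderReplicatedPartitions", "OfflinePartitionsCount"):
            value = metrics.get(name, 0)
            if value > 0:
                alerts.append(f"{broker}: {name}={value}")

    # Assumption: the cluster-wide controller count is the sum over brokers.
    active = sum(m.get("ActiveControllerCount", 0) for m in per_broker.values())
    if active != 1:
        alerts.append(f"cluster: ActiveControllerCount={active}")
    return alerts


sample = {
    "broker-1": {"UnderReplicatedPartitions": 0, "OfflinePartitionsCount": 0, "ActiveControllerCount": 1},
    "broker-2": {"UnderReplicatedPartitions": 3, "OfflinePartitionsCount": 0, "ActiveControllerCount": 0},
}
print(broker_alerts(sample))
```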
Sample production jmxtrans-agent.xml
<jmxtrans-agent>
<queries>
<!-- Message in rate -->
<query objectName="kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec" attributes="MeanRate,OneMinuteRate" resultAlias="Kafka.BrokerTopicMetrics.MessagesInPerSec.#attribute#"/>
<!-- Byte in rate -->
<query objectName="kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec" attributes="MeanRate,OneMinuteRate" resultAlias="Kafka.BrokerTopicMetrics.BytesInPerSec.#attribute#"/>
<!-- Byte out rate -->
<query objectName="kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec" attributes="MeanRate,OneMinuteRate" resultAlias="Kafka.BrokerTopicMetrics.BytesOutPerSec.#attribute#"/>
<!-- Log flush rate and time -->
<query objectName="kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs" attributes="OneMinuteRate" resultAlias="Kafka.LogFlushStats.FlushRateAndTimeMs.#attribute#"/>
<!-- Request rate -->
<query objectName="kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce" attributes="OneMinuteRate" resultAlias="Kafka.RequestsPerSec.Produce.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer" attributes="OneMinuteRate" resultAlias="Kafka.RequestsPerSec.FetchConsumer.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower" attributes="OneMinuteRate" resultAlias="Kafka.RequestsPerSec.FetchFollower.#attribute#"/>
<!-- Log flush rate and time -->
<query objectName="kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs" resultAlias="Kafka.LogFlushStats.LogFlushRateAndTimeMs.#attribute#"/>
<!-- Partition counts -->
<query objectName="kafka.server:type=ReplicaManager,name=PartitionCount" attribute="Value" resultAlias="Kafka.Topic.PartitionCount.#attribute#"/>
<!-- Number of under-replicated partitions (|ISR| < |all replicas|) -->
<query objectName="kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions" resultAlias="Kafka.ReplicaManager.UnderReplicatedPartitions.#attribute#"/>
<!-- Is controller active on broker -->
<query objectName="kafka.controller:type=KafkaController,name=ActiveControllerCount" resultAlias="Kafka.KafkaController.ActiveControllerCount.#attribute#"/>
<!-- Leader election rate -->
<query objectName="kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.ControllerStats.LeaderElectionRateAndTimeMs.#attribute#"/>
<!-- Leader replica counts -->
<query objectName="kafka.server:type=ReplicaManager,name=LeaderCount" resultAlias="Kafka.ReplicaManager.LeaderCount.#attribute#"/>
<!-- Max lag in messages btw follower and leader replicas -->
<query objectName="kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica" resultAlias="Kafka.ReplicaFetcherManager.MaxLag.#attribute#"/>
<!-- Lag in messages per follower replica -->
<!-- JMX ObjectName patterns take * wildcards, not regular expressions; %key% in resultAlias is filled in from the matched ObjectName -->
<query objectName="kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=*,topic=*,partition=*" resultAlias="Kafka.ConsumerLag.%clientId%.%topic%.%partition%.#attribute#"/>
<!-- Requests waiting in the producer purgatory -->
<query objectName="kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize" resultAlias="Kafka.ProducerRequestPurgatory.PurgatorySize.#attribute#"/>
<!-- Requests waiting in the fetch purgatory -->
<query objectName="kafka.server:type=FetchRequestPurgatory,name=PurgatorySize" resultAlias="Kafka.FetchRequestPurgatory.PurgatorySize.#attribute#"/>
<!-- Request total time -->
<query objectName="kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.TotalTimeMs.Produce.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.TotalTimeMs.FetchConsumer.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.TotalTimeMs.FetchFollower.#attribute#"/>
<!--Time the request waiting in the request queue -->
<query objectName="kafka.network:type=RequestMetrics,name=QueueTimeMs,request=Produce" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.QueueTimeMs.Produce.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=QueueTimeMs,request=FetchConsumer" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.QueueTimeMs.FetchConsumer.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=QueueTimeMs,request=FetchFollower" attributes="OneMinuteRate,Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.QueueTimeMs.FetchFollower.#attribute#"/>
<!-- Time the request being processed at the leader -->
<query objectName="kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.LocalTimeMs.Produce.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.LocalTimeMs.FetchConsumer.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.LocalTimeMs.FetchFollower.#attribute#"/>
<!-- Time the request waits for the follower -->
<query objectName="kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.RemoteTimeMs.Produce.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.RemoteTimeMs.FetchConsumer.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.RemoteTimeMs.FetchFollower.#attribute#"/>
<!-- Time to send the response -->
<query objectName="kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=Produce" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.ResponseSendTimeMs.Produce.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchConsumer" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.ResponseSendTimeMs.FetchConsumer.#attribute#"/>
<query objectName="kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request=FetchFollower" attributes="Max,Min,75thPercentile,95thPercentile" resultAlias="Kafka.RequestMetrics.ResponseSendTimeMs.FetchFollower.#attribute#"/>
<!-- Number of messages the consumer lags behind the producer by -->
<query objectName="kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=*" resultAlias="Kafka.ConsumerFetcherManager.MaxLag.%clientId%.#attribute#"/>
<!-- The average fraction of time the network processors are idle -->
<query objectName="kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent" resultAlias="Kafka.SocketServer.NetworkProcessorAvgIdlePercent.#attribute#"/>
<!-- The average fraction of time the request handler threads are idle -->
<query objectName="kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent" attributes="OneMinuteRate" resultAlias="Kafka.KafkaRequestHandlerPool.RequestHandlerAvgIdlePercent.#attribute#"/>
<!-- Quota metrics per client-id -->
<query objectName="kafka.server:type=Produce,client-id=*" resultAlias="Kafka.Produce.%client-id%.#attribute#"/>
<query objectName="kafka.server:type=Fetch,client-id=*" resultAlias="Kafka.Fetch.%client-id%.#attribute#"/>
</queries>
<!--
<outputWriter class="org.jmxtrans.agent.RollingFileOutputWriter">
<fileName>/tmp/roll-jmxing.log</fileName>
<maxFileSize>1024</maxFileSize>
<maxBackupIndex>10</maxBackupIndex>
</outputWriter>
-->
<outputWriter class="org.jmxtrans.agent.FileOverwriterOutputWriter">
<fileName>/tmp/jmxing.log</fileName>
<showTimeStamp>false</showTimeStamp>
</outputWriter>
<collectIntervalInSeconds>60</collectIntervalInSeconds>
</jmxtrans-agent>
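Each query in the sample names its output through resultAlias, with `#attribute#` standing in for the attribute being collected. The Python sketch below illustrates which metric names one query would produce, assuming `#attribute#` is replaced by each entry of the comma-separated attributes list; `expand_alias` is a hypothetical helper for illustration, not part of jmxtrans-agent.

```python
def expand_alias(result_alias, attributes):
    """Return one output metric name per attribute in the comma-separated list."""
    return [
        result_alias.replace("#attribute#", attribute.strip())
        for attribute in attributes.split(",")
    ]


names = expand_alias(
    "Kafka.BrokerTopicMetrics.MessagesInPerSec.#attribute#",
    "MeanRate,OneMinuteRate",
)
for name in names:
    print(name)
```

Keeping the alias hierarchy aligned with the MBean domain and type, as the sample does, makes the resulting names easy to group in a graphing tool.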
Consumer monitoring
Fetch Metrics: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
records-lag-max The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
fetch-size-avg The average number of bytes fetched per request.
fetch-size-max The max number of bytes fetched per request.
bytes-consumed-rate The average number of bytes consumed per second.
records-per-request-avg The average number of records in each request.
records-consumed-rate The average number of records consumed per second.
fetch-rate The number of fetch requests per second.
fetch-latency-avg The average time taken for a fetch request.
fetch-latency-max The max time taken for a fetch request.
fetch-throttle-time-avg The average throttle time in ms. When quotas are enabled, the broker may delay fetch requests in order to throttle a consumer which has exceeded its limit. This metric indicates how much throttling time has been added to fetch requests on average.
fetch-throttle-time-max The maximum throttle time in ms.
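The description of records-lag-max says that a value increasing over time is the best sign that the consumer group is falling behind. A minimal Python sketch of that check follows, assuming the metric is sampled at a fixed interval; the `lag_is_growing` function and the `min_points` parameter are hypothetical, and a production alert would probably tolerate occasional dips rather than require a strict increase.

```python
def lag_is_growing(samples, min_points=3):
    """True when every records-lag-max sample is larger than the one before it."""
    if len(samples) < min_points:
        return False  # too few samples to call it a trend
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


print(lag_is_growing([100, 250, 900, 4000]))  # steadily rising lag
print(lag_is_growing([100, 80, 120, 90]))     # lag fluctuates but recovers
```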
Topic-level Fetch Metrics: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
fetch-size-avg The average number of bytes fetched per request for a specific topic.
fetch-size-max The maximum number of bytes fetched per request for a specific topic.
bytes-consumed-rate The average number of bytes consumed per second for a specific topic.
records-per-request-avg The average number of records in each request for a specific topic.
records-consumed-rate The average number of records consumed per second for a specific topic.
Consumer Group Metrics: kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
assigned-partitions The number of partitions currently assigned to this consumer.
commit-latency-avg The average time taken for a commit request.
commit-latency-max The max time taken for a commit request.
commit-rate The number of commit calls per second.
join-rate The number of group joins per second. Group joining is the first phase of the rebalance protocol. A large value indicates that the consumer group is unstable and will likely be coupled with increased lag.
join-time-avg The average time taken for a group rejoin. This value can get as high as the configured session timeout for the consumer, but should usually be lower.
join-time-max The max time taken for a group rejoin. This value should not get much higher than the configured session timeout for the consumer.
sync-rate The number of group syncs per second. Group synchronization is the second and last phase of the rebalance protocol. Similar to join-rate, a large value indicates group instability.
sync-time-avg The average time taken for a group sync.
sync-time-max The max time taken for a group sync.
heartbeat-rate The average number of heartbeats per second. After a rebalance, the consumer sends heartbeats to the coordinator to keep itself active in the group. You can control this using the heartbeat.interval.ms setting for the consumer. You may see a lower rate than configured if the processing loop is taking more time to handle message batches. Usually this is OK as long as you see no increase in the join rate.
heartbeat-response-time-max The max time taken to receive a response to a heartbeat request.
last-heartbeat-seconds-ago The number of seconds since the last coordinator heartbeat.
Global Request Metrics: kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)
request-latency-avg The average request latency in ms.
request-latency-max The maximum request latency in ms.
request-rate The average number of requests sent per second.
response-rate The average number of responses received per second.
incoming-byte-rate The average number of incoming bytes received per second from all servers.
outgoing-byte-rate The average number of outgoing bytes sent per second to all servers.
Global Connection Metrics: kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)
connection-count The current number of active connections.
connection-creation-rate New connections established per second in the window.
connection-close-rate Connections closed per second in the window.
io-ratio The fraction of time the I/O thread spent doing I/O.
io-time-ns-avg The average length of time for I/O per select call in nanoseconds.
io-wait-ratio The fraction of time the I/O thread spent waiting.
select-rate Number of times the I/O layer checked for new I/O to perform per second.
io-wait-time-ns-avg The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
Per-Broker Metrics: kafka.consumer:type=consumer-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-size-max The maximum size of any request sent in the window for a broker.
request-size-avg The average size of all requests in the window for a broker.
request-rate The average number of requests sent per second to the broker.
response-rate The average number of responses received per second from the broker.
incoming-byte-rate The average number of bytes received per second from the broker.
outgoing-byte-rate The average number of bytes sent per second to the broker.
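The MBean names in this section are given as regular expressions, so a collector can use them to pull the client id and node id out of a concrete MBean name. The Python sketch below does this for the per-broker pattern, with the character class written as `[-.\w]`. The `parse_node_mbean` function and the sample name are hypothetical, and the `node-id` format follows the pattern quoted in this text, which may differ between Kafka versions.

```python
import re

# Per-broker consumer metrics pattern, with the dots in the domain escaped.
NODE_METRICS = re.compile(
    r"kafka\.consumer:type=consumer-node-metrics,"
    r"client-id=([-.\w]+),node-id=([0-9]+)"
)


def parse_node_mbean(name):
    """Return the client id and node id from an MBean name, or None if it does not match."""
    match = NODE_METRICS.fullmatch(name)
    if match is None:
        return None
    return {"client_id": match.group(1), "node_id": int(match.group(2))}


print(parse_node_mbean(
    "kafka.consumer:type=consumer-node-metrics,client-id=billing-app.1,node-id=2"
))
```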
Producer monitoring
Global Request Metrics: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-avg The average request latency in ms.
request-latency-max The maximum request latency in ms.
request-rate The average number of requests sent per second.
response-rate The average number of responses received per second.
incoming-byte-rate The average number of incoming bytes received per second from all servers.
outgoing-byte-rate The average number of outgoing bytes sent per second to all servers.
Global Connection Metrics: kafka.producer:type=producer-metrics,client-id=([-.\w]+)
connection-count The current number of active connections.
connection-creation-rate New connections established per second in the window.
connection-close-rate Connections closed per second in the window.
io-ratio The fraction of time the I/O thread spent doing I/O.
io-time-ns-avg The average length of time for I/O per select call in nanoseconds.
io-wait-ratio The fraction of time the I/O thread spent waiting.
select-rate Number of times the I/O layer checked for new I/O to perform per second.
io-wait-time-ns-avg The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
Per-Broker Metrics: kafka.producer:type=producer-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-size-max The maximum size of any request sent in the window for a broker.
request-size-avg The average size of all requests in the window for a broker.
request-rate The average number of requests sent per second to the broker.
response-rate The average number of responses received per second from the broker.
incoming-byte-rate The average number of bytes received per second from the broker.
outgoing-byte-rate The average number of bytes sent per second to the broker.
Per-Topic Metrics: kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
byte-rate The average number of bytes sent per second for a topic.
record-send-rate The average number of records sent per second for a topic.
compression-rate The average compression rate of record batches for a topic.
record-retry-rate The average per-second number of retried record sends for a topic.
record-error-rate The average per-second number of record sends that resulted in errors for a topic.