Overview
The previous post covered installation; this post walks through the monitoring configuration files.
Configuration reference
| Parameter | Description | Default | Defined in |
| --- | --- | --- | --- |
| scrape_interval | Interval at which metrics are scraped | 1m | prometheus.yml |
| evaluation_interval | Interval at which rules are evaluated | 1m | prometheus.yml |
| for: <duration> | How long the condition must persist before the alert fires | 0 | rule files |
| group_wait | Grouping wait time: how long to wait after the first alert of a group arrives before sending the initial notification, so that alerts in the same group go out together | 30s | alertmanager.yml |
| group_interval | Interval between notifications for the same group: after the first notification is sent, wait this long before sending a notification for newly fired alerts in that group | 5m | alertmanager.yml |
| repeat_interval | Repeat interval: how long to wait before re-sending a notification that has already been sent and for which no new alerts have arrived | 4h | alertmanager.yml |
# prometheus.yml configuration
global:
  scrape_interval: 20s
  evaluation_interval: 30s
# alert rule configuration
- alert: kafka_down
  expr: kafka_up_status == 0
  for: 1m
  annotations:
    summary: "Kafka is down"
# alertmanager.yml configuration
route:
  group_by: [alertname]
  group_wait: 60s
  group_interval: 5m
  repeat_interval: 10m
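Note that the route above also needs a receiver to form a valid alertmanager.yml. A minimal sketch of that missing piece, where the receiver name and webhook URL are placeholders:
# minimal alertmanager.yml sketch (receiver name and webhook URL are placeholders)
route:
  group_by: [alertname]
  group_wait: 60s
  group_interval: 5m
  repeat_interval: 10m
  receiver: 'default'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://127.0.0.1:8080/alert'   # placeholder webhook endpoint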
Timeline of events
10:00:05 Kafka goes down
10:00:20 The scrape picks up kafka_up_status=0
10:00:30 Rules are evaluated, the outage is detected, and kafka_down enters the pending state
10:00:30~10:01:30 Metrics keep being scraped and rules keep being evaluated
10:01:30 kafka_down has been pending for 1 minute; it switches to firing and is sent to Alertmanager
10:01:30 Alertmanager receives it and waits for the group_wait period
10:02:30 group_wait elapses and the notification is sent
10:12:30 The alert is still unresolved, so the notification is repeated (repeat_interval)
Relabeling basics
To make metrics easier to identify and to support later dashboarding and alerting, Prometheus can modify the labels of discovered targets, dynamically rewriting a target's label set before it is scraped. Each scrape configuration can define multiple relabeling steps, which are applied to every target's label set in the order they appear in the configuration file.
In addition to the per-target labels you configure, Prometheus automatically attaches several labels:
job: set to the job_name of the scrape configuration the target belongs to.
__address__: the <host>:<port> address of the target. After relabeling, the instance label defaults to the value of __address__ if it was not set during relabeling.
__scheme__: the protocol scheme of the scrape (http or https).
__metrics_path__: the URL path metrics are scraped from.
__scrape_interval__: the scrape interval for the target.
__scrape_timeout__: the scrape timeout for the target.
Additional labels prefixed with __meta_ may be available during relabeling. They are set by the service discovery mechanism that provided the target and vary between mechanisms.
Labels starting with __ are removed from the label set once target relabeling has finished.
If a relabeling step only needs to store a label value temporarily (as input for a later step), use the __tmp label name prefix; this prefix is guaranteed never to be used by Prometheus itself.
Relabeling is commonly applied in two phases (a short sketch follows this list):
relabel_configs: applied before the scrape (e.g., to rewrite meta labels, add extra labels, or keep/drop specific targets).
metric_relabel_configs: applied after the metrics have been scraped, for a final round of relabeling and filtering.
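As an illustration, a minimal sketch of both phases; the job name, target address, and the custom "host" label are made up for this example:
scrape_configs:
  - job_name: 'node'                      # hypothetical job
    static_configs:
      - targets: ['192.168.0.10:9100']    # placeholder target
    relabel_configs:
      # copy the scrape address into a custom "host" label before scraping
      - source_labels: [__address__]
        target_label: host
      # only keep targets whose address matches this subnet
      - source_labels: [__address__]
        regex: '192\.168\.0\..*'
        action: keep
    metric_relabel_configs:
      # drop the exporter's own node_scrape_collector_* series after scraping
      - source_labels: [__name__]
        regex: 'node_scrape_collector_.*'
        action: drop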
Configuring monitoring targets
Download: https://prometheus.io/download/#node_exporter
Host monitoring
wget https://github.com/prometheus/node_exporter/releases/download/v*/node_exporter-*.*-amd64.tar.gz
tar xvfz node_exporter-*.*-amd64.tar.gz
cd node_exporter-*.*-amd64
./node_exporter
Import Grafana dashboard 1860.
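To have Prometheus scrape node_exporter, a job along the following lines can be added under scrape_configs in prometheus.yml; the target address is a placeholder (node_exporter listens on :9100 by default):
- job_name: 'node'
  static_configs:
    - targets: ['192.168.0.10:9100']   # replace with the monitored host's address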
Kafka monitoring
wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.2.0/kafka_exporter-1.2.0.linux-amd64.tar.gz
tar -xvf kafka_exporter-1.2.0.linux-amd64.tar.gz
mv kafka_exporter-1.2.0.linux-amd64 /data/kafka_exporter
cd /data/kafka_exporter
nohup ./kafka_exporter --kafka.server=<kafka-host-or-domain>:9092 &
time="2022-11-09T15:17:56+08:00" level=info msg="Starting kafka_exporter (version=1.2.0, branch=HEAD, revision=830660212e6c109e69dcb1cb58f5159fe3b38903)" source="kafka_exporter.go:474"
time="2022-11-09T15:17:56+08:00" level=info msg="Build context (go=go1.10.3, user=root@981cde178ac4, date=20180707-14:34:48)" source="kafka_exporter.go:475"
time="2022-11-09T15:17:56+08:00" level=info msg="Done Init Clients" source="kafka_exporter.go:213"
time="2022-11-09T15:17:56+08:00" level=info msg="Listening on :9308" source="kafka_exporter.go:499"
Import Grafana dashboard 7589.
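A corresponding scrape job can then be added to prometheus.yml; the host below is a placeholder, while port 9308 matches the exporter's listening address shown in the log above:
- job_name: 'kafka'
  static_configs:
    - targets: ['192.168.0.10:9308']   # host running kafka_exporter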
Redis monitoring
wget https://github.com/oliver006/redis_exporter/releases/download/v1.3.2/redis_exporter-v1.3.2.linux-amd64.tar.gz
tar -xvf redis_exporter-v1.3.2.linux-amd64.tar.gz
mv redis_exporter-v1.3.2.linux-amd64 /data/redis_exporter
cd /data/redis_exporter
nohup ./redis_exporter -redis.addr 192.168.0.11:7001 -redis.password Redis@2022 &   # use the Redis port, not the Sentinel port
time="2022-11-09T14:39:10+08:00" level=info msg="Redis Metrics Exporter v1.3.2 build date: 2019-11-06-02:25:20 sha1: 175a69f33e8267e0a0ba47caab488db5e83a592e Go: go1.13.4 GOOS: linux GOARCH: amd64"
time="2022-11-09T14:39:10+08:00" level=info msg="Providing metrics at :9121/metrics"
Modify the Prometheus configuration file prometheus.yml:
- job_name: redis
  static_configs:
    - targets: ['172.26.42.229:9121']
      labels:
        instance: redis120
Redis cluster monitoring
- job_name: 'redis_exporter_targets'
  static_configs:
    - targets:
      - redis://192.168.0.11:7001
      - redis://192.168.0.12:7001
      - redis://192.168.0.13:7001
  metrics_path: /scrape
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.0.11:9121
- job_name: 'redis'
  metrics_path: /metrics
  static_configs:
    - targets: ['192.168.0.11:9121']
Import Grafana dashboard 11835.
MySQL monitoring
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz
tar xvf mysqld_exporter-0.14.0.linux-amd64.tar.gz
mv mysqld_exporter-0.14.0.linux-amd64 /data/mysqld_exporter
vim /data/mysqld_exporter/.my.cnf
[client]
user=mysqld_exporter
password=arms_prometheus2022
# must match the account created for the exporter (see the GRANT statements below)
host=192.168.xx.xx
port=3306
nohup ./mysqld_exporter --config.my-cnf=/data/mysqld_exporter/.my.cnf &
ts=2022-11-09T07:25:16.492Z caller=mysqld_exporter.go:303 level=info msg="Listening on address" address=:9104
ts=2022-11-09T07:25:16.492Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
To collect MySQL metrics, mysqld_exporter needs a database account with the appropriate privileges; this section describes how to set them up.
Create a user for mysqld_exporter in MySQL (choose your own password), then run the following statements to grant the required privileges, including read access to the performance_schema.* tables.
mysql> GRANT REPLICATION CLIENT, PROCESS ON *.* TO
'mysqld_exporter'@'localhost' identified by 'arms_prometheus2022';
mysql> GRANT SELECT ON performance_schema.* TO 'mysqld_exporter'@'localhost';
mysql> FLUSH PRIVILEGES;
Note: mysqld_exporter and arms_prometheus2022 are a custom user name and password; replace them with your own values.
Import Grafana dashboard 7362.
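A scrape job for the exporter can then be added to prometheus.yml; the host below is a placeholder, while port 9104 matches the listening address shown in the mysqld_exporter log above:
- job_name: 'mysql'
  static_configs:
    - targets: ['192.168.0.10:9104']   # host running mysqld_exporter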
Configuring alert rules
Host alert rules
Modify the Prometheus configuration file prometheus.yml and add the following:
rule_files:
- /etc/prometheus/rules/*.rules
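Firing rules only turn into notifications if prometheus.yml also points at Alertmanager. A minimal sketch, assuming Alertmanager runs locally on its default port 9093:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']   # placeholder Alertmanager address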
Hot-reloading configuration
Day-to-day Prometheus maintenance inevitably involves editing prometheus.yml, and restarting the Prometheus service is the usual way to load the new configuration.
The configuration in prometheus.yml can also be reloaded dynamically. There are generally two ways to hot-reload:
1. Find the Prometheus process ID and send the process a SIGHUP signal:
kill -HUP pid
2. Send an HTTP POST request to the /-/reload endpoint:
curl -X POST http://localhost:9090/-/reload
The second method only works if Prometheus was started with --web.enable-lifecycle; add the flag to the Prometheus service unit described earlier and reload systemd:
systemctl daemon-reload
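For reference, a sketch of what the relevant part of the systemd unit might look like; the paths below are hypothetical and should match your own installation:
# /etc/systemd/system/prometheus.service (excerpt, hypothetical paths)
[Service]
ExecStart=/data/prometheus/prometheus \
  --config.file=/data/prometheus/prometheus.yml \
  --web.enable-lifecycle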
Create an alert rule file named hoststats-alert.rules in /etc/prometheus/rules/ with the following content:
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
After restarting Prometheus, open the Prometheus UI at http://127.0.0.1:9090/rules to see the rule files that are currently loaded.
[root@grafana rules]# cat node_exporter_rules.yml
# 服务器资源告警策略
groups:
- name: 服务器资源监控
rules:
- alert: 内存使用率过高
expr: 100 - (node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes*100 > 90
for: 5m # the condition must hold for this long before the alert is sent to Alertmanager
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 内存使用率过高,请尽快处理!"
description: "{{ $labels.instance }}内存使用率超过90%,当前使用率{{ $value }}%."
- alert: 服务器宕机
expr: up == 0
for: 3m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 服务器宕机,请尽快处理!"
description: "{{$labels.instance}} 服务器延时超过3分钟,当前状态{{ $value }}. "
- alert: CPU高负荷
expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} CPU使用率过高,请尽快处理!"
description: "{{$labels.instance}} CPU使用大于90%,当前使用率{{ $value }}%. "
- alert: 磁盘IO性能
expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance,job)* 100 > 90
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 流入磁盘IO使用率过高,请尽快处理!"
description: "{{$labels.instance}} 流入磁盘IO大于90%,当前使用率{{ $value }}%."
- alert: 网络流入
expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance,job)) / 100) > 102400
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 流入网络带宽过高,请尽快处理!"
description: "{{$labels.instance}} 流入网络带宽持续5分钟高于100M. RX带宽使用量{{$value}}."
- alert: 网络流出
expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance,job)) / 100) > 102400
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} 流出网络带宽过高,请尽快处理!"
description: "{{$labels.instance}} 流出网络带宽持续5分钟高于100M. RX带宽使用量{$value}}."
- alert: TCP连接数
expr: node_netstat_Tcp_CurrEstab > 10000
for: 2m
labels:
severity: 严重告警
annotations:
summary: " TCP_ESTABLISHED过高!"
description: "{{$labels.instance}} TCP_ESTABLISHED大于100%,当前使用率{{ $value }}%."
- alert: 磁盘容量
expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 90
for: 1m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.mountpoint}} 磁盘分区使用率过高,请尽快处理!"
description: "{{$labels.instance}} 磁盘分区使用大于90%,当前使用率{{ $value }}%."
MySQL alert rules
groups:
- name: MySQLStatsAlert
rules:
- alert: MySQL is down
expr: mysql_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} MySQL is down"
description: "MySQL database is down. This requires immediate action!"
- alert: open files high
expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} open files high"
description: "Open files is high. Please consider increasing open_files_limit."
- alert: Read buffer size is bigger than max. allowed packet size
expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet).This can break your replication."
- alert: Sort buffer possibly misconfigured
expr: mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} Sort buffer possibly missconfigured"
description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
- alert: Thread stack size is too small
expr: mysql_global_variables_thread_stack <196608
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} Thread stack size is too small"
description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical is 256k for thread_stack_size."
- alert: Used more than 80% of max connections limited
expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
description: "Used more than 80% of max connections limited"
- alert: InnoDB Force Recovery is enabled
expr: mysql_global_variables_innodb_force_recovery != 0
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
- alert: InnoDB Log File size is too small
expr: mysql_global_variables_innodb_log_file_size < 16777216
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
- alert: InnoDB Flush Log at Transaction Commit
expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
description: "InnoDB Flush Log at Transaction Commit is set to a values != 1. This can lead to a loss of commited transactions in case of a power failure."
- alert: Table definition cache too small
expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Table definition cache too small"
description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
- alert: Table open cache too small
expr: mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Table open cache too small"
description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
- alert: Thread stack size is possibly too small
expr: mysql_global_variables_thread_stack < 262144
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical is 256k for thread_stack_size."
- alert: InnoDB Buffer Pool Instances is too small
expr: mysql_global_variables_innodb_buffer_pool_instances == 1
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
- alert: InnoDB Plugin is enabled
expr: mysql_global_variables_ignore_builtin_innodb == 1
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
description: "InnoDB Plugin is enabled"
- alert: Binary Log is disabled
expr: mysql_global_variables_log_bin != 1
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} Binary Log is disabled"
description: "Binary Log is disabled. This prohibits you to do Point in Time Recovery (PiTR)."
- alert: Binlog Cache size too small
expr: mysql_global_variables_binlog_cache_size < 1048576
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
description: "Binlog Cache size is possibly to small. A value of 1 Mbyte or higher is OK."
- alert: Binlog Statement Cache size too small
expr: mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
description: "Binlog Statement Cache size is possibly to small. A value of 1 Mbyte or higher is typically OK."
- alert: Binlog Transaction Cache size too small
expr: mysql_global_variables_binlog_cache_size <1048576
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
description: "Binlog Transaction Cache size is possibly to small. A value of 1 Mbyte or higher is typically OK."
- alert: Sync Binlog is enabled
expr: mysql_global_variables_sync_binlog == 1
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
description: "Sync Binlog is enabled. This leads to higher data security but on the cost of write performance."
- alert: IO thread stopped
expr: mysql_slave_status_slave_io_running != 1
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} IO thread stopped"
description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
- alert: SQL thread stopped
expr: mysql_slave_status_slave_sql_running == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} SQL thread stopped"
description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
- alert: Slave lagging behind Master
expr: rate(mysql_slave_status_seconds_behind_master[1m]) >30
for: 1m
labels:
severity: warning
annotations:
summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
- alert: Slave is NOT read only(Please ignore this warning indicator.)
expr: mysql_global_variables_read_only != 0
for: 1m
labels:
severity: page
annotations:
summary: "Instance {{ $labels.instance }} Slave is NOT read only"
description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
Save the file and hot-reload Prometheus:
curl -XPOST localhost:9090/-/reload
Tuning
# Binlog Cache size too small: check the binlog cache status with: show global status like 'bin%';
set global binlog_cache_size = 1048576; (takes effect immediately, lost after restart)
# Table open cache too small: check the number of open tables with: show global status like 'open_tables';
# show global variables like 'table_open_cache';
set global table_open_cache = <number of open tables * 1.2>; (takes effect immediately, lost after restart)
# IO thread has stopped
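To keep the tuned values across restarts, the same settings can also be written to the MySQL configuration file. A sketch with illustrative numbers only, which should be sized from your own status output:
# /etc/my.cnf (excerpt) - hypothetical values, adjust to your workload
[mysqld]
binlog_cache_size = 1M
table_open_cache = 4096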
Redis service alert rules
[root@grafana rules]# cat redis_exporter_rules.yml
# Redis服务监控
groups:
- name: Redis-监控告警
rules:
- alert: 警报!Redis应用不可用
expr: redis_up == 0
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis应用不可用"
description: "Redis应用不可达\n 当前值 = {{ $value }}"
- alert: 警报!丢失Master节点
expr: (count(redis_instance_info{role="master"}) ) < 1
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 丢失Redis master"
description: "Redis集群当前没有主节点\n 当前值 = {{ $value }}"
- alert: 警报!脑裂,主节点太多
expr: count(redis_instance_info{role="master"}) > 1
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis脑裂,主节点太多"
description: "{{ $labels.instance }} 主节点太多\n 当前值 = {{ $value }}"
- alert: 警报!Slave连接不可达
expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis丢失slave节点"
description: "Redis slave不可达.请确认主从同步状态\n 当前值 = {{ $value }}"
- alert: 警报!Redis副本不一致
expr: delta(redis_connected_slaves[1m]) < 0
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis 副本不一致"
description: "Redis集群丢失一个slave节点\n 当前值 = {{ $value }}"
- alert: 警报!Redis集群抖动
expr: changes(redis_connected_slaves[1m]) > 1
for: 2m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis集群抖动"
description: "Redis集群抖动,请检查.\n 当前值 = {{ $value }}"
- alert: 警报!持久化失败
expr: (time() - redis_rdb_last_save_timestamp_seconds) / 3600 > 24
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} Redis持久化失败"
description: "Redis持久化失败(>24小时)\n 当前值 = {{ printf \"%.1f\" $value }}小时"
- alert: 警报!内存不足
expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }}系统内存不足"
description: "Redis占用系统内存(> 90%)\n 当前值 = {{ printf \"%.2f\" $value }}%"
- alert: 警报!Maxmemory不足
expr: redis_config_maxmemory !=0 and redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }} Maxmemory设置太小"
description: "超出设置最大内存(> 80%)\n 当前值 = {{ printf \"%.2f\" $value }}%"
- alert: 警报!连接数太多
expr: redis_connected_clients > 200
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }} 实时连接数太多"
description: "连接数太多(>200)\n 当前值 = {{ $value }}"
- alert: 警报!连接数太少
expr: redis_connected_clients < 1
for: 2m
labels:
severity: 一般告警
annotations:
summary: "{{ $labels.instance }} 实时连接数太少"
description: "连接数(<1)\n 当前值 = {{ $value }}"
- alert: 警报!拒绝连接数
expr: increase(redis_rejected_connections_total[1m]) > 0
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 拒绝连接"
description: "Redis有拒绝连接,请检查连接数配置\n 当前值 = {{ printf \"%.0f\" $value }}"
- alert: 警报!执行命令数大于1000
expr: rate(redis_commands_processed_total[1m]) > 1000
for: 0m
labels:
severity: 严重告警
annotations:
summary: "{{ $labels.instance }} 执行命令次数太多"
description: "Redis执行命令次数太多\n 当前值 = {{ printf \"%.0f\" $value }}"
Remediation
# To mitigate split-brain, add the following settings to redis.conf
min-slaves-to-write 1
min-slaves-max-lag 10
redis-server redis.conf
redis-sentinel sentinel.conf # sentinel mode
RabbitMQ service alert rules
[root@grafana rules]# cat rabbitmq_exporter_rules.yml
# RabbitMQ服务监控
groups:
- name: RabbitMQ服务监控
rules:
- alert: RabbitMQ服务停止
expr: rabbitmq_up ==0
for: 3m
labels:
severity: 严重告警
annotations:
description: "{{$labels.instance}}RabbitMQ服务已停止,当前状态{{ $value }}"
summary: "RabbitMQ服务已停止3分钟,请尽快处理!"
- alert: RabbitMQ内存使用大于2G
expr: rabbitmq_node_mem_used/1024/1024 > 2048
for: 3m
labels:
severity: 严重告警
annotations:
description: "{{ $labels.instance }} RabbitMQ内存使占用过高 !"
value: '{{ $value }} MB'
summary: "RabbitMQ内存使占用大于2G"
Kafka cluster alert rules
[root@grafana rules]# cat kafka_exporter_rules.yml
# kafka集群服务监控
groups:
- name: kafka服务监控
rules:
- alert: kafka消费滞后
expr: sum(kafka_consumergroup_lag{topic!="sop_free_study_fix-student_wechat_detail"}) by (consumergroup, topic, job) > 50000
for: 3m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} kafka消费滞后({{$.Labels.consumergroup}})"
description: "{{$.Labels.topic}}消费滞后超过5万持续3分钟(当前{{$value}})"
- alert: kafka集群节点减少
expr: kafka_brokers < 3 #kafka集群节点数3
for: 3m
labels:
severity: 严重告警
annotations:
summary: "kafka集群部分节点已停止,请尽快处理!"
description: "{{$labels.instance}} kafka集群节点减少"
- alert: emqx_rule_to_kafka最近五分钟内的每秒平均变化率为0
expr: sum(rate(kafka_topic_partition_current_offset{topic="emqx_rule_to_kafka"}[5m])) by ( instance,topic,job) ==0
for: 5m
labels:
severity: 严重告警
annotations:
summary: "{{$labels.instance}} emqx_rule_to_kafka未接收到消息"
description: "{{$.Labels.topic}}emqx_rule_to_kafka持续5分钟未接收到消息(当前{{$value}})"
Domain SSL certificate expiry alert rules
[root@grafana rules]# cat ssl_expiry.yml
groups:
- name: SSL证书监测
rules:
- alert: 证书还有30天过期
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 5m
labels:
severity: 重要告警
annotations:
summary: "SSL证书即将过期 (instance {{ $labels.instance }})"
description: "SSL证书即将30天内过期 VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: 证书已过期
expr: probe_ssl_earliest_cert_expiry - time() <= 0
for: 5m
labels:
severity: 严重告警
annotations:
summary: "SSL证书已经过期 (instance {{ $labels.instance }})"
description: "SSL证书已经过期\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Elasticsearch cluster alert rules
[root@grafana rules]# cat elasticsearch_exporter_rules.yml
groups:
- name: ElasticSearch服务监控
rules:
- alert: ES集群节点减少
expr: elasticsearch_cluster_health_number_of_nodes < 3 #ES集群节点数3
for: 5m
labels:
severity: 严重告警
annotations:
summary: "ES集群节点减少:{{$.Labels.job}}"
description: "ES集群节点数减少:{{$.Labels.job}},(当前:{{$value}})"
- alert: jvm内存使用率告警
expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}*100 > 90
for: 5m
labels:
severity: 严重告警
annotations:
summary: "jvm内存使用率过高:{{$.Labels.job}}"
description: "jvm内存使用率过高:{{$.Labels.job}}大于90%,(当前:{{$value}})"