1. Email alerts
(1). Modify the Alertmanager configuration file
[root@docker-3 alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'jumpservervip@163.com'
  smtp_auth_username: 'jumpservervip@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'jumpservervip@126.com'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
Check the configuration
[root@docker-3 alertmanager]# ./amtool check-config alertmanager.yml
[root@docker-3 alertmanager]# systemctl start alertmanager
(2). Modify the Prometheus configuration file
[root@docker-3 alertmanager]# vim /usr/local/prometheus/prometheus.yml
//1. Edit the alerting section of prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.16.0.9:9093
//2. Specify the alert rule files:
rule_files:
  - rules/*.yml
2. Write alert rules
[root@docker-3 alertmanager]# cd /usr/local/prometheus
[root@docker-3 alertmanager]# mkdir rules
[root@docker-3 alertmanager]# cd rules/
[root@docker-3 rules]# cat host_monitor.yml
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up == 0
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{$labels.instance}} has been down for more than 15 seconds"
//Check the configuration file
[root@docker-3 alertmanager]# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
//Restart Prometheus
[root@docker-3 alertmanager]# systemctl restart prometheus
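- alert: the name of the alert rule.
- expr: the PromQL expression that defines the trigger condition; Prometheus evaluates it to see whether any time series satisfies it.
- for: optional evaluation wait time. The alert is sent only after the trigger condition has held continuously for this long. While waiting, a newly triggered alert is in the pending state.
- labels: custom labels, an additional set of labels to attach to the alert.
- annotations: an additional set of informational fields, such as text describing the alert in detail. The annotations are sent to Alertmanager together with the alert when it fires.
- summary gives a short overview of the alert; description gives the details.
- The Alertmanager UI also displays alert information based on these two annotation values.
Status description: a Prometheus alert is in one of three states: Inactive, Pending, Firing.
- 1. Inactive: the rule is being evaluated, but its condition is not met, so no alert has been triggered.
- 2. Pending: the condition is met but has not yet held for the duration set in for. If it stays true for that whole time, the alert moves to Firing.
- 3. Firing: the condition has held for the full for duration. Prometheus sends the alert to Alertmanager, which applies grouping, inhibition, and silencing, and then notifies the configured receivers. Once the condition clears, the state returns to Inactive, and the cycle repeats.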
[root@docker-3 rules]# systemctl stop node_exporter ## stop node_exporter to simulate an outage; the rule detects the failure and raises an alert
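The Inactive, Pending, Firing cycle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Prometheus code: the AlertState class, its field names, and the timeline values are all made up for the example, and the 15-second hold time mirrors the for: 15s in the rule above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert instance across rule evaluations (illustrative only)."""
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true
    state: str = "inactive"

    def evaluate(self, condition_true: bool, now: float) -> str:
        if not condition_true:
            # Condition cleared: forget the start time and go back to inactive.
            self.active_since = None
            self.state = "inactive"
        else:
            if self.active_since is None:
                self.active_since = now
            held = now - self.active_since
            # Fire only once the condition has held for the whole `for` window.
            self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


alert = AlertState(for_seconds=15)
# (timestamp in seconds, result of `up == 0`), hypothetical evaluation results
timeline = [(0, False), (5, True), (10, True), (20, True), (25, False)]
states = [alert.evaluate(cond, t) for t, cond in timeline]
print(states)
```

With this timeline the condition first becomes true at t=5, so the alert stays pending until t=20, when it has held for 15 seconds, and drops back to inactive once the target is up again.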
3. Improve the alert template
(1). Create a template file
[root@docker-3 rules]# cat /usr/local/alertmanager/email.tmpl
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt }} <br>
=========end==========<br>
{{ end }}
{{ end }}
(2). Update the configuration file to use the template
[root@docker-3 rules]# cat /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'jumpservervip@163.com'
  smtp_auth_username: 'jumpservervip@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - '/usr/local/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'jumpservervip@126.com'
    html: '{{ template "email.to.html" . }}' ## send the message using the template
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
(3). Alert recovery
When configuring the receiver, add: send_resolved: true
1. Update the template to add recovery information
[root@docker-3 rules]# cat /usr/local/alertmanager/email.tmpl
{{ define "email.to.html" }}
{{ if gt (len .Alerts.Firing) 0 }}{{ range .Alerts.Firing }}
@Alert <br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt }} <br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}{{ range .Alerts.Resolved }}
@Resolved: <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Resolved at: {{ .EndsAt }} <br>
{{ end }}
{{ end }}
{{ end }}
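The intent of the template, one section for firing alerts and one for resolved alerts, can be sketched in Python as follows. This is a rough illustration under assumptions, not Alertmanager's rendering code: the render function, the dictionary keys, and the sample hosts are all hypothetical. The point it shows is that each section should loop only over the alerts in that state.

```python
from typing import Dict, List

Alert = Dict[str, str]


def render(alerts: List[Alert]) -> str:
    """Render firing alerts first, then resolved ones, one line per alert."""
    firing = [a for a in alerts if a["status"] == "firing"]
    resolved = [a for a in alerts if a["status"] == "resolved"]
    # Each section iterates only over its own subset of alerts.
    lines = [f"@Alert {a['instance']} since {a['starts_at']}" for a in firing]
    lines += [f"@Resolved {a['instance']} at {a['ends_at']}" for a in resolved]
    return "\n".join(lines)


# Hypothetical sample data, not taken from a real notification.
sample = [
    {"status": "firing", "instance": "host-a:9100",
     "starts_at": "10:00:00", "ends_at": ""},
    {"status": "resolved", "instance": "host-b:9100",
     "starts_at": "09:00:00", "ends_at": "09:30:00"},
]
message = render(sample)
print(message)
```

If both sections looped over the full alert list instead, host-a would also show up under the resolved heading and host-b under the alert heading.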
4. WeCom (Enterprise WeChat) alerts
(1). Test WeCom
Test that the account works
https://work.weixin.qq.com/api/devtools/devtool.php
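corp_id: the unique ID of the WeCom account; it can be found under My Company.
to_party: the group (department) to send to.
agent_id: the ID of the third-party enterprise application
api_secret: the secret of the third-party enterprise application
(2). Update the template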
[root@docker-3 alertmanager]# cat /usr/local/alertmanager/wechat.tmpl
{{ define "wechat.tmpl" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
@Alert
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
@Resolved
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Time: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
(3). Update the configuration
[root@docker-3 alertmanager]# cat /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
templates:
  - '/usr/local/alertmanager/wechat.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat'
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'wwf4ee8ede83b63a1a'
    to_party: '1'
    agent_id: '1000003'
    api_secret: 'LbVzYRczEJMY2rq0c8I8ZjASPfCtzvl3f7zfiuyVKSc'
    send_resolved: true
    message: '{{ template "wechat.tmpl" . }}'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
5. Alert labels, routing, and grouping
(1). Labels
Add labels to every monitored item
/usr/local/prometheus/rules/mysql.yml
For example, the label below is defined as
labels:
  severity: warning
(2). Routing
routes:
- match:
    severity: critical
  receiver: 'leader'
  continue: true
- match_re:
    severity: ^(warning|critical)$
  receiver: 'devops'
  continue: true
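To make the effect of continue: true concrete, here is a small Python sketch of how the two child routes above could be walked for one alert. It is a simplification under assumptions, not Alertmanager's implementation: the ROUTES list, the receivers_for function, and the default receiver name 'email' are illustrative, and only exact matches (match) and fully anchored regex matches (match_re) are modelled.

```python
import re
from typing import Dict, List

# Mirrors the two child routes shown above (illustrative data structure).
ROUTES = [
    {"match": {"severity": "critical"},
     "receiver": "leader", "continue": True},
    {"match_re": {"severity": r"^(warning|critical)$"},
     "receiver": "devops", "continue": True},
]
DEFAULT_RECEIVER = "email"  # assumed receiver of the parent route


def receivers_for(labels: Dict[str, str]) -> List[str]:
    """Return the receivers an alert with these labels would be sent to."""
    matched = []
    for route in ROUTES:
        exact_ok = all(labels.get(k) == v
                       for k, v in route.get("match", {}).items())
        regex_ok = all(re.fullmatch(p, labels.get(k, "")) is not None
                       for k, p in route.get("match_re", {}).items())
        if exact_ok and regex_ok:
            matched.append(route["receiver"])
            # Without continue, the first matching route ends the search.
            if not route.get("continue", False):
                break
    # If no child route matches, the parent route's receiver handles the alert.
    return matched or [DEFAULT_RECEIVER]


critical = receivers_for({"alertname": "node-up", "severity": "critical"})
warning = receivers_for({"alertname": "node-up", "severity": "warning"})
other = receivers_for({"alertname": "node-up", "severity": "1"})
print(critical, warning, other)
```

In this sketch a critical alert reaches both leader and devops because the first route sets continue: true; a warning alert reaches only devops, and an alert with any other severity falls back to the parent receiver.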
(3). Grouping
route:
  group_by: [severity]
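As a rough illustration of what group_by: [severity] does, the Python sketch below buckets alerts by the values of the listed labels, so that each bucket would go out as one notification. The group_alerts function and the sample alert names are hypothetical; real Alertmanager grouping also involves the group_wait and group_interval timers, which are left out here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_alerts(alerts: List[Dict[str, str]],
                 group_by: List[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Bucket alert names by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with the same values for all group_by labels share a group.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


# Hypothetical alerts used only for this example.
alerts = [
    {"alertname": "node-up", "severity": "critical", "instance": "host-a"},
    {"alertname": "disk-usage", "severity": "warning", "instance": "host-a"},
    {"alertname": "mysql-down", "severity": "critical", "instance": "host-b"},
]
grouped = group_alerts(alerts, ["severity"])
print(grouped)
```

Here the two critical alerts end up in one group even though they come from different hosts and rules, which is the trade-off of grouping by severity alone.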
(4). Configuration
[root@docker-3 alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 10s
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'jumpservervip@163.com'
  smtp_auth_username: 'jumpservervip@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - '/usr/local/alertmanager/*.tmpl'
route:
  group_by: [severity]
  group_wait: 10s
  group_interval: 3m
  repeat_interval: 3m
  receiver: 'email'
  routes:
  - match:
      severity: critical
    receiver: 'leader'
    continue: true
  - match_re:
      severity: ^(warning|critical)$
    receiver: 'devops'
    continue: true
receivers:
- name: 'email'
  email_configs:
  - to: 'jumpservervip@126.com'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
- name: 'leader'
  email_configs:
  - to: 'jumpservervip@163.com'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
- name: 'devops'
  wechat_configs:
  - corp_id: 'wwf4ee8ede83b63a1a'
    to_party: '1'
    agent_id: '1000003'
    api_secret: 'LbVzYRczEJMY2rq0c8I8ZjASPfCtzvl3f7zfiuyVKSc'
    send_resolved: true
    message: '{{ template "wechat.tmpl" . }}'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']
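The inhibit_rules entry at the end of the configuration can be read as: a warning alert is muted while a critical alert with the same alertname and instance is firing. The Python sketch below illustrates that reading under assumptions; the RULE dictionary copies the rule above, while the is_inhibited function and the sample alerts are hypothetical and much simpler than Alertmanager's own logic.

```python
from typing import Dict, List

# Copy of the inhibit rule from the configuration above.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}


def is_inhibited(target: Dict[str, str],
                 firing: List[Dict[str, str]],
                 rule: Dict = RULE) -> bool:
    """Return True if some firing source alert mutes the target alert."""
    # The rule only applies to alerts that match target_match.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in firing:
        if source is target:
            continue
        is_source = all(source.get(k) == v
                        for k, v in rule["source_match"].items())
        # A missing label and an empty label are treated as equal here.
        same = all(source.get(k, "") == target.get(k, "")
                   for k in rule["equal"])
        if is_source and same:
            return True
    return False


# Hypothetical alerts used only for this example.
crit = {"alertname": "node-up", "instance": "host-a", "severity": "critical"}
warn_same = {"alertname": "node-up", "instance": "host-a", "severity": "warning"}
warn_other = {"alertname": "node-up", "instance": "host-b", "severity": "warning"}
firing = [crit, warn_same, warn_other]
print(is_inhibited(warn_same, firing), is_inhibited(warn_other, firing))
```

Only the warning on host-a is muted, because it shares both alertname and instance with the firing critical alert; the warning on host-b is still delivered.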