1. Email Alerts

(1). Edit the Alertmanager configuration file
[root@docker-3 alertmanager]# cat alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'jumpservervip@163.com'
  smtp_auth_username: 'jumpservervip@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'jumpservervip@126.com'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']

Check the configuration:

[root@docker-3 alertmanager]# ./amtool check-config alertmanager.yml

[root@docker-3 alertmanager]# systemctl start alertmanager

(2). Edit the Prometheus configuration file
[root@docker-3 alertmanager]# vim /usr/local/prometheus/prometheus.yml

// 1. Edit the alerting section of prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.16.0.9:9093

// 2. Define the rule files:
rule_files:
  - rules/*.yml

2. Writing Alert Rules

[root@docker-3 alertmanager]# cd /usr/local/prometheus
[root@docker-3 alertmanager]# mkdir rules
[root@docker-3 alertmanager]# cd rules/
[root@docker-3 rules]# cat host_monitor.yml
groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up == 0
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }}: instance has been down for more than 15 seconds"
// Check the configuration file
[root@docker-3 alertmanager]# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
// Restart Prometheus
[root@docker-3 alertmanager]# systemctl restart prometheus
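Instead of a full restart, Prometheus can also reload its configuration and rule files in place. A sketch of the two common options (the HTTP endpoint only works if Prometheus was started with `--web.enable-lifecycle`; host and port are assumptions matching the setup above):

```shell
# Validate the new rule file on its own before reloading
/usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/host_monitor.yml

# Option 1: send SIGHUP to the running prometheus process
kill -HUP "$(pidof prometheus)"

# Option 2: HTTP reload endpoint (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```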
  • alert: the name of the alert rule.
  • expr: the trigger condition, a PromQL expression evaluated to determine whether
    any time series satisfies the condition.
  • for: the evaluation wait time, optional. The alert fires only after the condition
    has held for this duration; alerts generated during the wait are in the pending state.
  • labels: custom labels, letting you attach an additional set of labels to the alert.
  • annotations: a set of additional information, such as text describing the alert in
    detail; annotations are sent to Alertmanager together with the alert when it fires.
  • summary describes the alert briefly; description carries the detailed information.
  • The Alertmanager UI also displays alert information based on these two values.


    State description: a Prometheus alert is in one of three states: Inactive, Pending, Firing.
  • 1. Inactive: the rule is being evaluated, but no alert has triggered yet.
  • 2. Pending: the condition is met but the `for` duration has not yet elapsed. Since
    alerts can be grouped, inhibited, or silenced, the alert waits for validation and
    transitions to Firing once all checks pass.
  • 3. Firing: the alert is sent to Alertmanager, which dispatches it to all configured
    receivers. Once the alert clears, the state returns to Inactive, and the cycle repeats.
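The current state of each alert can be inspected through the Prometheus HTTP API. A sketch, assuming Prometheus listens on localhost:9090 (`jq` is optional and only used for readability):

```shell
# List pending/firing alerts and their state (inactive alerts are not returned)
curl -s http://localhost:9090/api/v1/alerts \
  | jq '.data.alerts[] | {alertname: .labels.alertname, state: .state}'
```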

[root@docker-3 rules]# systemctl stop node_exporter ## stop node_exporter to simulate an outage; the failure should be detected and an alert fired
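Once the alert reaches Alertmanager, it can be listed from the command line with amtool. A sketch, assuming Alertmanager listens on 172.16.0.9:9093 as configured above:

```shell
# Query the alerts currently held by Alertmanager
./amtool alert query --alertmanager.url=http://172.16.0.9:9093
```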

3. Improving the Alert Template

(1). Create a template file
[root@docker-3 rules]# cat /usr/local/alertmanager/email.tmpl
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ .StartsAt }} <br>
=========end==========<br>
{{ end }}
{{ end }}
(2). Edit the configuration to use the template
[root@docker-3 rules]# cat /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'jumpservervip@163.com'
  smtp_auth_username: 'jumpservervip@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - '/usr/local/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'jumpservervip@126.com'
    html: '{{ template "email.to.html" . }}' ## send using the template
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']

(3). Alert resolution notifications

Set send_resolved: true in the receiver configuration.
1. Modify the template to add resolution information

[root@docker-3 rules]# cat /usr/local/alertmanager/email.tmpl
{{ define "email.to.html" }}
{{ if gt (len .Alerts.Firing) 0 }}{{ range .Alerts.Firing }}
@Alert
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ .StartsAt }} <br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}{{ range .Alerts.Resolved }}
@Resolved:
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Resolved at: {{ .EndsAt }} <br>
{{ end }}
{{ end }}
{{ end }}

4. WeChat Work Alerts

(1). Test WeChat Work

Test account availability:

https://work.weixin.qq.com/api/devtools/devtool.php


corp_id: the unique ID of the WeChat Work account; visible under "My Company".

to_party: the department (group) to send to.

agent_id: the ID of the enterprise application.

api_secret: the secret of the enterprise application.
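The credentials can also be checked from the command line against the WeChat Work API: a valid corp_id/api_secret pair returns an access_token. A sketch with placeholder values (CORP_ID and API_SECRET are not real credentials):

```shell
# A successful response contains "errcode":0 and an "access_token" field
curl -s "https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=CORP_ID&corpsecret=API_SECRET"
```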

(2). Edit the template
[root@docker-3 alertmanager]# cat /usr/local/alertmanager/wechat.tmpl
{{ define "wechat.tmpl" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
@Alert
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
@Resolved
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{ end }}{{ end -}}
{{- end }}
(3). Edit the configuration
[root@docker-3 alertmanager]# cat /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
templates:
  - '/usr/local/alertmanager/wechat.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat'

receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'wwf4ee8ede83b63a1a'
    to_party: '1'
    agent_id: '1000003'
    api_secret: 'LbVzYRczEJMY2rq0c8I8ZjASPfCtzvl3f7zfiuyVKSc'
    send_resolved: true
    message: '{{ template "wechat.tmpl" . }}'

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']


5. Alert Labels, Routing, and Grouping

(1). Labels

Add a label to every alert rule, e.g. in
/usr/local/prometheus/rules/mysql.yml
the label is defined as:

labels:
  severity: warning

(2). Routing
routes:
- match:
    severity: critical
  receiver: 'leader'
  continue: true
- match_re:
    severity: ^(warning|critical)$
  receiver: 'devops'
  continue: true
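Routing decisions can be checked without firing real alerts, using amtool's route-testing subcommands. A sketch, assuming a reasonably recent amtool and the config path used throughout this article:

```shell
# Print the routing tree of the configuration
./amtool config routes show --config.file=alertmanager.yml

# Show which receiver(s) a given label set would be routed to
./amtool config routes test --config.file=alertmanager.yml severity=critical
```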
(3). Grouping
route:
  group_by: [severity]
(4). Full configuration
[root@docker-3 alertmanager]# cat alertmanager.yml

global:
  resolve_timeout: 10s
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'jumpservervip@163.com'
  smtp_auth_username: 'jumpservervip@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - '/usr/local/alertmanager/*.tmpl'

route:
  group_by: [severity]
  group_wait: 10s
  group_interval: 3m
  repeat_interval: 3m
  receiver: 'email'
  routes:
  - match:
      severity: critical
    receiver: 'leader'
    continue: true
  - match_re:
      severity: ^(warning|critical)$
    receiver: 'devops'
    continue: true

receivers:
- name: 'email'
  email_configs:
  - to: 'jumpservervip@126.com'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
- name: 'leader'
  email_configs:
  - to: 'jumpservervip@163.com'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
- name: 'devops'
  wechat_configs:
  - corp_id: 'wwf4ee8ede83b63a1a'
    to_party: '1'
    agent_id: '1000003'
    api_secret: 'LbVzYRczEJMY2rq0c8I8ZjASPfCtzvl3f7zfiuyVKSc'
    send_resolved: true
    message: '{{ template "wechat.tmpl" . }}'

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']