Preface



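Note that "monitoring Dell server hardware" in the title means monitoring hardware health (the status of disks, memory, power supplies and so on). It does not mean monitoring hardware performance or usage such as disk space or memory consumption. It is similar to having Zabbix read hardware status from the iDRAC over SNMP.

Many companies currently use Prometheus to monitor containers and services and Zabbix to monitor hardware and ports, and other monitoring architectures exist as well. This document does not compare the merits of the different systems; it is simply a write-up of one setup. Basic concepts are not explained in detail, so it is aimed at readers who already have some Prometheus experience rather than at newcomers.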



Prerequisites



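<1> Enable SNMP on the iDRAC of every server to be monitored and set the community name, which works like a password (the default is public).
Take note of the community name you set; it is needed later.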




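<2> For security reasons the network is usually restricted. Pick a server that can reach (ping) the iDRAC IP address of every monitored server and install the SNMP monitoring component (snmp_exporter) on it.



<3> The Prometheus server must be able to reach snmp_exporter.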




Component installation



Installing dependencies



yum -y install gcc gcc-c++ make net-snmp net-snmp-utils net-snmp-libs net-snmp-devel golang git
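
Once net-snmp-utils is installed, it can be worth confirming that an iDRAC actually answers SNMP with your community name before going further. The following is a minimal Python sketch, not part of the original setup: it wraps the snmpget tool from net-snmp-utils and queries the standard sysDescr.0 OID. The function names are made up for illustration, and 192.0.2.10 is a documentation address standing in for a real iDRAC IP.

```python
import shutil
import subprocess

SYS_DESCR_OID = "1.3.6.1.2.1.1.1.0"  # sysDescr.0 from the standard MIB-2 system group


def snmpget_command(host, community="public", oid=SYS_DESCR_OID, timeout_s=5):
    """Build the snmpget argument list for an SNMP v2c query (hypothetical helper)."""
    return ["snmpget", "-v2c", "-c", community,
            "-t", str(timeout_s), "-r", "1", host, oid]


def check_snmp(host, community="public"):
    """Return True if the host answers the query; requires snmpget on PATH."""
    if shutil.which("snmpget") is None:
        raise RuntimeError("snmpget not found; install net-snmp-utils first")
    result = subprocess.run(snmpget_command(host, community),
                            capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Only prints the command; call check_snmp() with a real iDRAC IP to run it.
    print(" ".join(snmpget_command("192.0.2.10")))
```

If check_snmp returns False, the usual suspects are the network path to UDP port 161 and a mismatched community name.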


Installing snmp_exporter



<1> Download snmp_exporter

https://github.com/prometheus/snmp_exporter/releases

cd /data
wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz
tar xf snmp_exporter-0.20.0.linux-amd64.tar.gz
mv snmp_exporter-0.20.0.linux-amd64 snmp_exporter


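<2> Configure the service

Configure the service according to your OS version. Do not start it yet, because snmp.yml has not been generated.

CentOS 7

cat /usr/lib/systemd/system/snmp-exporter.service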

[Unit]
Description=SNMP exporter
Documentation=https://github.com/prometheus/snmp_exporter


[Service]
ExecStart=/data/snmp_exporter/snmp_exporter \
--config.file=/data/snmp_exporter/snmp.yml \
--web.listen-address=:9116 \
--snmp.wrap-large-counters
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target


Management commands:
systemctl daemon-reload
systemctl enable snmp-exporter
systemctl restart snmp-exporter
systemctl status snmp-exporter
systemctl stop snmp-exporter

CentOS 6

cat /etc/init.d/snmp_exporter
#!/bin/bash

# chkconfig: 2345 80 80
# description: Start and Stop snmp_exporter
# Source function library.

. /etc/init.d/functions

prog_name="snmp_exporter"
prog_path="/data/${prog_name}/${prog_name}"   # full path to the binary, not the install directory
pidfile="/var/run/${prog_name}.pid"
prog_logs="/data/${prog_name}/${prog_name}.log"
options="--config.file=/data/snmp_exporter/snmp.yml --web.listen-address=:9116 --snmp.wrap-large-counters"
DESC="snmp_exporter"

[ -x "${prog_path}" ] || exit 1

RETVAL=0

start(){
action $"Starting $DESC..." su -s /bin/sh -c "nohup $prog_path $options >> $prog_logs 2>&1 &" 2> /dev/null
RETVAL=$?
PID=$(pidof ${prog_path})
[ ! -z "${PID}" ] && echo ${PID} > ${pidfile}
echo
[ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name
return $RETVAL
}

stop(){
echo -n $"Shutting down $prog_name: "
killproc -p ${pidfile}
RETVAL=$?
echo
[ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
return $RETVAL
}

restart() {
stop
start
}

case "$1" in
start)
start
;;
stop)
stop
;;
restart)
restart
;;
status)
status $prog_path
RETVAL=$?
;;
*)
echo $"Usage: $0 {start|stop|restart|status}"
RETVAL=1
esac

------------------------------------------------------------
cat  /etc/sysconfig/snmp_exporter
ARGS=""



------------------------------------------------------------
Management commands:
chmod +x /etc/init.d/snmp_exporter
chkconfig snmp_exporter on
/etc/init.d/snmp_exporter restart



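Downloading the MIBs and generating snmp.yml



MIB and OID



An OID is the ID an SNMP agent uses to uniquely identify an object or a piece of information; it is a dotted string of numbers such as 1.3.6.1.4.1.4413.1.3.2.1.17.
A MIB is a database, organized as a tree, that stores the information corresponding to each OID.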
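
To make the tree idea concrete, here is a small Python sketch that replaces the longest known numeric prefix of an OID with a name, which is roughly what a MIB lookup does on a much larger scale. The table holds only a handful of well-known nodes (674 is Dell's enterprise number); the function name is hypothetical, and real tools such as snmptranslate resolve names from the MIB files themselves.

```python
# A few well-known nodes near the top of the OID tree.
KNOWN_NODES = {
    "1.3.6.1": "internet",
    "1.3.6.1.2.1": "mib-2",
    "1.3.6.1.4.1": "enterprises",
    "1.3.6.1.4.1.674": "dell",
}


def describe_oid(oid):
    """Replace the longest known numeric prefix of `oid` with its name."""
    parts = oid.strip(".").split(".")
    # Try the longest prefix first, then shorten it one component at a time.
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in KNOWN_NODES:
            return ".".join([KNOWN_NODES[prefix]] + parts[end:])
    return oid  # nothing in the table matched


if __name__ == "__main__":
    print(describe_oid("1.3.6.1.4.1.4413.1.3.2.1.17"))
    print(describe_oid("1.3.6.1.4.1.674.10892.5"))
```

The example OID from the text resolves only as far as the enterprises node, because vendor 4413 is not in this tiny table.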



<1> Download the MIBs that match your server model, and check which systems they are compatible with

https://www.dell.com/support/search/zh-cn#q=mibs&sort=relevancy&f:langFacet=[zh]


wget https://dl.dell.com/FOLDER06009600M/1/Dell-OM-MIBS-940_A00.zip
unzip Dell-OM-MIBS-940_A00.zip


<2> Look up the OIDs

snmptranslate -Tz -m /root/support/station/mibs/iDRAC-SMIv2.mib
cp /usr/share/snmp/mibs/SNMPv2-SMI.txt /root/support/station/mibs/


<3> Generate snmp.yml

Official documentation:
https://github.com/prometheus/snmp_exporter/tree/main/generator#file-format

# Set environment variables
export GO111MODULE=on
export GOPROXY=https://mirrors.aliyun.com/goproxy/
export MIBDIRS=/root/support/station/mibs/

# Fetch and build the generator
go get github.com/prometheus/snmp_exporter/generator
cd ${GOPATH-$HOME/go}/pkg/mod/github.com/prometheus/snmp_exporter@v0.20.0/generator
go build


# Edit generator.yml
(set community to the SNMP community name of your iDRAC)

vim generator.yml

modules:
  idrac:
    walk:
      - 1.3.6.1
    version: 2
    timeout: 30s
    auth:
      community: public

# Generate snmp.yml from generator.yml
./generator generate
cp -r snmp.yml /data/snmp_exporter/


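<4> Start snmp_exporter

systemctl restart snmp-exporter          # CentOS 7
/etc/init.d/snmp_exporter restart        # CentOS 6


<5> Check that metrics can be scraped

http://<snmp_exporter IP>:9116

Notes:
Target: the IP of the remote management card (iDRAC) of the server to scrape. The IP of a NIC configured inside the server's OS does not work.
Module: the module name, i.e. the key directly above walk in snmp.yml (idrac in this document).
If some of your servers use a different SNMP community, copy snmp.yml to a new file and change the community: xxx entry at the very end of the file.
cat snmp.yml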
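
The web form on port 9116 simply issues a request to the exporter's /snmp endpoint with target and module as query parameters, so the same check can be scripted. Below is a minimal Python sketch using only the standard library. The helper names and the example addresses are assumptions for illustration; the module name idrac matches the generator.yml used in this document.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def probe_url(exporter, target, module="idrac"):
    """Build the snmp_exporter probe URL for one iDRAC (hypothetical helper)."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/snmp?{query}"


def scrape(exporter, target, module="idrac", timeout=30):
    """Fetch the metrics text for one target; needs a reachable exporter."""
    with urlopen(probe_url(exporter, target, module), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # 127.0.0.1:9116 and 192.0.2.10 are placeholders for your exporter and iDRAC.
    print(probe_url("127.0.0.1:9116", "192.0.2.10"))
```

An empty or error response from scrape usually points to an unreachable iDRAC, a wrong community, or a module name that does not exist in snmp.yml.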





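Prometheus configuration



However Prometheus was installed, there is no need to install it again if it is already running. The key step is adding an iDRAC section to prometheus.yml.
Documentation on Prometheus monitoring and alerting in general may follow later.



Configuring Prometheus



<1> Configure where alerting rules are read from

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
    - "rule/*.yml"
  # - "second_rules.yml"
Create a directory for alerting rules and put the rule file in it:
mkdir rule
vim rule/idrac.yml


<2> Configure the job, and choose which metrics to collect or drop

Method 1
static_configs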

- job_name: 'IDRAC'
  scrape_interval: 180s             # interval between scrapes
  scrape_timeout: 180s              # scrape timeout
  static_configs:
    - targets:
        - 123.123.123.123           # iDRAC IP to monitor; SNMP uses port 161 by default
#       - 123.123.123.123:161       # a port can be appended if a different one is used
#      labels:                      # optional labels, e.g. the server's internal IP or the data center
#        IP: 'xxx'
#        project: 'xxx'
  metrics_path: /snmp
  params:
    module: [idrac]                 # must match the module name in snmp.yml
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: xxxxx:9116      # address of your snmp_exporter server


Characteristics of this method: every server to be monitored must be added under targets. With several hundred servers, prometheus.yml becomes very long.
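
The three relabel_configs steps used by the jobs in this section are easy to misread, so here is a small Python simulation of what they do to a target's labels. It is only an illustration of the label flow, not how Prometheus implements relabeling; the function name and the example addresses are hypothetical.

```python
def apply_relabeling(address, exporter):
    """Mimic the three relabel steps applied to a single target."""
    labels = {"__address__": address}
    # Step 1: copy the iDRAC address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the iDRAC address as the instance label on all metrics.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual HTTP scrape at snmp_exporter instead.
    labels["__address__"] = exporter
    return labels


if __name__ == "__main__":
    for name, value in apply_relabeling("192.0.2.10", "10.0.0.5:9116").items():
        print(f"{name} = {value}")
```

The net effect is that Prometheus talks HTTP to snmp_exporter while the metrics still carry the iDRAC address as instance.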


Method 2
file_sd_configs


  - job_name: "IDRAC"
    params:
      module: 
      - idrac
    scrape_interval: 180s              
    scrape_timeout: 180s
    metrics_path: /snmp
    file_sd_configs:
      - files:
        - targets/*.json               # reads the JSON files; the directory name is arbitrary, but the directory must exist
        refresh_interval: 5m           # how often the files are re-read
    relabel_configs:
     - source_labels: [__address__]
       target_label: __param_target
     - source_labels: [__param_target]
       target_label: instance
     - target_label: __address__
       replacement: xxxx:9116         # address of your snmp_exporter server




Characteristics of this method: the targets are written to JSON files that must be created. The format is as follows:
cat targets/idrac.json

[
  {
    "targets": [
      "123.123.123.123:161"
    ],
    "labels": {
      "IP": "xxxx",
      "Project": "xxx"
    }
  },
  {
    "targets": [
      "123.123.123.124:161"
    ],
    "labels": {
      "IP": "xxx",
      "Project": "xxx"
    }
  }
]

or

[
  {
    "targets": [
      "123.123.123.123:161",
      "123.123.123.124:161"
    ],
    "labels": {
      "IP": "xxxx",
      "Project": "xxx"
    }
  }
]
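
With many servers, writing these JSON files by hand gets tedious. The following Python sketch generates the first layout shown above (one entry per iDRAC, each with its own labels) from a list of tuples. The function names, the tuple layout and the sample addresses are assumptions for illustration; the label names IP and Project follow the example above.

```python
import json


def build_targets(hosts):
    """hosts: iterable of (idrac_ip, business_ip, project) tuples."""
    return [
        {
            "targets": [f"{idrac_ip}:161"],
            "labels": {"IP": business_ip, "Project": project},
        }
        for idrac_ip, business_ip, project in hosts
    ]


def write_targets(hosts, path):
    """Write the file_sd JSON file that the job's files: pattern picks up."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_targets(hosts), fh, indent=2)
        fh.write("\n")


if __name__ == "__main__":
    sample = [("192.0.2.10", "10.0.0.10", "demo"),
              ("192.0.2.11", "10.0.0.11", "demo")]
    print(json.dumps(build_targets(sample), indent=2))
```

Calling write_targets(sample, "targets/idrac.json") would produce a file that Prometheus picks up at the next refresh_interval without a reload.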


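Method 3
consul_sd_configs
With this method the targets are registered in Consul and Prometheus discovers them automatically through Consul.
Consul itself is not covered in detail here. If you have not used Consul or configured Prometheus alerting before, skip this method for now, since it is harder to follow.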


  - job_name: 'IDRAC'
    params:
      module:
      - idrac
    scrape_interval: 180s
    scrape_timeout: 180s
    metrics_path: /snmp
    consul_sd_configs:
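    - server: 'monitor-consul.com:8500'           # host name of your Consul service; an IP also works
      tag_separator: ','
      services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*idrac.*                          # keep only services whose Consul tags match this regex for this job
        action: keep
      # The "eth-ip" Meta key from Consul; Prometheus exposes it as ..._eth_ip because "-" is not allowed in label names.
      # Its value becomes the ?target= parameter (shown under Prometheus -> Targets -> IDRAC -> Endpoint), so it must be an address at which the iDRAC answers SNMP.
      - source_labels: ['__meta_consul_service_metadata_eth_ip']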
        target_label: __param_target            
      - source_labels: ['__meta_consul_service_address']
        target_label: instance
      - target_label: __address__
        replacement: xxx:9116



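Characteristics of this method: the targets must be registered in Consul, which can be done statically or from a file.
A JSON example follows; adapt it to your needs (labels are arbitrary, but they must match the keywords of the DingTalk group that receives your alerts and your Alertmanager configuration). The # comments are annotations only and must be removed before registering, because JSON has no comment syntax.



cat consul-idrac.json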
{
  "ID": "IDRAC-xxx",
  "Name": "IDRAC-xxx",
  "Tags": [
    "idrac"
  ],
  "Address": "xxx",                                  # iDRAC IP
  "Meta": {                                          # Consul metadata; later rewritten into Prometheus labels
    "eth-ip": "xxx",                                 # business IP of the server
    "project": "beijing"                             # location
  },
  "EnableTagOverride": false,
  "Check": {
    "HTTP": "http://xxxx:9116/metrics",              # health check against your snmp_exporter IP and port
    "Interval": "10s"
  },
  "Weights": {
    "Passing": 10,
    "Warning": 1
  }
}


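Note: the health check targets snmp_exporter, so what it actually verifies is snmp_exporter itself. Even if the IP and other fields above are wrong, the service still shows as healthy in Consul. This does not affect Prometheus monitoring: once the service is registered, Prometheus only reads the service values and labels from Consul and then scrapes according to its own configuration. For SNMP targets the second JSON form below, without a health check, is the better fit.

or
cat consul-idrac2.json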

{
  "ID": "IDRAC-xxx",
  "Name": "IDRAC-xxx",
  "Tags": [
    "idrac"
  ],
  "Address": "xxx:161",
  "Meta": {                                          # Consul metadata; later rewritten into Prometheus labels
    "eth-ip": "xxx",                                 # business IP of the server
    "project": "beijing"                             # location
  }
}




Register
curl --request PUT --data @consul-idrac.json 'http://monitor-consul.com:8500/v1/agent/service/register?replace-existing-checks=1'
Deregister
curl -X PUT http://monitor-consul.com:8500/v1/agent/service/deregister/IDRAC-xxx
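
When many iDRACs have to be registered, the curl call above can be scripted. The sketch below builds the same PUT request in Python with the standard library, using the second JSON form (no health check). The helper names are hypothetical, and the addresses are placeholders. The request is only constructed here; pass it to urllib.request.urlopen to actually send it.

```python
import json
from urllib.request import Request


def service_definition(service_id, idrac_ip, business_ip, project):
    """Build a Consul service definition shaped like consul-idrac2.json."""
    return {
        "ID": service_id,
        "Name": service_id,
        "Tags": ["idrac"],
        "Address": idrac_ip,
        "Meta": {"eth-ip": business_ip, "project": project},
    }


def register_request(consul, definition):
    """Build (but do not send) the PUT request that registers the service."""
    url = f"http://{consul}/v1/agent/service/register?replace-existing-checks=1"
    body = json.dumps(definition).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


if __name__ == "__main__":
    definition = service_definition("IDRAC-demo", "192.0.2.10", "10.0.0.10", "beijing")
    req = register_request("monitor-consul.com:8500", definition)
    print(req.get_method(), req.full_url)
```

Because the payload is produced by json.dumps, it never contains the # annotations that would make a hand-edited file invalid.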






Alerting rule configuration



Base the rules on the metrics defined in your snmp.yml. Not every metric is actually available, so search for each one in Prometheus first.



cat rule/idrac.yml
groups:
    - name: IDRAC-hardware-status
      rules:

      - alert: IDRACDown
        expr: up{job=~"IDRAC.*"} == 0
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} iDRAC scrape failed"

      - alert: IDRACChassisStatus
        expr: chassisStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall chassis component status is abnormal, please check promptly!"
          description: "{{$labels.instance}} chassis components abnormal"

      - alert: IDRACCMOSBatteryStatus
        expr: systemBatteryStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall CMOS battery status is abnormal, please check promptly!"
          description: "{{$labels.instance}} CMOS battery status abnormal"


      - alert: IDRACMemoryStatus
        expr: memoryDeviceStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Memory module status is abnormal, please check promptly!"
          description: "{{$labels.instance}} memory module {{$labels.memoryDeviceIndex}} abnormal"


      - alert: IDRACProcessorStatus
        expr: processorDeviceStatusStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall CPU status is abnormal, please check promptly!"
          description: "{{$labels.instance}} CPU {{$labels.processorDeviceStatusIndex}} abnormal"

      - alert: IDRACNetworkDeviceStatus
        expr: networkDeviceStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} NIC {{$labels.networkDeviceIndex}} abnormal"

      - alert: IDRACPowerSupplyStatus
        expr: powerSupplyStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall power supply status is abnormal, please check promptly!"
          description: "{{$labels.instance}} power supply {{ $labels.powerSupplyIndex }} status abnormal"

      - alert: IDRACStorageStatus
        expr: globalStorageStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Storage controller status is abnormal, please check promptly!"
          description: "{{$labels.instance}} storage controller abnormal"

      - alert: IDRACSystemStatus
        expr: globalSystemStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Overall system component status is abnormal, please check promptly!"
          description: "{{$labels.instance}} system components abnormal"

      - alert: IDRACPhysicalDiskState
        expr: physicalDiskState != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Physical disk state is abnormal, please check promptly!"
          description: "{{$labels.instance}} physical disk {{$labels.physicalDiskNumber}} abnormal"

      - alert: IDRACVirtualDiskState
        expr: virtualDiskState != 2
        for: 1m
        labels:
          status: error
        annotations:
          summary: "Virtual disk state is abnormal, please check promptly!"
          description: "{{$labels.instance}} virtual disk {{$labels.virtualDiskNumber}} abnormal"
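
The rules compare against 3 for most status metrics but against 2 for virtualDiskState, because the two use different enumerations. The Python sketch below spells out the mappings so the thresholds are easier to read. The value tables are reproduced from memory of the Dell iDRAC MIB (IDRAC-MIB-SMIv2) and should be treated as an assumption: check them against the MIB file you downloaded. The function names are hypothetical.

```python
# Assumed enumerations; verify against your own iDRAC MIB before relying on them.
OBJECT_STATUS = {
    1: "other", 2: "unknown", 3: "ok",
    4: "nonCritical", 5: "critical", 6: "nonRecoverable",
}
VIRTUAL_DISK_STATE = {1: "unknown", 2: "online", 3: "failed", 4: "degraded"}


def describe(value, table):
    """Translate a numeric metric value into its enumeration name."""
    return table.get(int(value), f"undocumented({value})")


def should_alert(value, healthy_value):
    """Mirror the `metric != healthy_value` expressions in the rules."""
    return int(value) != healthy_value


if __name__ == "__main__":
    for v in (3, 4, 5):
        print("powerSupplyStatus", v, describe(v, OBJECT_STATUS), should_alert(v, 3))
    for v in (2, 4):
        print("virtualDiskState", v, describe(v, VIRTUAL_DISK_STATE), should_alert(v, 2))
```

One consequence of the != form is that unknown or other values also fire the alert, which may or may not be what you want for components that are simply absent.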
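Reload Prometheus (the reload endpoint only works when Prometheus was started with --web.enable-lifecycle)
curl -X POST http://xxxx:9090/-/reload   # IP of your Prometheus server


To actually receive alerts you also need to configure Alertmanager and the DingTalk plugin prometheus-webhook-dingtalk, and add a bot to your DingTalk group. The alerting pipeline is not demonstrated here.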



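Additional notes



If SNMP scraping fails, check the following:
<1> Can the snmp_exporter server reach the remote management card of the monitored server?
<2> The IP in the registered JSON must be the iDRAC IP, not the server's internal IP.
<3> Does the SNMP community of the remote management card match the community in your snmp.yml?
<4> Adjust the scrape timeout to your network conditions; otherwise timeouts will cause frequent alerts.
<5> With a large number of iDRACs, spread the load over several jobs (job_name).
<6> Alert messages can be adjusted as needed. If alerts do not arrive, check the Alertmanager and prometheus-webhook-dingtalk configuration.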
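
For point <5>, one possible way to split the load is to distribute the iDRAC list across several file_sd files and give each job its own file. The Python sketch below does a simple round-robin split. The function names and the targets/idrac-N.json naming scheme are assumptions for illustration; each job's file_sd_configs would then point at one of the generated files.

```python
import json


def shard(targets, shards):
    """Distribute targets round-robin over `shards` lists."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    return [targets[i::shards] for i in range(shards)]


def shard_files(targets, shards, prefix="targets/idrac"):
    """Map a hypothetical file name per job to its file_sd JSON content."""
    return {
        f"{prefix}-{n}.json": json.dumps([{"targets": group}], indent=2)
        for n, group in enumerate(shard(targets, shards), start=1)
    }


if __name__ == "__main__":
    idracs = [f"192.0.2.{n}:161" for n in range(10, 15)]
    for name, content in shard_files(idracs, 2).items():
        print(name)
        print(content)
```

Round-robin keeps the shards within one target of each other in size, which is usually enough when all iDRACs take a similar time to walk.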