1. Problem

Jobs defined in the crontab were not being executed.

2. Troubleshooting

The cron log, /var/adm/cron/log, contained the following:

root@localhost:>tail -f log
! 0481-095 The cron job is being rescheduled.
Tue Aug 31 09:06:00 CST 2021
! cron: 0481-087 The c queue maximum run limit has been reached.
Tue Aug 31 09:06:00 CST 2021
! 0481-095 The cron job is being rescheduled.
Tue Aug 31 09:06:00 CST 2021
! cron: 0481-087 The c queue maximum run limit has been reached.
Tue Aug 31 09:06:00 CST 2021
! 0481-095 The cron job is being rescheduled.
Tue Aug 31 09:06:00 CST 2021

Error message: ! cron: 0481-087 The c queue maximum run limit has been reached.

This means the cron queue is exhausted. By default AIX allows 100 cron jobs to run at the same time, so I concluded that some job was misbehaving.
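To get a feel for how often the limit is being hit, the 0481-087 lines in the log can be counted. This is a minimal sketch; the function name count_queue_limit_hits is made up for illustration, and the only thing it relies on is the message number shown in the log above.

```shell
# Count how many times cron reported that a queue hit its run limit.
# $1: path to the cron log (on the system above: /var/adm/cron/log)
count_queue_limit_hits() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function's status clean.
    grep -c '0481-087' "$1" || true
}
```

Usage: count_queue_limit_hits /var/adm/cron/log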

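(1) Ran what /usr/sbin/cron to check the batch facility; nothing abnormal was found.

(2) Next, review the jobs currently in the crontab; the ones scheduled most frequently are the prime suspects. For each suspicious job, run ps -ef | grep <keyword> | wc -l. The job in question had more than 100 processes running, which pinpointed the problem.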
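The count in step (2) can be off by one because the grep process itself may match the keyword. The sketch below is one way around that, assuming an awk that supports ENVIRON (POSIX awk does). The function name count_job_procs and the KW variable are illustrative, not part of any AIX tool.

```shell
# Count lines of `ps -ef` output (read from stdin) that mention a keyword.
# $1: keyword identifying the suspicious job, e.g. part of the script name
count_job_procs() {
    # The keyword is passed through the environment rather than on awk's
    # command line, so the awk process does not match its own keyword the
    # way grep can in `ps -ef | grep <keyword> | wc -l`.
    KW="$1" awk 'index($0, ENVIRON["KW"]) > 0 { n++ } END { print n + 0 }'
}
```

Usage: ps -ef | count_job_procs <keyword>. A result close to or above the queue limit (100 by default) points at the job that is filling the queue.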

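3. Resolution

(1) Enlarge the queue: add the c.150j2n60w line to the end of /var/adm/cron/queuedefs (the resulting file is shown below), then kill the cron process by its PID. On AIX, cron is normally respawned by init, and the new process picks up the changed limit.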

# COMPONENT_NAME: (CMDCNTL) commands needed for basic system needs
#
# FUNCTIONS:
#
# ORIGINS: 27, 18
#
# (C) COPYRIGHT International Business Machines Corp. 1989,1991
# All Rights Reserved
# Licensed Materials - Property of IBM
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
# cron values for each queue of batch jobs:
#
# queue.xxjxxnxxw
#
# queues:
# a - sh jobs           d - sync event
# b - batch jobs        e - ksh jobs
# c - cron event        f - csh jobs
#
# xxj - maximum number of jobs in this queue (default 100)
# xxn - nice value at which these jobs will run at (default 2)
# xxw - wait time till next execution attempt (default 60 seconds)
#
#
# here is an example of a low priority (nice 20), 50 entry batch queue
# b.50j20n60w

c.150j2n60w
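To double-check an entry before restarting cron, it helps to spell out what each field means. The helper below is a small sketch based only on the queue.xxjxxnxxw format described in the file's own comments. The name describe_queue_entry is hypothetical, and the helper assumes all three fields are present.

```shell
# Print the fields of one queuedefs entry such as "c.150j2n60w".
# Format (from the comments in queuedefs): queue.<jobs>j<nice>n<wait>w
describe_queue_entry() {
    # Prints nothing if the entry does not match the full format.
    echo "$1" | sed -n 's/^\([a-z]\)\.\([0-9][0-9]*\)j\([0-9][0-9]*\)n\([0-9][0-9]*\)w$/queue=\1 max_jobs=\2 nice=\3 wait_seconds=\4/p'
}
```

For the entry added above, describe_queue_entry c.150j2n60w prints queue=c max_jobs=150 nice=2 wait_seconds=60, that is, the cron queue may run up to 150 jobs at once, at nice value 2, retrying after 60 seconds.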

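(2) Enlarging the queue does not solve the problem for good. I recommend killing the processes of the misbehaving cron job and commenting the job out of the crontab; otherwise you never know when the queue will overflow again.

*PS: how to kill the related processes in bulk: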

# Kill every process whose ps -ef line contains <keyword> (the PID is column 2)
for af in `ps -ef | grep <keyword> | grep -v grep | awk '{print $2}'`; do
    kill -9 ${af}
done
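If the job has to stay in the crontab, one common way to stop slow runs from piling up is to skip a run while the previous one is still active. The following is only a sketch of that idea, not what I actually deployed. The function name run_exclusive and the lock directory path are assumptions chosen for illustration.

```shell
# Run a command only if no earlier run still holds the lock.
# $1: lock directory (hypothetical path); remaining args: the command to run
run_exclusive() {
    lock_dir=$1
    shift
    # mkdir is atomic: it fails if the directory already exists, which
    # means a previous run has not finished yet.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 1
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return $status
}
```

In a wrapper script called from cron this would look like run_exclusive /tmp/collect.lock /path/to/collect.sh, where both paths are placeholders. Note that a run killed with kill -9 leaves the lock directory behind, so it has to be removed by hand before the job can run again.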

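4. Summary

The cron job that caused my problem was the BlueKing (蓝鲸) monitoring script that collects machine status information. BlueKing is not very friendly to AIX: the agent had been modified on the spot and was not tested thoroughly. It ran fine at first, but after a while it slowed down so much that every run took a long time, and the overlapping runs eventually used up the entire cron queue. In the end I commented the job out completely and gave up on installing BlueKing on AIX.

Let this be a lesson!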