Periodically checking that a process is running, using crontab

Contents
  1. Setting up the scheduled task
    1. Scheduled task logic
      1. process-daemon.sh, the daemon script
      2. startup-PROCESS_NAME.sh, the script that starts the process
    2. Deployment steps

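This post presents shell scripts that act as a daemon (watchdog) for a target process: a cron job runs them every minute to check whether the process is running normally.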

Setting up the scheduled task

crontab -l
*/1 * * * * bash /home/monitor/process-daemon.sh > /home/monitor/process-daemon.log 2>&1
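
The entry above can also be installed without opening an editor. The following is a minimal sketch, assuming a hypothetical helper `cron_entry` (not part of the original scripts) that only builds the line; the schedule and paths are the ones shown above.

```shell
#!/bin/bash
# Hypothetical helper: print a crontab line that runs a script every minute
# and redirects its stdout and stderr to a log file.
#   $1 path of the script, $2 path of the log file
cron_entry() {
    local script="$1" logfile="$2"
    printf '*/1 * * * * bash %s > %s 2>&1\n' "$script" "$logfile"
}

cron_entry /home/monitor/process-daemon.sh /home/monitor/process-daemon.log
```

To install it, the output can be appended to the current table with `(crontab -l 2>/dev/null; cron_entry SCRIPT LOG) | crontab -`. Note that `>` truncates the log on every run, so the file only ever holds the output of the latest check; `>>` would keep the history instead.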

Scheduled task logic

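  1. Enable set -e so that the script aborts as soon as a command fails.
  2. Set the name of the process to monitor (it must be unique) and the maximum CPU the process may use (half of the machine's total CPU capacity).
  3. Define a flag file: when the daemon kills the process because it used too many resources, the flag guarantees that the process is not restarted automatically. The flag has to be removed by hand.
  4. Count the running instances with pgrep. If exactly one is running, check its current CPU usage with top; if the usage exceeds the threshold, send a message to the Elasticsearch cluster to record the anomaly, then kill the process.
  5. If pgrep reports more than one instance, also send a message to Elasticsearch.
  6. If neither condition holds (no instance is running), check whether the flag left by a CPU-threshold kill exists: if it does not, start the process; if it does, do nothing.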
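
The branching in steps 4 to 6 can be sketched as a pure function, which makes it easy to try out without touching a real process. This is only an illustration: `decide_action`, `cpu_threshold`, and the action names are hypothetical and do not appear in the original scripts, and the floating-point comparison is done with awk here rather than bc.

```shell
#!/bin/bash
# Hypothetical helper: map the observed state to one action name.
#   $1 number of matching processes
#   $2 CPU usage of the process in percent (may be empty)
#   $3 CPU threshold in percent
#   $4 "yes" if the killed flag exists, anything else otherwise
decide_action() {
    local count="$1" usage="$2" threshold="$3" flagged="$4"
    if [ "$count" -eq 1 ]; then
        # awk exits with status 0 only when usage > threshold
        if [ -n "$usage" ] && awk -v u="$usage" -v t="$threshold" 'BEGIN { exit !(u + 0 > t + 0) }'; then
            echo "alert-and-kill"
        else
            echo "none"
        fi
    elif [ "$count" -gt 1 ]; then
        echo "alert-too-many"
    elif [ "$flagged" = "yes" ]; then
        echo "none"
    else
        echo "start"
    fi
}

# Threshold from step 2: half of the total capacity, at 100 percent per CPU.
#   $1 number of CPUs
cpu_threshold() {
    echo $(( $1 * 100 / 2 ))
}

decide_action 1 12.5 "$(cpu_threshold 8)" no    # prints: none
decide_action 1 450.0 "$(cpu_threshold 8)" no   # prints: alert-and-kill
decide_action 0 "" "$(cpu_threshold 8)" yes     # prints: none
```

Keeping the decision separate from the side effects (curl, kill, touch) is a design choice of this sketch, not of the original script, which performs the actions inline.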

1. process-daemon.sh, the daemon script

#!/bin/bash
# author ice
# 2020-01-13
# daemon for special process
set -e # abort as soon as an unhandled command fails
#set -x # for debug

# print a message prefixed with a timestamp
function log(){
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1"
}

PROCESS_NAME=gohangout
# threshold in percent: half of the total capacity (100 per CPU)
((CPU_THRESHOLD=$(lscpu | grep "^CPU(s):" | awk '{print $2}')*100/2))
# flag file, created in the current working directory
PROCESS_KILLED_FLAG=${PROCESS_NAME}_killed_flag
PROCESS_COUNT=$(pgrep "$PROCESS_NAME" | wc -l)
START_PROCESS=/home/monitor/${PROCESS_NAME}.sh

ES_ACCOUNT=monitor
ES_PWD=Jkzyzh4lb
ES_URL=http://appmon.btit.huawei.com/
ES_INDEX=filebeat-agent

log "$PROCESS_NAME instance count : $PROCESS_COUNT"
# check process instance number
if [ "$PROCESS_COUNT" -eq 1 ];then
    log "$PROCESS_NAME is running."
    # get process cpu usage (column 9 of top's batch output is %CPU)
    PROCESS_CPU_USAGE=$(top -b -n 1 | grep "$PROCESS_NAME" | awk '{print $9}')
    if test "$PROCESS_CPU_USAGE";then
        log "current usage is $PROCESS_CPU_USAGE, threshold is $CPU_THRESHOLD"
        # compare to threshold (bc handles the floating-point value)
        if [ "$(echo "$PROCESS_CPU_USAGE > $CPU_THRESHOLD"|bc)" -eq 1 ];then
            log "$PROCESS_NAME cpu usage is greater than $CPU_THRESHOLD, stop $PROCESS_NAME instance"
            # send message to elasticsearch; a failed request must not prevent the kill below
            MESSAGE="{\"@timestamp\":\"$(date '+%Y-%m-%dT%H:%M:%S%z')\",\"host\":\"$(hostname)\",\"app\":\"${PROCESS_NAME}\",\"cpu_usage\":$PROCESS_CPU_USAGE }"
            # the Content-Type header is required by Elasticsearch 6.0 and later
            curl -s -u "${ES_ACCOUNT}:${ES_PWD}" -H 'Content-Type: application/json' "${ES_URL}${ES_INDEX}/doc" -d "$MESSAGE" || log "failed to send message to elasticsearch"
            # kill the process and create the flag
            log "kill process: 'kill -9 $(pgrep "$PROCESS_NAME")'"
            kill -9 $(pgrep "$PROCESS_NAME")
            touch "$PROCESS_KILLED_FLAG"
        fi
    fi
elif [ "$PROCESS_COUNT" -gt 1 ];then
    log "there are too many $PROCESS_NAME instances."
    MESSAGE="{\"@timestamp\":\"$(date '+%Y-%m-%dT%H:%M:%S%z')\",\"host\":\"$(hostname)\",\"app\":\"${PROCESS_NAME}\",\"instances_count\": $PROCESS_COUNT }"
    curl -s -u "${ES_ACCOUNT}:${ES_PWD}" -H 'Content-Type: application/json' "${ES_URL}${ES_INDEX}/doc" -d "$MESSAGE" || log "failed to send message to elasticsearch"
else
    log "no $PROCESS_NAME instance exists."
    if [ ! -f "$PROCESS_KILLED_FLAG" ]; then
        log "start $PROCESS_NAME."
        bash "${START_PROCESS}"
    else
        log "detect $PROCESS_KILLED_FLAG, execute nothing"
    fi
fi
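
The two JSON documents sent to Elasticsearch differ only in the metric they carry, so the escaped string can be factored into one function. A minimal sketch, assuming a hypothetical helper `build_message` that is not part of the original script; it uses `uname -n` for the host name, which prints the same node name as `hostname` on Linux.

```shell
#!/bin/bash
# Hypothetical helper: build the JSON document the daemon script sends.
#   $1 application name, $2 metric name, $3 numeric metric value
build_message() {
    local app="$1" metric="$2" value="$3"
    printf '{"@timestamp":"%s","host":"%s","app":"%s","%s":%s}\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$(uname -n)" "$app" "$metric" "$value"
}

build_message gohangout cpu_usage 120.5
build_message gohangout instances_count 2
```

The result could then be passed to curl as `-d "$(build_message gohangout cpu_usage "$PROCESS_CPU_USAGE")"`. The helper does no JSON escaping, so it is only suitable for values like the ones above that contain no quotes or backslashes.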

2. startup-PROCESS_NAME.sh, the script that starts the process

#!/bin/bash
# author ice
# date 2020-01-13
# desc: start process

PROCESS_NAME=gohangout
PROCESS_COUNT=$(pgrep "$PROCESS_NAME" | wc -l)
WORK_DIR=/home/monitor/workspace

# start the process in the background with the config file given as $1
# (a function is used because a variable holding "nohup ... &" would run
# the command at assignment time instead of storing it)
function start_process(){
    nohup "${WORK_DIR}/${PROCESS_NAME}" --worker 1 --config "$1" >> "${WORK_DIR}/${PROCESS_NAME}.log" 2>&1 &
}

if [ "$PROCESS_COUNT" -ge 1 ];then
    echo "an instance is already running, can't start a new one."
elif [[ $# -eq 1 ]];then
    if [[ -f "$1" ]];then
        echo "config file: $1"
        start_process "$1"
    else
        echo "error, config file: $1 does not exist"
    fi
else
    echo "start process with default config."
    start_process "${WORK_DIR}/config.yml"
fi

Deployment steps

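  1. In the user's home directory, create the startup script PROCESS_NAME.sh. Its path must match START_PROCESS in process-daemon.sh, which is /home/monitor/gohangout.sh for the example above.
  2. In the user's home directory, create the daemon script process-daemon.sh.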
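
Once both scripts are in place, the only manual operation left is removing the flag that the daemon script creates after it kills the process for excessive CPU usage; until the flag is gone, the process is not restarted. A minimal sketch, assuming a hypothetical helper `clear_killed_flag` and assuming the flag sits in the user's home directory (the daemon script creates it with a relative path, and cron normally runs jobs from the home directory):

```shell
#!/bin/bash
# Hypothetical helper: remove the "killed" flag so that the daemon script
# is allowed to start the process again on its next run.
#   $1 process name, $2 directory that holds the flag
clear_killed_flag() {
    local flag="${2}/${1}_killed_flag"
    if [ -f "$flag" ]; then
        rm -f -- "$flag"
        echo "removed $flag"
    else
        echo "no flag at $flag"
    fi
}

clear_killed_flag gohangout "$HOME"
```

It is worth finding out why the process exceeded the threshold before clearing the flag, since the daemon script will start the process again within a minute.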