A shell script that acts as a daemon (watchdog) and checks whether a target process is running normally.
Setting up the scheduled task
```shell
$ crontab -l
*/1 * * * * bash /home/monitor/process-daemon.sh > /home/monitor/process-daemon.log 2>&1
```
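
The entry above runs the daemon script once a minute. A minimal sketch of adding that entry without duplicating it on repeated runs is shown below. The helper name `ensure_cron_line` is hypothetical and not part of the original scripts; it only transforms crontab text, and the line that would actually install the result is left commented out.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): read a crontab on stdin
# and print it with ENTRY appended only if an identical line is not present.
ensure_cron_line() {
    local entry="$1" current
    current="$(cat)"
    if printf '%s\n' "$current" | grep -qxF -- "$entry"; then
        printf '%s\n' "$current"            # already present: unchanged
    elif [ -z "$current" ]; then
        printf '%s\n' "$entry"              # empty crontab: entry only
    else
        printf '%s\n%s\n' "$current" "$entry"
    fi
}

ENTRY='*/1 * * * * bash /home/monitor/process-daemon.sh > /home/monitor/process-daemon.log 2>&1'

# Demonstration on an in-memory crontab; nothing is installed here.
printf '%s\n' '0 3 * * * /usr/bin/true' | ensure_cron_line "$ENTRY"

# To apply it for real (assumption: run as the monitoring user):
# crontab -l 2>/dev/null | ensure_cron_line "$ENTRY" | crontab -
```

Note that the entry redirects with `>`, so the log file is overwritten on every run and only holds the output of the most recent check.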
Scheduled task logic
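- Set `-e` so that the script aborts as soon as any command fails.
- Set the name of the process to monitor (it must be unique) and the maximum CPU the process may use (half of the machine's CPUs; since top reports 100% per core, the threshold is the CPU count × 100 / 2).
- Set a flag file: once the daemon has killed the process for using too many resources, the process is not restarted automatically, and the flag has to be removed by hand.
- Count the process instances with pgrep. When the count is exactly 1, check the process's current CPU usage with top; if it exceeds the threshold, send a message to the ES cluster to record the anomaly, then kill the process.
- If pgrep finds more than one instance, also send a message to ES.
- If neither condition holds (no instance is running), first check whether the flag left by a CPU-threshold kill exists; if it does not, start the process, otherwise do nothing.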
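
The branching described in the list above can be sketched as a single side-effect-free function. This is only an illustration of the decision table, not the author's script: the function name `decide_action` and the action labels it prints are hypothetical, and the comparison uses awk here (rather than bc) so that fractional CPU values work without extra tools.

```shell
#!/bin/bash
# decide_action COUNT CPU_USAGE THRESHOLD FLAG_EXISTS(yes|no)
# Prints one hypothetical action label: kill, ok, alert, start or none.
decide_action() {
    local count="$1" usage="$2" threshold="$3" flag="$4"
    if [ "$count" -eq 1 ]; then
        # awk compares numerically, so values such as 12.5 are handled
        if awk -v u="$usage" -v t="$threshold" 'BEGIN { exit !(u > t) }'; then
            echo kill      # over the threshold: report to ES, kill, set flag
        else
            echo ok        # single instance within limits
        fi
    elif [ "$count" -gt 1 ]; then
        echo alert         # too many instances: report to ES
    elif [ "$flag" = yes ]; then
        echo none          # killed earlier for CPU usage: wait for manual cleanup
    else
        echo start         # not running and no flag: start it
    fi
}

decide_action 1 450.0 400 no    # prints "kill"
decide_action 1 12.5 400 no     # prints "ok"
decide_action 3 0 400 no        # prints "alert"
decide_action 0 0 400 yes       # prints "none"
decide_action 0 0 400 no        # prints "start"
```

Keeping the decision separate from the actions makes it possible to check each branch without actually killing or starting anything.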
1. The daemon script, process-daemon.sh
```shell
#!/bin/bash
# author ice
# 2020-01-13
# daemon for special process
set -e # abort on the first failing command
#set -x

function log(){
    echo "$(date "+%Y-%m-%d %H:%M:%S") $1"
}

PROCESS_NAME=gohangout
# top reports 100% per core, so half of the machine is cores * 100 / 2;
# "^CPU(s):" avoids matching other lscpu lines that start with "CPU(s)"
((CPU_THRESHOLD=$(lscpu | grep "^CPU(s):" | awk '{print $2}')*100/2))
# relative path: resolved against the directory the script is started in
PROCESS_KILLED_FLAG=${PROCESS_NAME}_killed_flag
PROCESS_COUNT=$(pgrep $PROCESS_NAME | wc -l)
START_PROCESS=/home/monitor/${PROCESS_NAME}.sh

ES_ACCOUNT=monitor
ES_PWD=Jkzyzh4lb
ES_URL=http://appmon.btit.huawei.com/
ES_INDEX=filebeat-agent

log "$PROCESS_NAME instance count : $PROCESS_COUNT"
# check process instance number
if [ $PROCESS_COUNT -eq 1 ];then
    log "$PROCESS_NAME is running."
    # get process cpu usage
    PROCESS_CPU_USAGE=$(top -b -n 1 | grep $PROCESS_NAME | awk '{print $9}')
    if test "$PROCESS_CPU_USAGE";then
        log "current usage is $PROCESS_CPU_USAGE, threshold is $CPU_THRESHOLD"
        # compare to threshold
        if [ $(echo "$PROCESS_CPU_USAGE > $CPU_THRESHOLD"|bc) -eq 1 ];then
            log "$PROCESS_NAME cpu usage is greater than $CPU_THRESHOLD, stop $PROCESS_NAME instance"
            # send message to elasticsearch
            # (Elasticsearch 6.0+ requires an explicit Content-Type for request bodies)
            MESSAGE="{\"@timestamp\":\"$(date "+%Y-%m-%dT%H:%M:%S%z")\",\"host\":\"$(hostname)\",\"app\":\"${PROCESS_NAME}\",\"cpu_usage\":$PROCESS_CPU_USAGE }"
            curl -s -u ${ES_ACCOUNT}:${ES_PWD} -H "Content-Type: application/json" ${ES_URL}${ES_INDEX}/doc -d "$MESSAGE"
            # kill the process and create the flag
            log "kill process: 'kill -9 $(pgrep $PROCESS_NAME)'"
            kill -9 $(pgrep $PROCESS_NAME)
            touch $PROCESS_KILLED_FLAG
        fi
    fi
elif [ $PROCESS_COUNT -gt 1 ];then
    log "there are too many $PROCESS_NAME instances."
    MESSAGE="{\"@timestamp\":\"$(date "+%Y-%m-%dT%H:%M:%S%z")\",\"host\":\"$(hostname)\",\"app\":\"${PROCESS_NAME}\",\"instances_count\": $PROCESS_COUNT }"
    curl -s -u ${ES_ACCOUNT}:${ES_PWD} -H "Content-Type: application/json" ${ES_URL}${ES_INDEX}/doc -d "$MESSAGE"
else
    log "no $PROCESS_NAME instance exists."
    if [ ! -f "$PROCESS_KILLED_FLAG" ]; then
        log "start $PROCESS_NAME."
        bash ${START_PROCESS}
    else
        log "detect $PROCESS_KILLED_FLAG, execute nothing"
    fi
fi
```
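
The daemon script builds the same JSON document in two places, differing only in the last field. As a sketch of how that could be factored out, the hypothetical function `build_message` below produces the message body with `printf`. The function name and its parameters are assumptions made for illustration, and `uname -n` is swapped in for `hostname` only because it is available on any POSIX system.

```shell
#!/bin/bash
# build_message APP FIELD VALUE
# Hypothetical helper: prints the JSON body sent to Elasticsearch, where
# FIELD is e.g. cpu_usage or instances_count and VALUE is a bare number.
build_message() {
    local app="$1" field="$2" value="$3"
    printf '{"@timestamp":"%s","host":"%s","app":"%s","%s":%s}\n' \
        "$(date "+%Y-%m-%dT%H:%M:%S%z")" "$(uname -n)" "$app" "$field" "$value"
}

# The two messages the daemon sends, printed instead of posted:
build_message gohangout cpu_usage 450.0
build_message gohangout instances_count 3
```

The output of `build_message` could then be passed to `curl` through `-d`, exactly as the `MESSAGE` variable is in the script.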
2. The process startup script, PROCESS_NAME.sh
```shell
#!/bin/bash
# author ice
# date 2020-01-13
# desc: start process

PROCESS_NAME=gohangout
PROCESS_COUNT=$(pgrep $PROCESS_NAME | wc -l)
WORK_DIR=/home/monitor/workspace

# start the process in the background with the given config file
# (a function is used because assigning the command line to a variable
# would launch it immediately and leave the variable unusable)
function start_process(){
    nohup ${WORK_DIR}/${PROCESS_NAME} --worker 1 --config "$1" >> ${WORK_DIR}/${PROCESS_NAME}.log 2>&1 &
}

if [ $PROCESS_COUNT -ge 1 ];then
    echo "an instance is already running, can't start a new one."
elif [[ $# -eq 1 ]];then
    if [[ -f "$1" ]];then
        echo "config file: $1"
        start_process "$1"
    else
        echo "error, config file: $1 does not exist"
    fi
else
    echo "start process with default config."
    start_process ${WORK_DIR}/config.yml
fi
```
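
The startup script accepts an optional config file argument and falls back to the default config otherwise. A minimal sketch of that selection on its own, with the hypothetical helper name `resolve_config`, might look like this; it only prints the path that would be used and starts nothing.

```shell
#!/bin/bash
# resolve_config DEFAULT [CONFIG]
# Hypothetical helper: prints the config path to use, or fails with a
# message on stderr when an explicit CONFIG is given but does not exist.
resolve_config() {
    local default="$1"
    if [ "$#" -lt 2 ]; then
        echo "$default"
    elif [ -f "$2" ]; then
        echo "$2"
    else
        echo "error, config file: $2 does not exist" >&2
        return 1
    fi
}

# No argument: the default path is printed.
resolve_config /home/monitor/workspace/config.yml

# Explicit but missing file (assuming /no/such/file.yml does not exist).
resolve_config /home/monitor/workspace/config.yml /no/such/file.yml || echo "not started"
```

Returning a non-zero status for a missing file lets the caller decide whether to stop or fall back, instead of only printing a message.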
Deployment steps
- Create the startup script PROCESS_NAME.sh in the user's home directory
- Create the daemon script process-daemon.sh in the user's home directory
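
After the two files are created, a quick sanity check can catch a missing file or a syntax error before cron starts running the daemon. The sketch below uses a hypothetical function `check_deployment` and relies on `bash -n`, which parses a script without executing it. The demonstration runs against a temporary directory with placeholder scripts rather than the real home directory.

```shell
#!/bin/bash
# check_deployment DIR PROCESS_NAME
# Hypothetical helper: verifies that DIR/PROCESS_NAME.sh and
# DIR/process-daemon.sh exist and parse cleanly (bash -n does not run them).
check_deployment() {
    local dir="$1" name="$2" f rc=0
    for f in "$dir/$name.sh" "$dir/process-daemon.sh"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"; rc=1
        elif ! bash -n "$f" 2>/dev/null; then
            echo "syntax error: $f"; rc=1
        else
            echo "ok: $f"
        fi
    done
    return $rc
}

# Demonstration with placeholder scripts in a temporary directory.
demo_dir="$(mktemp -d)"
printf '#!/bin/bash\necho start\n' > "$demo_dir/gohangout.sh"
printf '#!/bin/bash\necho check\n' > "$demo_dir/process-daemon.sh"
check_deployment "$demo_dir" gohangout
rm -rf "$demo_dir"

# On the real host this would be (assumption: scripts live in /home/monitor):
# check_deployment /home/monitor gohangout
```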