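Background:
A Flask project served by gunicorn, with APScheduler as the job-scheduling framework.
Server configuration: 1 worker, timeout set to 900 s (15 min), single node.
APScheduler configuration: thread pool of at most 20 threads, process pool of at most 5 processes, misfire_grace_time 900; currently several hundred jobs.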
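
For reference, the scheduler settings above correspond roughly to the sketch below. It assumes APScheduler 3.x, a BackgroundScheduler, and the dictionary-style configuration; the executor alias `processpool` and the names `SCHEDULER_CONFIG` and `build_scheduler` are illustrative and not taken from the project. The gunicorn side would be started with something like `gunicorn --workers 1 --timeout 900 app:app`, where `app:app` is a placeholder module path.

```python
# Sketch of the scheduler settings described above (APScheduler 3.x assumed).
SCHEDULER_CONFIG = {
    # Default executor: thread pool with at most 20 worker threads.
    "apscheduler.executors.default": {"type": "threadpool", "max_workers": 20},
    # Process pool with at most 5 worker processes ("processpool" is an assumed alias).
    "apscheduler.executors.processpool": {"type": "processpool", "max_workers": 5},
    # A run that would start more than 900 s late is skipped and logged as missed.
    "apscheduler.job_defaults.misfire_grace_time": 900,
}


def build_scheduler():
    """Create (but do not start) a scheduler from SCHEDULER_CONFIG."""
    # Imported here so the settings can be inspected without APScheduler installed.
    from apscheduler.schedulers.background import BackgroundScheduler

    return BackgroundScheduler(SCHEDULER_CONFIG)
```

Because the scheduler is started inside the single gunicorn worker, it lives and dies with that worker process.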
Problem:
At runtime some jobs are interrupted for no apparent reason, even though the job code itself is fine.
Logs:
Log 1:
2019-08-08 09:22:18,410 base.py:120:WARNING:Run time of job "Monitor Job (trigger: interval[1 day, 0:00:00], next run at: 2019-08-08 09:07:07 GMT)" was missed by 0:15:10.588683
Log 2:
2019-08-08 09:06:56,435 __init__.py:184:INFO:Job update sql:insert xx into xxx
[2019-08-08 09:21:56 +0000] [9765] [CRITICAL] WORKER TIMEOUT (pid:10176)
[2019-08-08 09:21:57 +0000] [10440] [INFO] Booting worker with pid: 10440
2019-08-08 09:21:57,935 base.py:433:INFO:Adding job tentatively -- it will be properly scheduled when the scheduler starts
2019-08-08 09:21:58,055 base.py:867:INFO:Added job "manage_backend_jobs" to job store "default"
2019-08-08 09:21:58,055 base.py:159:INFO:Scheduler started
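Analysis:
Cause of log 1: judging from the configuration, the thread pool had probably reached its limit, so the job kept waiting for a free thread. By the time it could start, it was more than 900 s (misfire_grace_time) past its scheduled run time, so APScheduler skipped the run and logged it as missed.
Cause of log 2: the log timestamps jump from 09:06:56 to 09:21:56, which matches the 15 min timeout in the configuration exactly. The job was most likely blocked while running, on network I/O or for some other reason, so the worker hit the gunicorn timeout and was killed. A new worker was booted and the scheduler restarted with it.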
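
The arithmetic behind both conclusions can be checked with a small sketch. `is_misfire` is a simplified stand-in for the lateness check APScheduler performs before running a job, not the library's actual code; `parse`, `GRACE` and `WORKER_TIMEOUT` are names made up for this example. Timestamps are taken from the logs above and truncated to whole seconds, so the lateness differs slightly from the logged 0:15:10.588683.

```python
from datetime import datetime, timedelta

GRACE = timedelta(seconds=900)           # misfire_grace_time from the configuration
WORKER_TIMEOUT = timedelta(seconds=900)  # gunicorn timeout from the configuration


def parse(ts):
    """Parse a log timestamp that has been truncated to whole seconds."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def is_misfire(run_time, now, grace=GRACE):
    """Return (missed, lateness); missed is True when lateness exceeds the grace period."""
    lateness = now - run_time
    return lateness > grace, lateness


# Log 1: scheduled for 09:07:07, but only picked up at about 09:22:18.
missed, lateness = is_misfire(parse("2019-08-08 09:07:07"),
                              parse("2019-08-08 09:22:18"))

# Log 2: gap between the last job log line and the WORKER TIMEOUT line.
silence = parse("2019-08-08 09:21:56") - parse("2019-08-08 09:06:56")
hit_worker_timeout = silence >= WORKER_TIMEOUT

print(missed, lateness, silence, hit_worker_timeout)
```

Here the job is about 11 s beyond the grace period, and the silent gap equals the worker timeout to the second.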
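Solution:
(1) Check the server's current load and raise the process pool and thread pool sizes accordingly.
(2) Run the jobs in a distributed fashion instead of on a single node.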
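
As a rough illustration of option (1), the sketch below derives larger pool sizes from the machine's CPU count and returns them in APScheduler's dictionary-style executor settings. The function name `scaled_executor_config`, the `threads_per_cpu` multiplier and the `processpool` alias are assumptions made for this example. The sizing rule is an arbitrary heuristic, so real values should come from measuring the server under load.

```python
import os


def scaled_executor_config(cpu_count=None, threads_per_cpu=10):
    """Build executor settings sized from the CPU count (a heuristic, not a rule)."""
    cpus = cpu_count or os.cpu_count() or 1
    return {
        "apscheduler.executors.default": {
            "type": "threadpool",
            # Never go below the current setting of 20 threads.
            "max_workers": max(20, cpus * threads_per_cpu),
        },
        "apscheduler.executors.processpool": {
            "type": "processpool",
            # Never go below the current setting of 5 processes.
            "max_workers": max(5, cpus),
        },
    }


config = scaled_executor_config(cpu_count=8)
print(config["apscheduler.executors.default"]["max_workers"])      # 80 threads
print(config["apscheduler.executors.processpool"]["max_workers"])  # 8 processes
```

Larger pools only help with jobs that miss their grace period while queued. They do not prevent a blocked worker from being killed by the gunicorn timeout.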
Reference:
https://github.com/benoitc/gunicorn/issues/1801