SOLVED: It turns out the problem comes from gunicorn preloading and forking interacting with apscheduler. See comment.
Background
I am writing a simple Flask API that runs a periodic background query against a SQL database using apscheduler, then serves incoming REST requests with Flask. The API performs different aggregations depending on the incoming request.
I have a data class object with methods for 1) querying/updating and 2) responding to aggregation requests. The problem is that the Flask resource seems to be stuck on an older version of the data, even though the logs show that the query/update method was called properly.
Code so far
I broke my app down into modules as follows:
app/
├── app.py
└── apis
    ├── __init__.py
    └── model1.py
Data model file
In model1.py, I define the data class and the API endpoints with a flask-restplus namespace, and initialize the data object:
import logging

import pandas as pd
from flask_restplus import Namespace, Resource

api = Namespace('sales')

@api.route('/check')
class check_sales(Resource):
    def post(self):
        req = api.payload
        result = data.get_sales(**req)
        return result, 200

class sales_today():
    def __init__(self):
        self.data = None
        self.update()

    def update(self):
        # some logging here
        self.data = self.check_sql()
        logging.debug("Last Order: %s" % str(self.data.sales_time.max()))

    def check_sql(self):
        query = """
        SELECT region, store, item, sales_count, MAX(UtcTimeStamp) as sales_time FROM db GROUP BY 1,2,3
        """
        sales = pd.read_gbq(query)
        return sales

    def get_sales(self, **kwargs):
        '''
        kwargs here is a dict where we filter and sum
        '''
        # build the mask once, outside the loop, so each filter narrows it further
        mask = pd.Series(True, index=self.data.index)
        for arg_name in kwargs:
            if type(kwargs[arg_name]) is str:
                arg_value = kwargs[arg_name].split(',')
                mask = mask & (self.data[arg_name].isin(arg_value))
        result = {k: v for k, v in kwargs.items()}
        # sum the matching rows; int() on a multi-row Series would raise
        result['count'] = int(self.data.loc[mask, 'sales_count'].sum())
        result['last_updated'] = str(self.data.sales_time.max())
        return result

# module-level instance shared by the resource above and by app.py
data = sales_today()
Module init file
In __init__.py inside app/apis, I expose the data object instance as well as the API namespace:
from .model1 import api as ns_model1
from .model1 import data as data_model1
def add_apins(api):
    api.add_namespace(ns_model1, path='/model1')
Main app file
In the main app.py file, I set up the scheduler to refresh the data every 5 minutes with apscheduler. I then serve this app with gunicorn.
import atexit

from apscheduler.schedulers.background import BackgroundScheduler
from flask import Flask
from flask_restplus import Resource, Api
from werkzeug.serving import run_simple  # needed for the local run below

from apis import add_apins
from apis import data_model1

# parameters
port = 8888
poll_freq = '0-59/5'

# flask app
main_app = Flask(__name__)
api = Api()
add_apins(api)
api.init_app(main_app)

# background scheduler (started at import time, in whichever process imports this module)
sched = BackgroundScheduler()
sched.add_job(data_model1.update, 'cron', minute=poll_freq)
sched.start()
atexit.register(lambda: sched.shutdown(wait=False))

if __name__ == "__main__":
    # serve(application, host='0.0.0.0', port=port) # ssl_context="adhoc" for https testing locally
    run_simple(application=main_app, hostname='0.0.0.0', port=port, use_debugger=True)
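As a side note on the poll_freq value: in apscheduler's cron syntax, an expression of the form a-b/c fires every c values within the range a-b, so '0-59/5' covers the same minutes as '*/5'. The small helper below is purely illustrative (expand_minutes is a hypothetical name, not part of apscheduler) and only handles this one expression form:

```python
def expand_minutes(expr):
    """Expand an 'a-b/c' minute expression into the minutes it matches.

    Hypothetical helper for illustration only; apscheduler does its own parsing.
    """
    span, step = expr.split("/")
    first, last = (int(part) for part in span.split("-"))
    return list(range(first, last + 1, int(step)))


if __name__ == "__main__":
    # Minutes of each hour at which the update job is scheduled to fire.
    print(expand_minutes("0-59/5"))
```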
Expectation and issues
Since the query is refreshed every 5 minutes, I would expect that whenever I query the /check endpoint, the last_updated value in the response payload matches the latest one in the logs (the logging.debug line in the update() method). However, the responses indicate that last_updated still equals the value from when the app first started.
I have confirmed that the data in the DB is indeed up to date, and from the logs I have also confirmed that the update() method runs every 5 minutes and shows the latest timestamp.
I also noticed that the app runs fine with python app.py on Windows, but when I run it with gunicorn it starts exhibiting this weird behaviour.
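In hindsight, and in line with the SOLVED note at the top, this difference is consistent with how forking works: a forked worker gets its own copy of every object that existed before the fork, and later changes made in the parent process do not reach it. The sketch below is a minimal, Unix-only illustration of that behaviour using os.fork directly. SalesCache and run_demo are hypothetical stand-ins, not the real app, and the parent process plays the role of the scheduler job:

```python
import os


class SalesCache:
    """Hypothetical stand-in for the sales_today object."""

    def __init__(self):
        self.last_updated = "startup"

    def update(self, stamp):
        self.last_updated = stamp


def run_demo():
    cache = SalesCache()        # created before forking, like a preloaded app
    go_r, go_w = os.pipe()      # parent -> child: "the update has happened"
    out_r, out_w = os.pipe()    # child -> parent: the value the child sees

    pid = os.fork()
    if pid == 0:
        # Child ("worker"): it holds its own copy of `cache`.
        os.read(go_r, 1)        # block until the parent has updated its copy
        os.write(out_w, cache.last_updated.encode())
        os._exit(0)             # exit immediately, skipping parent cleanup code

    # Parent ("master"): stands in for the scheduler job calling update().
    cache.update("refreshed")
    os.write(go_w, b"x")
    seen_by_worker = os.read(out_r, 64).decode()
    os.waitpid(pid, 0)
    for fd in (go_r, go_w, out_r, out_w):
        os.close(fd)
    return cache.last_updated, seen_by_worker


if __name__ == "__main__":
    print(run_demo())  # prints ('refreshed', 'startup')
```

The parent logs the fresh value while the child keeps answering with the one from startup, which matches what I see between the logs and the /check responses. On Windows with python app.py there is a single process, so the scheduler thread and the request handlers share the same object.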
I am quite puzzled as to where things go wrong. Could it be scoping? Or am I passing the instance between modules wrongly?
Thank you so much for your time and help. Any ideas would be much appreciated.