Best way to handle Pandas Dataframe in Flask

Question

I have a Flask application that reads a CSV file using pandas and returns some data from it. No manipulation is done to the DataFrame. I have the DataFrame stored in pickle format, so for every request that comes in, the application unpickles the file, reads the data, and returns it to the client.

from flask import Flask, request, jsonify, abort
from flask_cors import CORS, cross_origin
import pandas as pd
import os
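import json

application = Flask(__name__)

# Validate_API_Key below is a custom decorator defined elsewhere (not shown).
@application.route('/Getdata', methods=['GET'])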
@cross_origin()
@Validate_API_Key
def index():
    fid = request.args.get('Fid', default=0, type=int)
    df = pd.read_pickle(os.getenv('DFFileName'), compression='gzip')
    res = get_fid_data(fid, df)
    data = res.to_dict(orient='records')
    return jsonify(data=data)

This is how get_fid_data() is set up:

def get_fid_data(fid, df):
    frecord = pd.DataFrame()

    # certain rows are selected from df based on fid and the
    # rows are appended to frecord. The frecord is then returned.

    return frecord

My question is: is there a way to make the df global after reading it initially? Unpickling the df for every request seems unnecessary if it's possible to "persist" the DataFrame in memory for as long as the Flask application is running. That way, for every request that comes in, I could read the df from memory rather than from the file.

Is there a way to achieve this?

Take a look at this https://github.com/zalando/connexion/issues/1154 Also this https://stackoverflow.com/questions/24644715/pandas-as-fast-data-storage-for-flask-application — IoaTzimas, Oct 29 '20 at 19:47
thanks. Seems like I can read the file in the `@application.before_first_request()` and cache that. I'll give that a try — sagar1025, Oct 29 '20 at 20:29
@sagar1025 you can create a `module` with `df initialisation`. just an example `df1 = pd.read_pickle(....) blablabla... df2 = pd.read_csv(....)blabla..`. in your api: `from your_module import df1` — Danila Ganchar, Oct 29 '20 at 21:13
Interesting @DanilaGanchar Will df1 be something "global" after its import? Can it be used directly inside routes? — IoaTzimas, Oct 29 '20 at 21:40
@IoaTzimas you are using other functions/instances inside routes (`abort`, `jsonify` etc) ;) — Danila Ganchar, Oct 29 '20 at 22:08
Yeah, you are probably right. I was surprised because it looks like a very elegant solution for storing repetitive data, and I haven't seen it suggested in similar questions. — IoaTzimas, Oct 29 '20 at 22:15
@IoaTzimas true for your case. because you have no dependencies on the application configuration or other instances — Danila Ganchar, Oct 29 '20 at 22:19
What about the risks of common global objects? Do they exist in such case? We can simply assign the df globally inside the app (which is not recommended because of risks). Is there any difference between this and the import from other file regarding risks? — IoaTzimas, Oct 29 '20 at 22:23
@IoaTzimas I believe that for your case, a standalone `module` with all the necessary `datasets` will be a clearer and more understandable solution. Moreover, it will not depend on the application level (`Flask`, `context` etc). Also, you can easily mock a `module` for `tests`. From my experience I can say that it is better to have fewer dependencies on `Flask` — Danila Ganchar, Oct 29 '20 at 22:42
@DanilaGanchar that sounds like a clean way of doing it. No cache overhead or expiring to worry about. It's a nice alternative to caching! — sagar1025, Oct 30 '20 at 01:20
@sagar1025 by the way, you can move other functions (such as `def get_fid_data(fid: int) -> list`) into your `module`. Not sure if you will have a lot of code there. Anyway, you can create more modules if you need a lot of functions for `datasets` processing. Also, you can easily write `unittests`. Plus, you can easily replace `Flask` with anything in the future; your modules and tests will work fine. — Danila Ganchar, Oct 30 '20 at 08:44
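
The module approach suggested in the comments can be sketched roughly as follows. This is a minimal sketch under assumptions, not the commenters' code: the module name `datasets.py`, the function `load_df`, and the `Fid` column used for row selection are all hypothetical (the question does not show the real selection logic). It also swaps in `functools.lru_cache` for a plain module-level `df1 = pd.read_pickle(...)`, so the file is read on first use rather than at import time.

```python
# datasets.py: hypothetical module name; nothing here depends on Flask.
import os
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=1)
def load_df() -> pd.DataFrame:
    """Unpickle the DataFrame once per process; later calls reuse the cached object."""
    # DFFileName is the environment variable used in the question.
    return pd.read_pickle(os.environ["DFFileName"], compression="gzip")


def get_fid_data(fid: int) -> list:
    """Return the rows matching fid as a list of dicts, ready for jsonify."""
    df = load_df()
    # Assumption: rows are selected by equality on a column named "Fid".
    # The actual selection logic is not shown in the question.
    return df[df["Fid"] == fid].to_dict(orient="records")
```

In the route, `from datasets import get_fid_data` followed by `jsonify(data=get_fid_data(fid))` would then replace the per-request `pd.read_pickle` call. Each worker process keeps its own copy of the DataFrame, and sharing it like this is only safe while the routes treat it as read-only.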
