I'm impressed by the speed of running transformations, loading data and ease of use of Pandas
and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask
.
I'm currently using a Postgres database to store the data, but the import (coming from csv files) of the data is slow, tedious and error prone and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it's ideal to store it as several pandas DataFrame
(stored in hdf5 format and loaded via pytables).
The question is:
(1) Is this a good idea and what are the things to watch out for? (For instance I don't expect concurrency problems as DataFrame
s are (should?) be stateless and immutable (taken care of from application-side)). What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the hdf5 file into a DataFrame
, so it doesn't need to be loaded for every client request (at least the most recent/frequent dataframes). Flask
(or werkzeug
) has a SimpleCaching
class, but, internally, it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data (the cache) and can concurrent (different process?) requests access the same cache?).
I realise these are many questions, but before I invest more time and build a proof-of-concept, I thought I get some feedback here. Any thoughts are welcome.