I'm impressed by the speed with which Pandas runs transformations and loads data, and by its general ease of use, and I want to leverage these properties (amongst others) to model some largish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.

I'm currently using a Postgres database to store the data, but importing it (from csv files) is slow, tedious and error prone, and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it would be ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
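For illustration, a minimal sketch of the workflow I have in mind (file and key names are placeholders):

```python
import pandas as pd

# One-off conversion: read the csv and store it as an HDF5 file.
# format="table" allows on-disk queries via PyTables later on.
df = pd.read_csv("measurements.csv")
df.to_hdf("datasets.h5", key="measurements", format="table", mode="w")

# On the computing nodes / in the Flask views: load it back.
df = pd.read_hdf("datasets.h5", "measurements")
```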

The questions are:

(1) Is this a good idea, and what are the things to watch out for? (For instance, I don't expect concurrency problems, as the DataFrames are (or should be?) stateless and immutable, which is taken care of on the application side.) What else needs to be watched out for?

(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least the most recent/frequent DataFrames)? Flask (or rather Werkzeug) has a SimpleCache class, but, internally, it pickles the data and unpickles it on access. I wonder whether this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data (the cache), and can concurrent requests (from different processes?) access the same cache)?
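For reference, this is roughly how I imagined the caching (using Werkzeug's SimpleCache, which lives in werkzeug.contrib.cache in older releases and in the separate cachelib package in newer ones; the file name and timeout are placeholders):

```python
import pandas as pd
from werkzeug.contrib.cache import SimpleCache  # newer Werkzeug: from cachelib import SimpleCache

cache = SimpleCache()

def get_dataset(name):
    # Return the cached DataFrame if present, otherwise load it from HDF5.
    df = cache.get(name)
    if df is None:
        df = pd.read_hdf("datasets.h5", name)
        cache.set(name, df, timeout=0)  # timeout=0: keep it until the process restarts
    return df
```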

I realise these are many questions, but before I invest more time and build a proof of concept, I thought I'd get some feedback here. Any thoughts are welcome.

orange
  • Interesting question. I also deal with large quantities of read only data, and often wonder about a lean and pragmatic approach to doing this. Can I ask you how many records you are dealing with? (I have about 300 million = ca. 4GB of data) – Hexatonic Jun 08 '16 at 19:09
  • It's in the range of tens of thousands (perhaps low 100,000s at max). I settled on hdf5 and am very happy with this decision and the use of Pandas (have been for almost 2 years now). The good thing about hdf5 and pytables is that you can run queries on disk without loading the whole file (see the sketch after these comments). With my smallish dataset, however, I never had the need to do this (in your case this may be different). I was particularly impressed by the I/O speed of reading a hdf5 file compared to a SQL query. – orange Jun 09 '16 at 10:45
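A minimal sketch of the kind of on-disk query mentioned in the comment above (file, key and column names are made up; querying an ordinary column like this assumes the table was written with data_columns=True):

```python
import pandas as pd

# Write once in "table" format so PyTables can index and query it on disk.
df = pd.read_csv("measurements.csv")
df.to_hdf("datasets.h5", key="measurements", format="table",
          data_columns=True, mode="w")

# Only the matching rows are read into memory, not the whole file.
subset = pd.read_hdf("datasets.h5", "measurements", where="value > 100")
```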

1 Answer

Answers to some aspects of what you're asking:

It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
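A minimal example of that (the connection string and query are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust user, password, host and database.
engine = create_engine("postgresql://user:password@localhost/mydb")

# Pull the result of an arbitrary query straight into a DataFrame.
df = pd.read_sql("SELECT * FROM measurements", engine)
```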

If you don't need any relational operations, then I would say ditch the Postgres server; if it's already set up and you might need it in the future, keep using the SQL server. The nice thing about the server is that, even if you don't expect concurrency issues, they will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.

Whichever way you go, Flask-Cache will be your friend, using either a memcached or a redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or the HDF5 file. Importantly, it also lets you cache templates, which may play a role in displaying large tables.
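A rough sketch of the memoization pattern I mean (names are placeholders; with the maintained fork of the extension the import would be flask_caching instead of flask_cache):

```python
import pandas as pd
from flask import Flask
from flask_cache import Cache

app = Flask(__name__)
# "simple" keeps the cache in-process; use "redis" or "memcached" for a
# cache that is shared between Gunicorn workers.
cache = Cache(app, config={"CACHE_TYPE": "simple"})

@cache.memoize(timeout=3600)
def load_dataset(name):
    # The result is cached per `name` argument; placeholder file name.
    return pd.read_hdf("datasets.h5", name)
```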

You could, of course, also create a global variable, for example where you create the Flask app, and just import it wherever it's needed. I have not tried this and would thus not recommend it; it might cause all sorts of concurrency issues.

Midnighter
  • Thanks Midnighter. These are great comments. I have the bulk data in an SQL database, but am looking at moving it out as it causes too much headache. Some of the other data can be concurrently modified, so I keep using Postgres. Global variables didn't work when I tried Gunicorn. It seems that every (async) Gunicorn worker has its own address space, so it would also manage its own cache (which is not really useful). – orange Jul 11 '14 at 07:48
  • @orange So far I've only used Apache for deployment and there I had to move to a model where I only use one process but many threads because otherwise requests by the same user got handled by different processes which caused problems with the session. I still have to get to the bottom of that, though. So in that case a global should work fine. I wonder, though, what causes you problems when working with SQL? Using `read_sql` you can even write complete queries and just pass them in. – Midnighter Jul 11 '14 at 08:19
  • 1
    I ran some preliminary tests and noticed that `read_sql` is slower than loading the `HDF5` file from disk. But the bigger issue is that I need to import (csv -> sql tables) the bulk data which takes a lot of time and is also quite unpractical (I have to decide which columns to use/import at time of import. Leaving the file in the source format (or converting to `HDF5`) allows me to consider columns later). – orange Jul 11 '14 at 14:03