I have a Python application that processes in-memory data. It provides sub-second (< 1 s) responses by querying ~1 million records and then aggregating the result set. What would be the best Python framework(s) to make this application more scalable?
Here are more details:
The data comes from a single table on disk, which is loaded into memory as numpy arrays plus custom indexes built with dictionaries.
The application starts breaching the 1-second limit once the record count grows beyond ~5 million. The search part (locating row positions via the indexes) takes only ~100 ms; most of the time (900 to 2000 ms) is spent just summing up the result set.
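To make the bottleneck concrete, here is a simplified, made-up stand-in for my setup (the real column names, key distribution, and index construction differ, but the shape of the hot path is the same):

```python
import numpy as np

# Made-up stand-in for the real data: one numeric column to aggregate, plus a
# custom dictionary index mapping each key to the row positions carrying it.
N_ROWS = 5_000_000
values = np.random.rand(N_ROWS)
keys = np.random.randint(0, 1_000, size=N_ROWS)

# Index construction kept naive here for brevity.
index = {k: np.flatnonzero(keys == k) for k in np.unique(keys)}

def query_and_aggregate(query_keys):
    # Locating the row positions via the index is fast (~100 ms in my app)...
    rows = np.concatenate([index[k] for k in query_keys if k in index])
    # ...but gathering the values and summing them is where the 900-2000 ms go.
    return values[rows].sum()

print(query_and_aggregate(range(100)))
```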
I can also see that the CPU cores and RAM are not used to their full capacity: each core runs at only about 20% and plenty of memory is free.
I have just read through a long list of Python frameworks for distributed computing. I am looking for specific solutions that keep responses real-time, by:
- making better use of the available CPU and RAM on a single machine through parallel processing, to stay within the < 1 second response time (see the first sketch after this list);
- later, extending it beyond a single machine to support up to ~100 million records. The data is in a single table/file that can be horizontally partitioned across many machines, each working independently on its own partition (see the second sketch below).
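For the first point, is a compiled parallel kernel over the selected rows the kind of approach that works here, or is a framework a better fit? A minimal sketch of what I imagine, assuming numba and the row-position layout from the stand-in above (not something I have benchmarked):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def gather_sum(values, rows):
    # Iterations of the prange loop can run on different cores; numba treats
    # the scalar += as a parallel reduction over per-thread partial sums.
    total = 0.0
    for i in prange(rows.shape[0]):
        total += values[rows[i]]
    return total

# Made-up inputs just to show the call shape.
values = np.random.rand(5_000_000)
rows = np.random.randint(0, 5_000_000, size=2_000_000)
print(gather_sum(values, rows))
```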
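For the second point, is the partition-per-machine model I described essentially what something like dask provides? A rough sketch of the same kind of aggregation expressed over partitions (the selection here is a made-up placeholder; this runs locally with threads, but my understanding is that the same code can target a dask.distributed cluster):

```python
import numpy as np
import dask.array as da

# Made-up value column, split into chunks that could live on different workers.
values = da.from_array(np.random.rand(10_000_000), chunks=1_000_000)

# Placeholder for the real selection logic: each worker filters and sums its
# own partition, so only the partial sums have to move between workers.
selected = values[values > 0.5]
print(selected.sum().compute())
```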
Suggestions based on what you have seen working in your past experience are greatly appreciated.