Here's my use case:
1. Initially, I have around 20GB of JSON files that I need to store for processing. I'll parse them, and the initial table would look like:
    requestId    A    B    C    Ap   Bp   Cp
    ---------   ---  ---  ---  ---  ---  ---
    A723B23C     10   55   51    0    0    0
    D412J34N     20   51   91    0    0    0
    GJF834NF     30   59   71    0    0    0
requestId is unique.
2. After that, I need to do some computations on each of the columns A, B, and C, which involve calculating the percentile rank of each element in A, B, and C (the results going into Ap, Bp, and Cp).
3. After the data is prepared, I need to run simple 'where Ap > 20 and Ap < 30' type queries on the table, and then calculate averages or build a histogram from the resulting dataset. (A rough sketch of all three steps follows this list.)
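To make that concrete, here's the minimal sketch I have in mind for steps 1-3 with PyTables and NumPy. The schema, file name, and the argsort-based percentile rank are my own assumptions rather than settled choices, and the JSON parsing is elided (the three example rows stand in for the real data):

    import numpy as np
    import tables as tb

    # Hypothetical schema mirroring the table above.
    class Request(tb.IsDescription):
        requestId = tb.StringCol(16, pos=0)
        A  = tb.Float64Col(pos=1)
        B  = tb.Float64Col(pos=2)
        C  = tb.Float64Col(pos=3)
        Ap = tb.Float64Col(pos=4)  # percentile rank of A, filled in step 2
        Bp = tb.Float64Col(pos=5)
        Cp = tb.Float64Col(pos=6)

    with tb.open_file("requests.h5", "w") as h5:
        table = h5.create_table("/", "requests", Request)

        # Step 1: append records parsed from the JSON files.
        table.append([("A723B23C", 10, 55, 51, 0, 0, 0),
                      ("D412J34N", 20, 51, 91, 0, 0, 0),
                      ("GJF834NF", 30, 59, 71, 0, 0, 0)])

        # Step 2: percentile rank per column. Reading one column at a
        # time keeps memory proportional to the row count, not the 20GB.
        for col in ("A", "B", "C"):
            vals = table.col(col)
            ranks = vals.argsort().argsort()   # 0..n-1 rank of each element
            pct = 100.0 * ranks / max(len(vals) - 1, 1)
            table.modify_column(colname=col + "p", column=pct)
        table.flush()

        # Step 3: in-kernel 'where' selection, then aggregate in NumPy.
        sel = table.read_where("(Ap > 20) & (Ap < 30)")
        if len(sel):
            print(sel["A"].mean())
            hist, edges = np.histogram(sel["A"], bins=50)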
Q1: I decided to go with PyTables for storing the data. But the question is: would pandas be beneficial in my use case? Would it make my life easier, or would it be an unnecessary complication?
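To show what I'm weighing, here's my rough sketch of the pandas route (untested; the file name and key are placeholders). pandas stores HDF5 through PyTables anyway, and steps 2-3 collapse to near one-liners:

    import pandas as pd

    # Toy frame standing in for the parsed JSON (rows from the table above).
    df = pd.DataFrame(
        {"A": [10, 20, 30], "B": [55, 51, 59], "C": [51, 91, 71]},
        index=pd.Index(["A723B23C", "D412J34N", "GJF834NF"], name="requestId"),
    )

    # Step 2 in one line per column: rank(pct=True) gives ranks in (0, 1].
    df[["Ap", "Bp", "Cp"]] = df[["A", "B", "C"]].rank(pct=True).values * 100

    # pandas stores through PyTables; data_columns makes Ap queryable on disk.
    df.to_hdf("requests_pd.h5", key="requests", format="table",
              data_columns=["Ap"])

    # Step 3 as a disk-side query, without loading the whole file first.
    subset = pd.read_hdf("requests_pd.h5", "requests",
                         where="Ap > 20 & Ap < 30")
    print(subset["A"].mean())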
Q2: I'm expecting to get a separate dataset with, say, columns D, E, and F. This will again be keyed by requestId, with ~80% overlap in requestIds. I might need to perform a JOIN-type operation between the two tables so that I can correlate and analyze data from both datasets. I understand there's no actual JOIN support in PyTables, but there's some workaround method. However, I haven't found much information about its efficiency or speed. Has anyone tried it? What sort of performance can I expect?
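The workaround I've seen described is doing the join by hand: read just the requestId columns from both tables, intersect them in memory, then fetch the matching rows. This is the untested sketch I have in mind (the second table's name is made up); I'd expect the cost to be dominated by the two key-column reads plus the row fetches, but that's exactly the part I have no numbers for:

    import numpy as np
    import tables as tb

    with tb.open_file("requests.h5", "r") as h5:
        t1 = h5.root.requests       # requestId, A..Cp
        t2 = h5.root.requests_def   # requestId, D, E, F

        # Manual join: the keys are tiny next to the 20GB payload, so
        # intersect them in memory, then fetch only the matching rows.
        ids1 = t1.col("requestId")
        ids2 = t2.col("requestId")
        common = np.intersect1d(ids1, ids2)   # ~80% overlap expected

        rows1 = t1.read_coordinates(np.flatnonzero(np.isin(ids1, common)))
        rows2 = t2.read_coordinates(np.flatnonzero(np.isin(ids2, common)))

        # Sort both sides by key so the rows line up for correlation.
        rows1 = rows1[np.argsort(rows1["requestId"])]
        rows2 = rows2[np.argsort(rows2["requestId"])]
        print(np.corrcoef(rows1["A"], rows2["D"])[0, 1])

If the matched subsets fit in RAM, converting each side to a DataFrame and using pd.merge(df1, df2, on="requestId") would presumably be the simpler route.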