Here's my use case:
1. Initially, I have around 20GB of JSON files that I need to store for processing. I'll parse them, and the initial table would look like:
    requestId    A    B    C    Ap   Bp   Cp
    ---------   ---  ---  ---  ---  ---  ---
    A723B23C     10   55   51    0    0    0
    D412J34N     20   51   91    0    0    0
    GJF834NF     30   59   71    0    0    0
requestId is unique.
2. After that, I need to do some computations on each of the columns A, B, and C, which involve calculating the percentile rank of each element in A, B, and C (the results going into Ap, Bp, and Cp).
3. After the data is prepared, I need to run simple 'where Ap > 20 and Ap < 30' type queries on the table, and then calculate averages or build a histogram from the resulting dataset. (A rough sketch of all three steps follows this list.)
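To make that concrete, here's the minimal sketch I have in mind for steps 1-3 with PyTables and NumPy. The schema, file name, and the argsort-based percentile rank are my own assumptions rather than settled choices, and the JSON parsing is elided (the three example rows stand in for the real data):

    import numpy as np
    import tables as tb

    # Hypothetical schema mirroring the table above.
    class Request(tb.IsDescription):
        requestId = tb.StringCol(16, pos=0)
        A  = tb.Float64Col(pos=1)
        B  = tb.Float64Col(pos=2)
        C  = tb.Float64Col(pos=3)
        Ap = tb.Float64Col(pos=4)  # percentile rank of A, filled in step 2
        Bp = tb.Float64Col(pos=5)
        Cp = tb.Float64Col(pos=6)

    with tb.open_file("requests.h5", "w") as h5:
        table = h5.create_table("/", "requests", Request)

        # Step 1: append records parsed from the JSON files.
        table.append([("A723B23C", 10, 55, 51, 0, 0, 0),
                      ("D412J34N", 20, 51, 91, 0, 0, 0),
                      ("GJF834NF", 30, 59, 71, 0, 0, 0)])

        # Step 2: percentile rank per column. Reading one column at a
        # time keeps memory proportional to the row count, not the 20GB.
        for col in ("A", "B", "C"):
            vals = table.col(col)
            ranks = vals.argsort().argsort()   # 0..n-1 rank of each element
            pct = 100.0 * ranks / max(len(vals) - 1, 1)
            table.modify_column(colname=col + "p", column=pct)
        table.flush()

        # Step 3: in-kernel 'where' selection, then aggregate in NumPy.
        sel = table.read_where("(Ap > 20) & (Ap < 30)")
        if len(sel):
            print(sel["A"].mean())
            hist, edges = np.histogram(sel["A"], bins=50)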
Q1: I decided to go with PyTables for storing the data. But the question is: would pandas be beneficial in my use case? Would it make my life easier, or would it be an unnecessary complication?
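To show what I'm weighing, here's my rough sketch of the pandas route (untested; the file name and key are placeholders). pandas stores HDF5 through PyTables anyway, and steps 2-3 collapse to near one-liners:

    import pandas as pd

    # Toy frame standing in for the parsed JSON (rows from the table above).
    df = pd.DataFrame(
        {"A": [10, 20, 30], "B": [55, 51, 59], "C": [51, 91, 71]},
        index=pd.Index(["A723B23C", "D412J34N", "GJF834NF"], name="requestId"),
    )

    # Step 2 in one line per column: rank(pct=True) gives ranks in (0, 1].
    df[["Ap", "Bp", "Cp"]] = df[["A", "B", "C"]].rank(pct=True).values * 100

    # pandas stores through PyTables; data_columns makes Ap queryable on disk.
    df.to_hdf("requests_pd.h5", key="requests", format="table",
              data_columns=["Ap"])

    # Step 3 as a disk-side query, without loading the whole file first.
    subset = pd.read_hdf("requests_pd.h5", "requests",
                         where="Ap > 20 & Ap < 30")
    print(subset["A"].mean())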
Q2: I'm expecting to get a separate dataset with, say, columns D, E, and F. This will again be keyed by requestId, with ~80% overlap in requestIds. I might need to perform a JOIN-type operation between the two tables so that I can correlate and analyze data from both datasets. I understand there's no actual JOIN support in PyTables, but there's some workaround method. However, I haven't found much information about its efficiency or speed. Has anyone tried it? What sort of performance can I expect?
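The workaround I've seen described is doing the join by hand: read just the requestId columns from both tables, intersect them in memory, then fetch the matching rows. This is the untested sketch I have in mind (the second table's name is made up); I'd expect the cost to be dominated by the two key-column reads plus the row fetches, but that's exactly the part I have no numbers for:

    import numpy as np
    import tables as tb

    with tb.open_file("requests.h5", "r") as h5:
        t1 = h5.root.requests       # requestId, A..Cp
        t2 = h5.root.requests_def   # requestId, D, E, F

        # Manual join: the keys are tiny next to the 20GB payload, so
        # intersect them in memory, then fetch only the matching rows.
        ids1 = t1.col("requestId")
        ids2 = t2.col("requestId")
        common = np.intersect1d(ids1, ids2)   # ~80% overlap expected

        rows1 = t1.read_coordinates(np.flatnonzero(np.isin(ids1, common)))
        rows2 = t2.read_coordinates(np.flatnonzero(np.isin(ids2, common)))

        # Sort both sides by key so the rows line up for correlation.
        rows1 = rows1[np.argsort(rows1["requestId"])]
        rows2 = rows2[np.argsort(rows2["requestId"])]
        print(np.corrcoef(rows1["A"], rows2["D"])[0, 1])

If the matched subsets fit in RAM, converting each side to a DataFrame and using pd.merge(df1, df2, on="requestId") would presumably be the simpler route.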