I have around 600 GB of CSV data (around 1 billion lines) stored in around 80 million text files.

For further analysis, specifically network analysis, I would first have to aggregate some of the data and then build the analysis on top of that.

Normally, I would use a database to parse and store the CSVs' content and do the aggregation there, but I like the idea of working in memory because of the heavy computing resources available at work.

For the aggregation step: how would one do that utilizing 40 cores and 120 GB of RAM? Pandas will probably not do the trick; what about Dask or Modin? Reading the 600 GB of CSV into a dataframe, aggregating, and then saving the result back to CSV seems like one approach...
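For the Dask route, a minimal sketch might look like the one below. It assumes all CSVs share a schema and that the grouping column is called `group_col`; the path pattern and column name are placeholders, not anything from the question.

```python
# A rough Dask sketch, not a definitive implementation. "data/*.csv" and
# "group_col" are placeholders for the real file locations and grouping column.
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Pin the local cluster to the available hardware: 40 workers sharing 120 GB.
    cluster = LocalCluster(n_workers=40, threads_per_worker=1, memory_limit="3GB")
    client = Client(cluster)

    # Build a lazy dataframe over all files; nothing is read into memory yet.
    df = dd.read_csv("data/*.csv", blocksize="64MB", dtype=str)

    # Count rows per group; Dask computes this partition by partition, out of core.
    counts = df.groupby("group_col").size().to_frame("count")

    # Trigger the computation and write the (comparatively small) result to disk.
    counts.to_csv("group_counts-*.csv")
```

One caveat with this sketch: 80 million tiny files will produce a very large task graph and a lot of per-file overhead, so it may be worth concatenating them into fewer, larger files before pointing Dask at them.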

Ranger
  • Are your CSV files ordered? I mean, for a group aggregation, do you need the first X files in a row, or is your data scattered? – Corralien Mar 16 '23 at 17:34
  • The files are named entities; however, for the first aggregation I plan to do a count per group. Therefore, I would either need to iteratively look at each and every line and increment the count, or have a complete dataframe with all the lines. TL;DR: they are not scattered, but I do not plan to aggregate based on file names but on line content. – Ranger Mar 16 '23 at 17:41
  • Yes, but you can read sequentially, and when you know the end of a group has been reached (because the group value in a column changes, for example), apply your aggregation to the data already loaded and store the result. Then continue with the next group, and so on. – Corralien Mar 16 '23 at 18:36
  • Load it into ClickHouse and then analyze. – westandskif Mar 17 '23 at 14:38
  • @Corralien I can read sequentially, but at some point I would still have to have it all in memory, since the group-by is indicated by a column value and not the file name :( – Ranger Mar 17 '23 at 19:08
  • @westandskif I googled it and found few entries; it seems to be a cloud solution or a DBMS? That would not make sense here. If I did want to establish a full-scale database, I would have just used Postgres or the like. – Ranger Mar 17 '23 at 19:10
  • That is not a problem. You read the first file; all values are G1, for example. Continue reading files sequentially until you meet a new group in the dataframe, concat all rows with G1, and compute the aggregation. Repeat the same operation for G2 until you meet the next group. In other words, aggregate values as you go instead of at the end (see the sketch after these comments). – Corralien Mar 17 '23 at 19:11
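Corralien's suggestion from the comments could look roughly like the following with plain pandas, assuming the files are sorted so that each group's rows are contiguous across files; the glob pattern and the column names (`group_col`, `count`) are placeholders.

```python
# A minimal sketch of per-group streaming aggregation, assuming contiguous groups.
import glob
import pandas as pd

results = []            # one aggregated row per finished group
buffer = []             # row blocks belonging to the group currently being read
current_group = None

for path in sorted(glob.glob("data/*.csv")):
    chunk = pd.read_csv(path)
    # sort=False keeps the groups in the order they appear in the file.
    for group, part in chunk.groupby("group_col", sort=False):
        if current_group is not None and group != current_group:
            # The previous group is complete: aggregate it and free the memory.
            block = pd.concat(buffer, ignore_index=True)
            results.append({"group": current_group, "count": len(block)})
            buffer = []
        current_group = group
        buffer.append(part)

# Flush the last group after the final file.
if buffer:
    block = pd.concat(buffer, ignore_index=True)
    results.append({"group": current_group, "count": len(block)})

pd.DataFrame(results).to_csv("group_counts.csv", index=False)
```

For a pure count you would not even need to buffer the rows; keeping a running total per group would be enough. Buffering and concatenating, as sketched here, only matters for aggregations that need all of a group's rows at once.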

0 Answers