I have a large CSV file (>100 GB) that I want to read and process in chunks. There are two constraints:
- Obviously I cannot read the entire file into memory; I only have about 8 GB of RAM on my machine.
- The data is tabular and unordered, and I need to read it in groups. For example:
| Ticker | Date     | Field1 | Field2 | Field3 |
|--------|----------|--------|--------|--------|
| AAPL   | 20201201 | 0      | 0      | 0      |
| AAPL   | 20201202 | 0      | 0      | 0      |
| AAPL   | 20201203 | 0      | 0      | 0      |
| AAPL   | 20201204 | 0      | 0      | 0      |
| NFLX   | 20201201 | 0      | 0      | 0      |
| NFLX   | 20201202 | 0      | 0      | 0      |
| NFLX   | 20201203 | 0      | 0      | 0      |
| NFLX   | 20201204 | 0      | 0      | 0      |
The concern here is that the data has to be read in groups, grouped by Ticker and Date. If I say I want to read 10,000 records in each batch, the boundary of that batch must not split a group; e.g. all the AAPL data for December 2020 should end up in the same batch and must not be spread across two batches.
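If the file were guaranteed to be grouped by Ticker (as in the sample above), I could imagine something like the sketch below: read with `pandas.read_csv(chunksize=...)` and carry the trailing, possibly incomplete group into the next batch. The file name `big.csv` is a placeholder, and keying on Ticker alone is a simplification of the real Ticker + month grouping:

```python
import pandas as pd

def batched_groups(path="big.csv", batch_size=10_000):
    # Read ~batch_size rows at a time and yield batches that never split a
    # Ticker group. This only works if rows for a given Ticker are contiguous
    # in the file, as they are in the sample above.
    leftover = None
    for chunk in pd.read_csv(path, chunksize=batch_size):
        if leftover is not None:
            chunk = pd.concat([leftover, chunk], ignore_index=True)
        # Hold back the trailing (possibly incomplete) group; it gets
        # completed by the beginning of the next chunk.
        last = chunk["Ticker"].iloc[-1]
        trailing = chunk["Ticker"].eq(last)
        leftover = chunk[trailing]
        complete = chunk[~trailing]
        if not complete.empty:
            yield complete
    if leftover is not None and not leftover.empty:
        yield leftover
```

But my real file is not guaranteed to be sorted by Ticker, which is where an approach like this falls apart.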
When my co-workers face a situation like this, they usually write a bash script that uses awk, cut, sort, and uniq to divide the data into groups, write multiple intermediate files to disk, and then process those files with Python. I was wondering whether there is a homogeneous Python/Pandas/NumPy solution to this.
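For reference, the split-into-intermediate-files step they use amounts to roughly this in plain Python (a sketch, not their actual script; the path `big.csv`, the output directory, and Ticker being the first column are assumptions about my schema):

```python
import csv
from pathlib import Path

def split_by_ticker(src="big.csv", out_dir="by_ticker"):
    # Stream the large CSV once and append each row to a per-Ticker file,
    # mirroring the awk/sort intermediate-file workflow. Afterwards, each
    # per-ticker file is small enough to load and process with pandas.
    Path(out_dir).mkdir(exist_ok=True)
    writers, handles = {}, []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            ticker = row[0]  # assumes Ticker is the first column
            if ticker not in writers:
                h = open(Path(out_dir) / f"{ticker}.csv", "w", newline="")
                handles.append(h)
                writers[ticker] = csv.writer(h)
                writers[ticker].writerow(header)
            writers[ticker].writerow(row)
    for h in handles:
        h.close()
```

Ideally I'd like to avoid the intermediate files and the extra disk round-trip altogether.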