3

I have a large CSV file (>100 GB) that I want to read and process in chunks. There are two constraints:

  1. Obviously I cannot read the entire file into memory. I only have about 8 GB of RAM on my machine.
  2. The data is tabular and unordered. I need to read the data in groups.

Ticker  Date      Field1  Field2  Field3
AAPL    20201201  0       0       0
AAPL    20201202  0       0       0
AAPL    20201203  0       0       0
AAPL    20201204  0       0       0
NFLX    20201201  0       0       0
NFLX    20201202  0       0       0
NFLX    20201203  0       0       0
NFLX    20201204  0       0       0

The concern here is that the data has to be read in groups, grouped by Ticker and Date. If I say I want to read 10,000 records in each batch, the boundary of that batch should not split groups; i.e., all the AAPL data for December 2020 should end up in the same batch and should not be spread across two batches.

When my co-workers face a situation like this, they usually write a bash script that uses awk, cut, sort, and uniq to divide the data into groups and write multiple intermediate files to disk, then process those files with Python. I was wondering if there is a homogeneous Python/Pandas/NumPy solution to this.

Aditya
  • The csv reader of pandas has a chunksize parameter – Dani Mesejo Dec 20 '20 at 20:03
  • Yeah but that's determined by the number of rows, not a value. Try this solution: https://stackoverflow.com/questions/42228770/load-pandas-dataframe-with-chunksize-determined-by-column-variable – Alexander Riedel Dec 20 '20 at 20:05
  • I would suggest taking a look at Dask: https://stackoverflow.com/questions/39426511/can-dask-be-used-to-groupby-and-recode-out-of-core, it is designed exactly for out of core needs like you mention – anon01 Dec 20 '20 at 20:10
  • and on a practical note: just buy more RAM – anon01 Dec 20 '20 at 20:13
  • Have you considered [dask](https://dask.org/)? – blacksite Dec 20 '20 at 20:14
  • Depending on the dataframe size (and, to be accurate, on the weight of the columns), you may also try reading just the 2 key columns to compute indexes (using chunks and setting the dtype to categorical for the ticker). – tgrandje Dec 20 '20 at 20:15
  • So does the data of any given month fit in memory? – Dani Mesejo Dec 20 '20 at 20:17

3 Answers

0

How about this:

  1. Open the file.
  2. Loop over the lines; for each line read:
  • parse the ticker
  • if that ticker hasn't been seen yet:
    • create and open a file for that ticker (a "ticker file")
    • add it to a dict where key = ticker and value = file handle
  • write the line to that ticker file
  3. Close the ticker files and the original file.
  4. Process each ticker file on its own (a sketch follows below).
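
Something along these lines (untested sketch; it assumes a comma-separated file with Ticker as the first column, and the file/directory names are placeholders):

import csv
import glob
import os
import pandas as pd

def split_by_ticker(path, out_dir="by_ticker"):
    # Stream the big CSV once, writing each row to a per-ticker file.
    os.makedirs(out_dir, exist_ok=True)
    handles = {}   # ticker -> open file handle
    writers = {}   # ticker -> csv.writer for that handle
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            ticker = row[0]                      # assuming Ticker is the first column
            if ticker not in writers:
                fh = open(os.path.join(out_dir, f"{ticker}.csv"), "w", newline="")
                handles[ticker] = fh
                writers[ticker] = csv.writer(fh)
                writers[ticker].writerow(header)
            writers[ticker].writerow(row)
    for fh in handles.values():
        fh.close()

split_by_ticker("big_file.csv")                  # "big_file.csv" is a placeholder
for ticker_file in glob.glob("by_ticker/*.csv"): # each ticker file now fits in memory
    df = pd.read_csv(ticker_file)
    # ... group further by Date (e.g. by month) and process ...

Note that this keeps one file handle open per ticker, which is fine for a few thousand tickers but may hit the OS limit on open files beyond that.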
0

I would look into two options: Vaex and Dask.

Vaex seems to be focused on exactly what you need: lazy processing of very large datasets. Check their GitHub. However, it seems you need to convert the file to HDF5 first, which may be a little time-consuming.

As far as Dask is concerned, I wouldn't count on success. Dask is primarily focused on distributed computation, and I am not really sure whether it can process large files lazily. But you can try and see.
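
For illustration, the Vaex route could look roughly like this (untested; the file names and chunk size are placeholders, and the exact API may vary with the Vaex version):

import vaex

# One-off: convert the CSV to HDF5 in chunks, so the whole file never
# has to fit in memory ("huge.csv" / "huge.hdf5" are placeholder names).
df = vaex.from_csv("huge.csv", convert="huge.hdf5", chunk_size=5_000_000)

# Later runs can memory-map the HDF5 file directly:
df = vaex.open("huge.hdf5")

# Selections are lazy; only the rows of one group are materialized here.
aapl_dec = df[(df.Ticker == "AAPL") & (df.Date >= 20201201) & (df.Date <= 20201231)]
print(aapl_dec.to_pandas_df())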

Martin
0

This approach is pure pandas. It uses two functions: one to compute the indexes, one to read one group. I'd say it would fail utterly if any of your groups doesn't fit in memory (but given your requirement that each group be read in a single batch, it's a safe bet that it does).

You would need to loop over the dictionary of indexes (as computed by the first function) to read the entire dataframe; see the usage note after the code.

Hope that helps... (Don't hesitate to adapt the chunksize default to your needs.)

import pandas as pd

def compute_indexes(url, cols_indexes=[], chunksize=100000, **kwargs):
    """
    Returns a dictionary:
    keys are the group labels (tuples built from cols_indexes),
    values are lists of the row positions belonging to that group.
    """
    iterator = pd.read_csv(
            url,
            usecols=cols_indexes,
            chunksize=chunksize,
            **kwargs)

    dict_groups = dict()
    for df in iterator:
        # groups present in this chunk
        groups_present = df.drop_duplicates(keep="first").values.tolist()
        # keep the global row position as a column, then index by the group columns
        df.reset_index(drop=False, inplace=True)
        df.set_index(cols_indexes, inplace=True)
        for group in groups_present:
            group = tuple(group)
            if group not in dict_groups:
                dict_groups[group] = []
            try:
                dict_groups[group] += df.loc[group]['index'].tolist()
            except TypeError:
                # only one row for this group in the chunk (df.loc returned a scalar)
                dict_groups[group] += [df.loc[group]['index']]

    return dict_groups

def read_csv_group(url, dict_groups, which_group, **kwargs):
    if isinstance(which_group, list):
        which_group = tuple(which_group)
    rows = dict_groups[which_group]
    # +1 offset: row 0 of the file is the header, data rows start at 1
    rows_to_keep = {row + 1 for row in rows}
    def skip_rows(x):
        # keep the header and the rows of the requested group, skip everything else
        return x != 0 and x not in rows_to_keep
    df = pd.read_csv(url, skiprows=skip_rows, **kwargs)
    return df
    
URL = "./dummy.csv"
indexes = ['Ticker', 'Date']
kwargs = {'dtype': {'Ticker': str, 'Date': int}}
dict_groups = compute_indexes(URL, indexes, chunksize=100000, **kwargs)
df_one_group = read_csv_group(URL, dict_groups, ('AAPL', 20201201), **kwargs)
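
For completeness, looping over every group could look like this (process() is a hypothetical placeholder for your own logic; note that each call re-reads the CSV with a skiprows filter):

for group in dict_groups:
    df_group = read_csv_group(URL, dict_groups, group, **kwargs)
    process(df_group)   # process() stands in for whatever you do with one group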
tgrandje