3

I have a large CSV file (>100 GB) that I want to read and process in chunks. There are two constraints:

  1. Obviously I cannot read the entire file into memory. I only have about 8 GB of RAM on my machine.
  2. The data is tabular and unordered. I need to read the data in groups.

Ticker  Date      Field1  Field2  Field3
AAPL    20201201  0       0       0
AAPL    20201202  0       0       0
AAPL    20201203  0       0       0
AAPL    20201204  0       0       0
NFLX    20201201  0       0       0
NFLX    20201202  0       0       0
NFLX    20201203  0       0       0
NFLX    20201204  0       0       0

The concern here is that the data has to be read in groups, grouped by Ticker and Date. If I say I want to read 10,000 records in each batch, the boundary of that batch should not split groups; i.e., all the AAPL data for December 2020 should end up in the same batch and should not be spread across two batches.

When my co-workers face a situation like this, they usually write a bash script that uses awk, cut, sort, and uniq to divide the data into groups and write multiple intermediate files to disk, then process those files with Python. I was wondering if there is a homogeneous Python/Pandas/NumPy solution to this.

Aditya
  • The csv reader of pandas has a chunksize parameter – Dani Mesejo Dec 20 '20 at 20:03
  • Yeah but that's determined by the number of rows, not a value. Try this solution: https://stackoverflow.com/questions/42228770/load-pandas-dataframe-with-chunksize-determined-by-column-variable – Alexander Riedel Dec 20 '20 at 20:05
  • I would suggest taking a look at Dask: https://stackoverflow.com/questions/39426511/can-dask-be-used-to-groupby-and-recode-out-of-core, it is designed exactly for out of core needs like you mention – anon01 Dec 20 '20 at 20:10
  • and on a practical note: just buy more RAM – anon01 Dec 20 '20 at 20:13
  • Have you considered [dask](https://dask.org/)? – blacksite Dec 20 '20 at 20:14
  • Depending on the dataframe size (and, to be accurate, on the weight of the columns), you may also try reading just the 2 key columns to compute indexes (using chunks and setting the dtype to categorical for the ticker). – tgrandje Dec 20 '20 at 20:15
  • So does the data of any given month fit in memory? – Dani Mesejo Dec 20 '20 at 20:17

3 Answers

0

How about this:

  1. Open the file.
  2. Loop over the lines; for each line read:
  • parse the ticker
  • if that ticker hasn't been seen yet:
    • create and open a file for that ticker (a "ticker file")
    • add it to a dict where key = ticker and value = file handle
  • write the line to that ticker file
  3. Close the ticker files and the original file.
  4. Process each ticker file on its own (a sketch follows below).
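
Something along these lines (untested sketch; it assumes a comma-separated file with Ticker as the first column, and the file/directory names are placeholders):

import csv
import glob
import os
import pandas as pd

def split_by_ticker(path, out_dir="by_ticker"):
    # Stream the big CSV once, writing each row to a per-ticker file.
    os.makedirs(out_dir, exist_ok=True)
    handles = {}   # ticker -> open file handle
    writers = {}   # ticker -> csv.writer for that handle
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            ticker = row[0]                      # assuming Ticker is the first column
            if ticker not in writers:
                fh = open(os.path.join(out_dir, f"{ticker}.csv"), "w", newline="")
                handles[ticker] = fh
                writers[ticker] = csv.writer(fh)
                writers[ticker].writerow(header)
            writers[ticker].writerow(row)
    for fh in handles.values():
        fh.close()

split_by_ticker("big_file.csv")                  # "big_file.csv" is a placeholder
for ticker_file in glob.glob("by_ticker/*.csv"): # each ticker file now fits in memory
    df = pd.read_csv(ticker_file)
    # ... group further by Date (e.g. by month) and process ...

Note that this keeps one file handle open per ticker, which is fine for a few thousand tickers but may hit the OS limit on open files beyond that.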
0

I would look into two options: Vaex and Dask.

Vaex seems to be focused on exactly what you need: lazy processing of very large datasets. Check their GitHub. However, it seems you need to convert the file to HDF5 first, which may be a little time-consuming.

As far as Dask is concerned, I wouldn't count on success. Dask is primarily focused on distributed computation, and I am not really sure whether it can process large files lazily. But you can try and see.
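
For illustration, the Vaex route could look roughly like this (untested; the file names and chunk size are placeholders, and the exact API may vary with the Vaex version):

import vaex

# One-off: convert the CSV to HDF5 in chunks, so the whole file never
# has to fit in memory ("huge.csv" / "huge.hdf5" are placeholder names).
df = vaex.from_csv("huge.csv", convert="huge.hdf5", chunk_size=5_000_000)

# Later runs can memory-map the HDF5 file directly:
df = vaex.open("huge.hdf5")

# Selections are lazy; only the rows of one group are materialized here.
aapl_dec = df[(df.Ticker == "AAPL") & (df.Date >= 20201201) & (df.Date <= 20201231)]
print(aapl_dec.to_pandas_df())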

Martin
0

This approach is pure pandas. It uses two functions: one to compute the indexes, one to read one group. I'd say it would fail utterly if any of your groups doesn't fit in memory (but given your requirement that each group be read in a single batch, it's a safe bet that it does).

You would need to loop over the dictionary of indexes (as computed by the first function) to read the entire dataframe; see the usage note after the code.

Hope that helps... (Don't hesitate to adapt the chunksize default to your needs.)

import pandas as pd

def compute_indexes(url, cols_indexes=[], chunksize=100000, **kwargs):
    """
    Returns a dictionary:
    keys are the group labels (tuples built from cols_indexes),
    values are lists of the row positions belonging to that group.
    """
    iterator = pd.read_csv(
            url,
            usecols=cols_indexes,
            chunksize=chunksize,
            **kwargs)

    dict_groups = dict()
    for df in iterator:
        # groups present in this chunk
        groups_present = df.drop_duplicates(keep="first").values.tolist()
        # keep the global row position as a column, then index by the group columns
        df.reset_index(drop=False, inplace=True)
        df.set_index(cols_indexes, inplace=True)
        for group in groups_present:
            group = tuple(group)
            if group not in dict_groups:
                dict_groups[group] = []
            try:
                dict_groups[group] += df.loc[group]['index'].tolist()
            except TypeError:
                # only one row for this group in the chunk (df.loc returned a scalar)
                dict_groups[group] += [df.loc[group]['index']]

    return dict_groups

def read_csv_group(url, dict_groups, which_group, **kwargs):
    if isinstance(which_group, list):
        which_group = tuple(which_group)
    rows = dict_groups[which_group]
    # +1 offset: row 0 of the file is the header, data rows start at 1
    rows_to_keep = {row + 1 for row in rows}
    def skip_rows(x):
        # keep the header and the rows of the requested group, skip everything else
        return x != 0 and x not in rows_to_keep
    df = pd.read_csv(url, skiprows=skip_rows, **kwargs)
    return df
    
URL = "./dummy.csv"
indexes = ['Ticker', 'Date']
kwargs = {'dtype': {'Ticker': str, 'Date': int}}
dict_groups = compute_indexes(URL, indexes, chunksize=100000, **kwargs)
df_one_group = read_csv_group(URL, dict_groups, ('AAPL', 20201201), **kwargs)
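
For completeness, looping over every group could look like this (process() is a hypothetical placeholder for your own logic; note that each call re-reads the CSV with a skiprows filter):

for group in dict_groups:
    df_group = read_csv_group(URL, dict_groups, group, **kwargs)
    process(df_group)   # process() stands in for whatever you do with one group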
tgrandje