
I have a large set of csv files (file_1.csv, file_2.csv, ...), separated by time period, that can't fit into memory. Each file is in the format shown below.


| instrument | time | code     | val           |
|------------|------|----------|---------------|
| 10         | t1   | c1_at_t1 | v_of_c1_at_t1 |
| 10         | t1   | c2_at_t1 | v_of_c2_at_t1 |
| 10         | t2   | c1_at_t2 | v_of_c1_at_t2 |
| 10         | t2   | c3_at_t2 | v_of_c3_at_t2 |
| 11         | t1   | c4_at_t1 | v_of_c4_at_t1 |
| 11         | t1   | c5_at_t1 | v_of_c5_at_t1 |
| 12         | t2   | c6_at_t2 | v_of_c6_at_t2 |
| 13         | t3   | c9_at_t3 | v_of_c9_at_t3 |

Each file contains instrument logs in a consistent format. There is a set of instruments, each of which can emit different codes (code) at a given timestamp (time). The value of that code at a given time for a given instrument is stored in the val column.

I would like to split each file (e.g. file_1.csv) using the instrument column (e.g. 10) and then join the pieces extracted for that instrument (e.g. 10) across all files (file_1.csv, file_2.csv, ...).

I am thinking of using a Dask groupby operation on the instrument column. Is there an alternative or better approach than groupby, or a better way to extract the files by instrument?

The code I have written to do the above operation is:

import glob
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

def read_files(files):
    files = glob.glob(files)

    for f in files:
        df = dd.read_csv(f, blocksize='256MB')
        # materialise the unique instrument ids so they can be iterated over
        unique_inst = df['instrument'].unique().compute()
        gb = df.groupby('instrument')

        for v in unique_inst:
            # write one parquet file per instrument per input csv
            gb.get_group(v).to_parquet(f'{v}_{f[:-4]}.parquet')

Once I have the files in the f'{v}_{f[:-4]}.parquet' format, I can use pandas to concatenate the pieces extracted for each instrument from all the files (file_1.csv, file_2.csv, ...).
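For reference, a minimal sketch of that concatenation step for one instrument (assuming the naming scheme above, that the combined result fits in memory, and an illustrative output filename) might be:

import glob

import pandas as pd

# gather the pieces written for instrument 10 from every input csv,
# e.g. 10_file_1.parquet, 10_file_2.parquet, ...
pieces = sorted(glob.glob('10_file_*.parquet'))

# concatenate them into a single frame for that instrument
inst_10 = pd.concat([pd.read_parquet(p) for p in pieces], ignore_index=True)

# the instrument column is constant within this frame, so it can be dropped
inst_10 = inst_10.drop(columns='instrument')
inst_10.to_parquet('instrument_10.parquet', index=False)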

The final file for instrument 10 should look something like the table below, where the observations at t7 and t9 are concatenated from the observations for instrument 10 in the other files:

| time | code     | val           |
|------|----------|---------------|
| t1   | c1_at_t1 | v_of_c1_at_t1 |
| t1   | c2_at_t1 | v_of_c2_at_t1 |
| t2   | c1_at_t2 | v_of_c1_at_t2 |
| t2   | c3_at_t2 | v_of_c3_at_t2 |
| t7   | c4_at_t7 | v_of_c4_at_t7 |
| t9   | c5_at_t9 | v_of_c5_at_t9 |
RTM
  • It's not clear to me: does a single file like `file1.csv` fit in memory? – rpanai Nov 21 '19 at 12:10
  • None of the `csv` files can fit into memory, as the size of each file is more than 100 GB – RTM Nov 21 '19 at 14:59
  • Wow, this changes things a bit. Questions: 1. What are the datatypes? 2. For one file only, where `df` is a dask.dataframe, is `df.groupby("instrument")["val"].sum().compute()` giving you a memory error? – rpanai Nov 21 '19 at 15:11
  • Sorry if my first statement in the question caused any confusion; I mentioned that the large csv files `can't fit into memory`. All the column values are stored as strings in the csvs. I can do `df.groupby("instrument")["val"].sum().compute()`, but since each column is a string, `sum` will concatenate the column values – RTM Nov 21 '19 at 15:40
  • Is it normal that *value* is a `str`, or do you eventually need to transform it to float? What about *time* and *instrument*? Converting types will help you save a lot of memory. – rpanai Nov 21 '19 at 16:14
  • If you can, try to produce a [mcve](/help/mcve) and explain what you want to achieve – rpanai Nov 21 '19 at 16:15
  • The value of `time` should eventually be converted to `datetime` format and `value` to `float` format. The other columns (`instrument` and `code`) can stay in `str` format – RTM Nov 21 '19 at 18:48
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202821/discussion-between-rtm-and-rpanai). – RTM Nov 21 '19 at 18:55
  • I added a brief example of the output format that I am looking for – RTM Nov 21 '19 at 18:55

2 Answers


I'm not exactly sure what you need to achieve, but I don't think you need any groupby for your problem. It seems to me to be a simple filtering issue.

You can just loop over all your files, create new per-instrument files, and append to those.

Also, I don't have example files to experiment with, but I think you can also just use pandas with chunksize to read large csv files.

Example:

import pandas as pd
import glob
import os

# maybe play around to get better performance
chunksize = 1000000

files = glob.glob('./file_*.csv')
for f in files:

    for chunk in pd.read_csv(f, chunksize=chunksize):
        u_inst = chunk['instrument'].unique()

        for inst in u_inst:
            # filter instrument data
            inst_df = chunk[chunk.instrument == inst]
            # filter columns
            inst_df = inst_df[['time', 'code', 'val']]
            # append to instrument file
            # only write header if not exist yet
            inst_file = f'./instrument_{inst}.csv'
            file_exist = os.path.isfile(inst_file)
            # index=False keeps the chunk's row index out of the output csv
            inst_df.to_csv(inst_file, mode='a', header=not file_exist, index=False)
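As a follow-up, tying in the comment discussion about dtypes, a rough sketch of reading one of those per-instrument csv files back with memory-friendly types (assuming the time values are real, parseable timestamps) might look like:

import pandas as pd

# hypothetical read-back of a single instrument file:
# parse time as datetime, store val as float, keep code as string
inst_10 = pd.read_csv(
    './instrument_10.csv',
    dtype={'code': str, 'val': float},
    parse_dates=['time'],
)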
mjspier
  • Thanks. One caveat of this approach is that we loop over each chunk as many times as there are `u_inst` values. What is your opinion on reading each line and appending it to the correct `inst_file`, i.e. using the same approach mentioned above with a `chunksize` of `1`? – RTM Dec 02 '19 at 18:14
  • The bottleneck in such operations is usually IO. A bigger chunk size lets you read more into memory at once, and the filtering itself on the chunk is a lookup and is super fast. Therefore I would recommend using a large chunk size, maybe even larger than what I proposed. I would be surprised if a chunksize of 1 gave good results. – mjspier Dec 03 '19 at 08:29
  • I tried the same approach with chunksize=1. The process is very slow compared to reading in chunks and then filtering by inst – RTM Dec 04 '19 at 18:22

In case every single file fits in memory, you can try this:

import dask.dataframe as dd
import pandas as pd
import numpy as np
import os

Generate dummy files

fldr_in = "test_in"
fldr_out = "test_out"

N = int(1e6)
for i in range(10):
    fn = f"{fldr_in}/file{i}.csv"
    os.makedirs(os.path.dirname(fn), exist_ok=True)
    df = pd.DataFrame({"instrument":np.random.randint(10,100,N),
                       "value":np.random.rand(N)})
    df.to_csv(fn, index=False)

Define function

The following function saves a parquet file for every single instrument under the path fldr_out/instrument=i/fileN.parquet:

def fun(x, fn, fldr_out):
    inst = x.instrument.unique()[0]
    filename = os.path.basename(fn)
    fn_out = f"{fldr_out}/instrument={inst}/{filename}"
    fn_out = fn_out.replace(".csv", ".parquet")
    os.makedirs(os.path.dirname(fn_out), exist_ok=True)
    x.drop("instrument", axis=1)\
     .to_parquet(fn_out, index=False)

and you can use it with a groupby:

# the dummy files generated above live in fldr_in
files = os.listdir(fldr_in)

for f in files:
    fn = f"{fldr_in}/{f}"
    df = pd.read_csv(fn)
    df.groupby("instrument").apply(lambda x: fun(x, fn, fldr_out))

Performing Analysis with dask

Now you can use dask to read the results and perform your analysis:

df = dd.read_parquet(fldr_out)
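Since the output is hive-partitioned by instrument (the instrument=... directories), a single instrument can presumably be pulled back out on its own; a sketch, assuming the partition value is inferred as an integer:

import dask.dataframe as dd

# read only the partition for instrument 10; the filter is pushed down to
# the instrument=10 directory instead of scanning every file
# (depending on how the partition value is inferred, "10" may be needed instead of 10)
df_10 = dd.read_parquet(fldr_out, filters=[("instrument", "==", 10)])
print(df_10.head())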
rpanai