How can I compare two large CSV files using Dask

Question

I am having two CSV files(approx 4GB each) and I want to check the difference between the entries of these two files.

Suppose Row1 entries in 1.csv doesn't match with row1 of 2.csv but identical to row 100 of 2.csv then it shouldn't show any difference.

Difference should be seen only when there are no identical entries in both of the CSV file. Constraint is can't use any database.

I am using dask.Dataframe to input this files, but I don't see any api's or function to find the difference in Dask documentations.

I even haven't find anyway to convert Dask Dataframes into Panda Dataframes, neither I am able to convert this Dataframes into any of text or CSV file.

Is there any solution for this to compare such huge files and find out the difference.

Please find the sample code I have tried.

import dask.dataframe as dd
import numpy.testing as npt
import pandas as pd

filename1 = '/Users/saikatbhattacharjee/Downloads/2008.csv'
df1 = dd.read_csv(filename1, assume_missing=True)
filename2 = '/Users/saikatbhattacharjee/Downloads/2009.csv'
df2 = dd.read_csv(filename2, assume_missing=True )

def assert_frames_equal(actual, expected, use_close=False):
    """
    Compare DataFrame items by index and column and
    raise AssertionError if any item is not equal.

    Ordering is unimportant, items are compared only by label.
    NaN and infinite values are supported.

    Parameters
    ----------
    actual : pandas.DataFrame
    expected : pandas.DataFrame
    use_close : bool, optional
    If True, use numpy.testing.assert_allclose instead of
    numpy.testing.assert_equal.

    """
    if use_close:
        comp = npt.assert_allclose
    else:
        comp = npt.assert_equal

    assert (isinstance(actual, pd.DataFrame) and
            isinstance(expected, pd.DataFrame)), \
        'Inputs must both be pandas DataFrames.'

    for i, exp_row in expected.iterrows():
        assert i in actual.index, 'Expected row {!r} not 
        found.'.format(i)

        act_row = actual.loc[i]

        for j, exp_item in exp_row.iteritems():
            assert j in act_row.index, \
                'Expected column {!r} not found.'.format(j)

            act_item = act_row[j]

            if comp(act_item, exp_item):
               print("CSV files are identical") 
            else:
                print('The difference in CSV files are'.format(j, i))

actual = pd.DataFrame(df1)
expected = pd.Dataframe(df2)
assert_frames_equal(actual, expected)

I am getting below error on this:

File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/Users/saikatbhattacharjee/.spyder-py3/temp.py", line 52, in <module>
    actual = pd.DataFrame(df1)

  File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 354, in __init__
    raise ValueError('DataFrame constructor not properly called!')

ValueError: DataFrame constructor not properly called!

score 1 · Answer 1 · answered Jan 18 '18 at 16:19

1

A couple of quick answers (and I would say this is pretty well documented)

To convert a dask dataframe to a pandas one:

actual = df1.compute()

(make sure you have enough memory to do this!)

To write data out to CSV

actual.to_csv('myfiles*.csv')

This writes one file per partition, because multiple threads/processes cannot write to a single file simultaneously in any sensible manner; you can concatenate the files (removing the header, or having used header=False) into one, if you wish.

answered Jan 18 '18 at 16:19

mdurant

27,272
5
45
74

"actual.to_csv('myfiles*.csv')" I have seen this in Dask API but i dont think it will solve my purpose as it will create separate csv files. Also suppose if the filename2 has a similar entry in some different row in comparison to filename1, it will be very difficult comparing the csv files of df1 and df2. – Saikat Jan 18 '18 at 17:13
Hi @mdurant is there any way i can compare df1 and df2 and store the result in another dask object – Saikat Jan 18 '18 at 17:47
Sure, I understand that is your main question, but the two points here answer other questions you posed. – mdurant Jan 18 '18 at 18:09

How can I compare two large CSV files using Dask

1 Answers1

To convert a dask dataframe to a pandas one:

To write data out to CSV