I have two CSV files (approximately 4 GB each) and I want to find the differences between their entries, ignoring row order.
For example, if row 1 of 1.csv does not match row 1 of 2.csv but is identical to row 100 of 2.csv, it should not be reported as a difference.
A row should be reported as a difference only when no identical row exists in the other file. One constraint: I cannot use a database.
I am using dask.dataframe to read the files, but I cannot find any API or function in the Dask documentation for computing this kind of difference.
I also have not found a way to convert the Dask DataFrames into pandas DataFrames, or to write them out as text or CSV files.
Is there a way to compare files this large and find the differences?
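To make the requirement concrete, here is a minimal sketch of the order-insensitive comparison described above. It uses only the Python standard library rather than Dask, and it rests on several assumptions: rows are compared as raw text (so `1` and `1.0` count as different), no quoted field contains an embedded newline, both files are UTF-8, and the set of row digests for one file fits in memory. The function names (`row_digests`, `rows_missing_from`, `unmatched_rows`) are illustrative and do not come from any library.

```python
import hashlib


def _digest(row):
    # 16-byte BLAKE2b digest of one row; collisions are possible in
    # principle but extremely unlikely at this size.
    return hashlib.blake2b(row, digest_size=16).digest()


def row_digests(path):
    """Return the set of digests of every line in the file at `path`."""
    digests = set()
    with open(path, "rb") as handle:
        for line in handle:
            # Strip the line terminator so "\n" vs "\r\n" does not matter.
            digests.add(_digest(line.rstrip(b"\r\n")))
    return digests


def rows_missing_from(path, other_digests):
    """Yield each line of `path` whose digest is not in `other_digests`."""
    with open(path, "rb") as handle:
        for line in handle:
            row = line.rstrip(b"\r\n")
            if _digest(row) not in other_digests:
                yield row.decode("utf-8")


def unmatched_rows(path_a, path_b):
    """Return (rows only in path_a, rows only in path_b), ignoring order."""
    only_a = list(rows_missing_from(path_a, row_digests(path_b)))
    only_b = list(rows_missing_from(path_b, row_digests(path_a)))
    return only_a, only_b
```

For two small files whose data rows are `1,x`, `2,y`, `3,z` and `3,z`, `1,x`, `4,w`, `unmatched_rows` would report `2,y` as present only in the first file and `4,w` as present only in the second. Each file is streamed twice, so only the digests (not the rows themselves) are held in memory; whether that is small enough for 4 GB inputs depends on the number of rows.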
Here is the sample code I have tried:
import dask.dataframe as dd
import numpy.testing as npt
import pandas as pd
filename1 = '/Users/saikatbhattacharjee/Downloads/2008.csv'
df1 = dd.read_csv(filename1, assume_missing=True)
filename2 = '/Users/saikatbhattacharjee/Downloads/2009.csv'
df2 = dd.read_csv(filename2, assume_missing=True)
def assert_frames_equal(actual, expected, use_close=False):
    """
    Compare DataFrame items by index and column and
    print any item that is not equal.

    Ordering is unimportant, items are compared only by label.
    NaN and infinite values are supported.

    Parameters
    ----------
    actual : pandas.DataFrame
    expected : pandas.DataFrame
    use_close : bool, optional
        If True, use numpy.testing.assert_allclose instead of
        numpy.testing.assert_equal.

    """
    if use_close:
        comp = npt.assert_allclose
    else:
        comp = npt.assert_equal

    assert (isinstance(actual, pd.DataFrame) and
            isinstance(expected, pd.DataFrame)), \
        'Inputs must both be pandas DataFrames.'

    identical = True

    for i, exp_row in expected.iterrows():
        assert i in actual.index, 'Expected row {!r} not found.'.format(i)

        act_row = actual.loc[i]

        for j, exp_item in exp_row.items():
            assert j in act_row.index, \
                'Expected column {!r} not found.'.format(j)

            act_item = act_row[j]

            # The numpy.testing helpers return None and raise
            # AssertionError on a mismatch, so catch it instead of
            # testing the return value.
            try:
                comp(act_item, exp_item)
            except AssertionError:
                identical = False
                print('Difference at row {!r}, column {!r}'.format(i, j))

    if identical:
        print("CSV files are identical")

actual = pd.DataFrame(df1)
expected = pd.DataFrame(df2)
assert_frames_equal(actual, expected)
I get the following error:
  File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "/Users/saikatbhattacharjee/.spyder-py3/temp.py", line 52, in <module>
    actual = pd.DataFrame(df1)
  File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 354, in __init__
    raise ValueError('DataFrame constructor not properly called!')
ValueError: DataFrame constructor not properly called!