
I need to import large datasets and merge them. I know there are other questions similar to this, but I could not find an answer specific to my problem. With dask I was able to read the large datasets into a dataframe, but I could not merge it with another dataframe.

import dask.dataframe as dd
import pandas as pd

# I have to use dask here, since with pandas I run out of memory and Python gets killed
ps = dd.read_csv('*.dsv',sep='|',low_memory=False)
mx = dd.read_csv('test.csv',sep='|',low_memory=False)

# this is where I get the error
mg = pd.merge(ps,mx,left_on='ACTIVITY_ID',right_on='WONUM')

ValueError: can not merge DataFrame with instance of type <class 'dask.dataframe.core.DataFrame'>

It is obvious that it cannot merge a dask dataframe with a pandas dataframe, but how else can I do this? Can I use PySpark or some other method?

    I don't know much about dask but I think you just need `dd.merge()` rather than `pd.merge()`? – JohnE Oct 15 '17 at 18:24

1 Answer


@JohnE is right - Dask dataframes have a merge method, which (not coincidentally) is very similar to the pandas one; so, since you seem to need an inner merge, you should simply do:

mg = ps.merge(mx,left_on='ACTIVITY_ID',right_on='WONUM') # how='inner' by default, just as in pandas

The Dask compute method might also be useful, in case you want to convert the resulting Dask dataframe into a pandas one (from_pandas goes in the opposite direction, pandas to Dask).
