
I have two DataFrames in pandas. I would like to merge them, but I keep running into a MemoryError. What is a workaround I could use?

Here is the setup:

import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
print(df1.shape) # output: (4757076, 4)
print(df2.shape) # output: (428764, 45)


df1.head()

  column1  begin    end  category
0  class1  10001  10468     third
1  class1  10469  11447     third
2  class1  11505  11675    fourth
3  class2  15265  15355   seventh
4  class2  15798  15849    second


df2.head()

   column1  begin    .... 
0  class1  10524   .... 
1  class1  10541   ....
2  class1  10549  ....
3  class1  10565  ...
4  class1  10596  ...

I would simply like to merge these two DataFrames on "column1". However, this always causes a memory error.
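
For scale, here is a rough way I can estimate how large the merged frame would get before attempting the join (just a sketch, assuming a plain many-to-many join on "column1"):

import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

# In a many-to-many merge, each key contributes
# (rows of df1 with that key) * (rows of df2 with that key) output rows.
# Multiplying the per-key counts and summing gives a lower bound on the
# merged row count (unmatched keys in an outer join add a few more).
estimate = (df1["column1"].value_counts() * df2["column1"].value_counts()).sum()
print("estimated merged rows: {:,.0f}".format(estimate))

If "column1" holds only a handful of classes, that product is enormous, which would be consistent with the MemoryError.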

Let's try this in pandas first, on a system with approximately 2 TB of RAM and hundreds of threads:

import pandas as pd
df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
merged = pd.merge(df1, df2, on="column1", how="outer", suffixes=("","_repeated"))

Here's the error I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
    sort=self.sort, how=self.how)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/src/join.pyx", line 160, in pandas.algos.full_outer_join (pandas/algos.c:61256)
MemoryError

That didn't work. Let's try with dask:


import pandas as pd
import dask.dataframe as dd
from numpy import nan


ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)

Here's the error I get:

Traceback (most recent call last):
  File "repeat_finder.py", line 15, in <module>
    merged = dd.merge(ddf1, ddf2,on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
  File "/path/python3.5/site-packages/dask/base.py", line 78, in compute
    return compute(self, **kwargs)[0]
  File "/path/python3.5/site-packages/dask/base.py", line 178, in compute
    results = get(dsk, keys, **kwargs)
  File "/path/python3.5/site-packages/dask/threaded.py", line 69, in get
    **kwargs)
  File "/path/python3.5/site-packages/dask/async.py", line 502, in get_async
    raise(remote_exception(res, tb))
dask.async.MemoryError: 

Traceback
---------
  File "/path/python3.5/site-packages/dask/async.py", line 268, in execute_task
    result = _execute_task(task, data)
  File "/path/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/path/python3.5/site-packages/dask/dataframe/methods.py", line 221, in merge
    suffixes=suffixes, indicator=indicator)
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 59, in merge
    return op.get_result()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 503, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 667, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 647, in _get_join_indexers
    how=self.how)
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 876, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/src/join.pyx", line 226, in pandas._join.full_outer_join (pandas/src/join.c:11286)
  File "pandas/src/join.pyx", line 231, in pandas._join._get_result_indexer (pandas/src/join.c:11474)
  File "path/python3.5/site-packages/pandas/core/algorithms.py", line 1072, in take_nd
    out = np.empty(out_shape, dtype=dtype, order='F')

How could I get this to work, even if it was shamelessly inefficient?

EDIT: In response to the suggestion of merging on two columns/indices, I don't think I can do this. Here is the code I am trying to run:

import pandas as pd
import dask.dataframe as dd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
merged = merged[(ddf1.column1 == row.column1) & (ddf2.begin >= ddf1.begin) & (ddf2.begin <= ddf1.end)]
merged = dd.merge(ddf2, merged, on = ["column1"]).compute(num_workers=60)
merged.to_csv("output.csv", index=False)
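
In plain pandas terms, what I am trying to express is roughly the following (a sketch only; iv_begin/iv_end are illustrative names I made up): keep each row of df2 whose begin falls inside the [begin, end] interval of some df1 row with the same column1 value.

import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

# Rename df1's interval bounds so they survive the merge without suffixes.
intervals = df1.rename(columns={"begin": "iv_begin", "end": "iv_end"})

# Pair every df2 row with every df1 interval sharing column1, then keep
# the pairs whose begin lies inside the interval.
pairs = df2.merge(intervals, on="column1", how="inner")
hits = pairs[(pairs["begin"] >= pairs["iv_begin"]) & (pairs["begin"] <= pairs["iv_end"])]
hits.to_csv("output.csv", index=False)

Of course the intermediate pairs frame is exactly the many-to-many blow-up that fails above, so this only restates the intent; it does not solve the memory problem.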
EB2127
    "approximately 2 TB of RAM and hundreds of threads" -- wowsers. First, are you on linux? If so check the ulimit and or rlimit for the task. – Brian Cain Nov 23 '16 at 17:50
  • @BrianCain Good idea. Still though---how can I do this? :) These dataframes are not *that* big – EB2127 Nov 24 '16 at 04:42
  • Ok... After looking at your edits, your approach appears wrong, IMHO. Please explain what you intend to do. It seems like you want to clip `merged` to a specific set of rows. What is in `rows`? I think you can solve this problem in a much simpler manner. – Kartik Nov 28 '16 at 05:01

1 Answer


You can't just merge the two data frames on column1 only, as column1 is not a unique identifier for each instance in either data frame: every df1 row with a given key is matched against every df2 row with that key, which multiplies the size of the output. Try:

merged = pd.merge(df1, df2, on=["column1", "begin"], how="outer", suffixes=("","_repeated"))

If df2 also has an end column, you may need to try:

merged = pd.merge(df1, df2, on=["column1", "begin", "end"], how="outer", suffixes=("","_repeated"))
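
To see the difference the keys make, try a toy example (made-up values, small enough to inspect):

import pandas as pd

# Toy frames sharing a non-unique key.
a = pd.DataFrame({"column1": ["class1"] * 3, "begin": [1, 2, 3]})
b = pd.DataFrame({"column1": ["class1"] * 4, "begin": [1, 2, 3, 4]})

# On the non-unique key alone, every row of a pairs with every row of b:
m1 = pd.merge(a, b, on="column1", how="outer", suffixes=("", "_repeated"))
print(m1.shape)  # (12, 3): 3 * 4 row pairs

# With "begin" added to the keys, the output stays linear in the input:
m2 = pd.merge(a, b, on=["column1", "begin"], how="outer", suffixes=("", "_repeated"))
print(m2.shape)  # (4, 2)

The same multiplication is what makes the full-size merge on column1 alone explode.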
mikeqfu
  • This does not answer the OP's question. The OP wants an outer join on `"column1"`, and is getting a `MemoryError`. `"column1"` being non-unique does not matter to the merging or the MemoryError. OP probably does not have sufficient resources scheduled for the task on the server. – Kartik Nov 23 '16 at 18:16
  • From my own experience, I have encountered similar `MemoryError` issues when merging data frames. A non-unique `column1` may avoid a `MemoryError` only when the data size is not too large. Given the example data frames posted in the question, there can be combinatorial growth in the size of the merged data frame if merging only on `column1`, which would be likely to cause a memory error. I suppose in this case merging on multiple columns, rather than `column1` only, may be more reasonable. – mikeqfu Nov 23 '16 at 19:12
  • Yes, OP is on a 2TB RAM system... The frames OP is dealing with will at most result in a frame of 5185840 x 49. That is nothing compared to 2 TB. My guess is, with a bare bones kind of OS, that data can be merged on a 4GB machine. Easily on an 8 GB machine... – Kartik Nov 23 '16 at 19:32
  • I understand. It might also depend on what IDE was being used, which may have its own constraints. So I was just suggesting giving it a try and merging the data on multiple columns. – mikeqfu Nov 23 '16 at 20:05
  • @mikeqfu "You can't just merge the two data frames on column1 only, as column1 is not a unique identifier for each instance in either data frame." I still don't quite follow this. Why is 'column1' not a "unique identifier for each instance in either data frame"? What would be the difference between this column being unique and non-unique? – EB2127 Nov 24 '16 at 04:44
  • You may just use the subsets of your 'df1' (5 rows, 4 cols) and 'df2' (5 rows, 2 cols) as posted in your question. Merge them by using 'column1' and ['column1', 'begin'], respectively. I'm sure you will notice the difference. – mikeqfu Nov 24 '16 at 16:44
  • @mikeqfu Thanks for the response. The problem is, I don't think I can do that. Please see my edit above, where I show the entire code I'm planning to run. As you can see, I'm trying to "search" on intervals, i.e. keep values if they fall in certain intervals. – EB2127 Nov 25 '16 at 04:30
  • @mikeqfu This question explains roughly what I would like to do, with a similar dataframe structure: http://stackoverflow.com/questions/39786406/how-to-match-multiple-columns-in-pandas-dataframe-for-an-interval Can you see how to merge on several indexes in this case? I cannot, though there may be a way. – EB2127 Nov 25 '16 at 04:43
  • @EB2127 I'm sorry if I still misunderstand your question: how about trying to run a loop through all the classes in `column1` and finding the values for each class? Say `dfs=[]`; `for x in df1.column1.unique(): df = df1[df1.column1 == x].merge(df2, on='column1', how='outer', suffixes=("", "_repeat"))`; and then find the interval values for each class: `df = df[(df.begin_repeat >= df.begin) & (df.begin_repeat <= df.end)]`; `dfs.append(df)`; and finally `pd.concat(dfs, ignore_index=True)`. This is the only thing I can think of for now. There must be much better solutions. – mikeqfu Nov 25 '16 at 14:11
  • @EB2127 Otherwise you may try Sframe / Graphlab-Create (Python 2), which may handle relatively large data better. – mikeqfu Nov 25 '16 at 14:13
  • @mikeqfu Shouldn't dask be more sophisticated than Sframe? – EB2127 Nov 25 '16 at 19:12
  • @EB2127 Sorry I don't have any experience with dask. – mikeqfu Nov 27 '16 at 18:44
  • @mikeqfu I've rephrased the question here: http://stackoverflow.com/questions/40825654/how-to-match-intervals-in-pandas-dask-dataframes-in-a-scalable-manner? Maybe a for loop? Any ideas? – EB2127 Nov 27 '16 at 20:38