5

I have a largish pandas dataframe (a 1.5 GB .csv on disk). I can load it into memory and query it. I want to create a new column that is the combined value of two other columns, and I tried this:

def combined(row):
    row['combined'] = row['col1'].join(str(row['col2']))
    return row

df = df.apply(combined, axis=1)

This results in my python process being killed, presumably because of memory issues.

A more iterative solution to the problem seems to be:

df['combined'] = ''
col_pos = list(df.columns).index('combined')
crs_pos = list(df.columns).index('col1')
sub_pos = list(df.columns).index('col2')

for row_pos in range(len(df)):
    df.iloc[row_pos, col_pos] = df.iloc[row_pos, sub_pos].join(str(df.iloc[row_pos, crs_pos]))

This of course seems very un-pandas-like, and it is very slow.

Ideally I would like something like apply_chunk() which is the same as apply but only works on a piece of the dataframe. I thought dask might be an option for this, but dask dataframes seemed to have other issues when I used them. This has to be a common problem though, is there a design pattern I should be using for adding columns to large pandas dataframes?
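
Roughly, what I imagine apply_chunk() doing is something like the following, built on pandas' own chunked CSV reader; the file paths, chunk size, and column names here are just placeholders for my real data:

import pandas as pd

# Read the CSV in pieces, add the combined column to each piece, and
# append the result to a new file. 'data.csv' and 'combined.csv' are
# placeholder paths; 100000 rows per chunk is an arbitrary size.
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=100000)):
    chunk['combined'] = [a.join(str(b))
                         for a, b in zip(chunk['col1'], chunk['col2'])]
    chunk.to_csv('combined.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)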

Christopher
  • first of all, `dask` doesn't help in your case, because your problem is I/O and memory bound instead of CPU bound, and `dask` is based on the `multiprocessing` module. – Jianxun Li Jul 22 '15 at 20:33
  • Dask can use threading, multiprocessing, or distributed computing. In his case though he needs none of these, merely a way to intelligently handle streaming data from disk. – MRocklin Jul 22 '15 at 22:19
  • What about using the apply method? – PlagTag Jul 23 '15 at 09:45

2 Answers

4

I would try using list comprehension + itertools:

import itertools
import pandas as pd

df = pd.DataFrame({
    'a': ['ab'] * 200,
    'b': ['ffff'] * 200
})

# itertools.izip is Python 2 only; on Python 3, use the built-in zip instead.
[a.join(b) for (a, b) in itertools.izip(df.a, df.b)]

It might be "unpandas", but pandas doesn't seem to have a .str method that helps you here, and it isn't "unpythonic".

To create another column, just use:

df['c'] = [a.join(b) for (a, b) in itertools.izip(df.a, df.b)]

Incidentally, you can also get your chunking using:

[a.join(b) for (a, b) in itertools.izip(df.a[10: 20], df.b[10: 20])]

If you'd like to play with parallelization, I would first try the above version, as list comprehensions and itertools are often surprisingly fast, and parallelization would add overhead that would need to be outweighed.
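
If you do get to that point, a minimal sketch (not benchmarked, and combine here is just a hypothetical top-level wrapper around the same comprehension) would be to cut the columns into slices and hand them to a multiprocessing pool:

import itertools
from multiprocessing import Pool

def combine(pair):
    a_chunk, b_chunk = pair
    # The same comprehension as above, applied to one slice of each column.
    return [a.join(b) for (a, b) in zip(a_chunk, b_chunk)]

size = 100000  # arbitrary slice size
pairs = [(list(df.a[i: i + size]), list(df.b[i: i + size]))
         for i in range(0, len(df), size)]

# Needs the usual `if __name__ == '__main__':` guard when run as a script.
with Pool() as pool:
    df['c'] = list(itertools.chain.from_iterable(pool.map(combine, pairs)))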

Ami Tavory
  • Wow. This is incredibly mind numbingly fast. Using python3 I ended up with: `df['course_combined'] = [str(a).join(b) for (a, b) in itertools.zip_longest(df['col1'], df['col2'])]` Pandas must be doing something very inefficient with iloc, I would have thought that would have been a fast but ugly way to iterate over the dataframe. – Christopher Jul 22 '15 at 21:00
1

One nice way to create a new column in pandas or dask.dataframe is with the .assign method.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': ['a', 'b', 'a', 'b']})

In [3]: df
Out[3]: 
   x  y
0  1  a
1  2  b
2  3  a
3  4  b

In [4]: df.assign(z=df.x * df.y)
Out[4]: 
   x  y     z
0  1  a     a
1  2  b    bb
2  3  a   aaa
3  4  b  bbbb

However, if your operation is highly custom (as it appears to be) and if Python iterators are fast enough (as they seem to be) then you might just want to stick with that. Anytime you find yourself using apply or iloc in a loop it's likely that Pandas is operating much slower than is optimal.
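
If the frame really doesn't fit comfortably in memory, the same assign pattern carries over to dask.dataframe, which reads the CSV in blocks. A rough sketch, with placeholder paths and column names, and simple concatenation standing in for your actual combining logic (which might instead call for map_partitions):

import dask.dataframe as dd

# Lazily read the CSV in blocks and add the new column block by block.
# 'data.csv', 'col1', and 'col2' are placeholders.
ddf = dd.read_csv('data.csv')
ddf = ddf.assign(combined=ddf.col1.astype(str) + ddf.col2.astype(str))
ddf.to_csv('combined-*.csv')  # dask writes one output file per block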

MRocklin