I have a largish pandas dataframe (a 1.5 GB .csv on disk). I can load it into memory and query it. I want to create a new column that is the combined value of two other columns, and I tried this:
def combined(row):
    row['combined'] = row['col1'].join(str(row['col2']))
    return row

df = df.apply(combined, axis=1)
This results in my python process being killed, presumably because of memory issues.
A more iterative solution to the problem seems to be:
df['combined'] = ''
col_pos = list(df.columns).index('combined')
crs_pos = list(df.columns).index('col1')
sub_pos = list(df.columns).index('col2')

for row_pos in range(len(df)):
    df.iloc[row_pos, col_pos] = df.iloc[row_pos, sub_pos].join(str(df.iloc[row_pos, crs_pos]))
This of course seems very un-pandas-like, and it is also very slow.
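(By "un-pandas-like" I mean that, if plain concatenation of the two values is really all I need, the idiomatic vectorized form would presumably be a one-liner roughly like the following; this is just a sketch, and I have not verified it stays within memory on a frame this size:)

# Vectorized sketch, assuming simple string concatenation of the two
# columns is what I actually want here:
df['combined'] = df['col1'].astype(str) + df['col2'].astype(str)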
Ideally I would like something like apply_chunk(), which would be the same as apply but would only operate on a piece of the dataframe at a time. I thought dask might be an option for this, but dask dataframes seemed to have other issues when I used them. This has to be a common problem, though: is there a design pattern I should be using for adding columns to large pandas dataframes?