I know there are a few questions on this topic, but I can't seem to get it working efficiently. I have large input datasets (2-3 GB) and my machine has 8 GB of memory. I'm using Spyder with pandas 0.24.0 installed. Processing an input file currently takes around an hour to generate an output file of around 10 MB.
I have since attempted to optimise the process by chunking the input file using the code below. Essentially, I chunk the input file into smaller segments, run each segment through some code, and export the smaller output. I then delete the chunked data to release memory, but memory usage still builds up throughout the operation and the run ends up taking a similar amount of time. I'm not sure what I'm doing wrong:
The memory usage details of the file are:
RangeIndex: 5471998 entries, 0 to 5471997
Data columns (total 17 columns):
col1 object
col2 object
col3 object
....
dtypes: object(17)
memory usage: 5.6 GB
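For context, that summary is the DataFrame's info() output; as far as I know the memory_usage='deep' option is what makes pandas measure the object columns properly instead of estimating them:

import pandas as pd

df = pd.read_csv('file.csv')
# 'deep' introspects the object (string) columns rather than estimating,
# so the reported size reflects what is actually held in memory
df.info(memory_usage='deep')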
I subset the DataFrame by passing cols_to_keep to usecols. But the headers are different for each file, so I use location indexing to pick out the relevant headers.
import numpy as np
import pandas as pd

# Because the column headers change from file to file, I use location
# indexing to read the column headers I need
df_cols = pd.read_csv('file.csv')
# Select the columns to be used by position
df_cols = df_cols.iloc[:, np.r_[1, 3, 8, 12, 23]]
# Keep the column headers
cols_to_keep = df_cols.columns
import time

PATH = '/Volume/Folder/Event/file.csv'
chunksize = 10000

df_list = []  # list to hold the batch dataframes
t = time.clock()

for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    # Measure time taken to process each chunk
    print("summation download chunk time: ", time.clock() - t)
    # Execute func1
    df1 = func1(df_chunk)
    # Execute func2
    df2 = func2(df1)
    # Append the processed chunk to the list so all chunks can be merged later
    df_list.append(df2)

# Merge all dataframes into one dataframe
df = pd.concat(df_list)
# Delete the dataframe list and the last chunk to release memory
del df_list
del df_chunk
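One thing I wasn't sure about with the header read at the top: I assume reading zero rows would give me the same column names without pulling the whole 2-3 GB file into memory first, along these lines (nrows=0 should just parse the header row):

# assumed alternative to the full read above, only to discover column names
df_cols = pd.read_csv('file.csv', nrows=0)
cols_to_keep = df_cols.columns[np.r_[1, 3, 8, 12, 23]]

But even if that's right, it doesn't explain why memory keeps building inside the chunk loop.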
I have tried using dask but get all kinds of errors with simple pandas methods.
import dask.dataframe as ddf
import numpy as np
import pandas as pd
import time
from dask import delayed

df_cols = pd.read_csv('file.csv')
df_cols = df_cols.iloc[:, np.r_[1:3, 8, 12, 23:25, 32, 42, 44, 46, 65:67, -5:0]]
cols_to_keep = df_cols.columns

PATH = '/Volume/Folder/Event/file.csv'
blocksize = 10000

df_list = []  # list to hold the batch dataframes
t = time.clock()

df_chunk = ddf.read_csv(PATH, blocksize=blocksize, usecols=cols_to_keep, parse_dates=['Time'])
print("summation download chunk time: ", time.clock() - t)
# Execute func1
df1 = func1(df_chunk)
# Execute func2
df2 = func2(df1)
# Append the result to the list to merge later
df_list.append(df2)

delayed_results = [delayed(df2) for df_chunk in df_list]
The line that threw the error:
df1 = func1(df_chunk)
and the statement inside it that actually fails:
name_unq = df['name'].dropna().unique().tolist()
AttributeError: 'Series' object has no attribute 'tolist'
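As far as I can tell this happens because dask evaluates lazily, so what comes back from unique() is still a dask Series, which has no tolist; presumably I'd have to call compute() first to get a concrete object, something like:

# assumed change inside func1 when df is a dask dataframe:
# compute() materialises the lazy result as a concrete pandas/numpy
# object, which does have tolist()
name_unq = df['name'].dropna().unique().compute().tolist()

But I'm not sure whether patching every method call like this is the right approach.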
I've passed it through numerous functions and it just continues to throw errors.