
I know there are a few questions on this topic, but I can't seem to get it going efficiently. I have large input datasets (2-3 GB) running on my machine, which has 8 GB of memory. I'm using Spyder with pandas 0.24.0 installed. The input file currently takes around an hour to generate an output file of around 10 MB.

I have since attempted to optimise the process by chunking the input file using the code below. Essentially, I chunk the input file into smaller segments, run each segment through some code, and export the smaller output. I then delete the chunked data to release memory, but memory usage still builds throughout the operation, and the run ends up taking a similar amount of time. I'm not sure what I'm doing wrong:

The memory usage details of the file are:

RangeIndex: 5471998 entries, 0 to 5471997
Data columns (total 17 columns):
col1                     object
col2                     object
col3                     object
....
dtypes: object(17)
memory usage: 5.6 GB

I subset the df by passing cols_to_keep to usecols. Because the headers are different for each file, I use location indexing to get the relevant headers.

import pandas as pd
import numpy as np
import time

# Because the column headers change from file to file, I use location indexing to read the col headers I need
df_cols = pd.read_csv('file.csv')
# Select the cols to be used
df_cols = df_cols.iloc[:, np.r_[1, 3, 8, 12, 23]]
# Keep the selected col headers
cols_to_keep = df_cols.columns

PATH = '/Volume/Folder/Event/file.csv'
chunksize = 10000

df_list = []  # list to hold the batch dataframes

t = time.clock()
for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    # Measure time taken to process each batch
    print("summation download chunk time: ", time.clock() - t)

    # Execute func1
    df1 = func1(df_chunk)

    # Execute func2
    df2 = func2(df1)

    # Append the chunk to list and merge all
    df_list.append(df2) 

# Merge all dataframes into one dataframe
df = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk
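
To make the intent concrete, the per-chunk "process, export, release" pattern I'm describing would look roughly like this (the output path is just a placeholder):

OUT_PATH = '/Volume/Folder/Event/output.csv'  # placeholder output location

first_chunk = True
for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    df_out = func2(func1(df_chunk))  # same processing as above
    # Append each chunk's result to disk instead of holding it in memory
    df_out.to_csv(OUT_PATH, mode='w' if first_chunk else 'a',
                  header=first_chunk, index=False)
    first_chunk = False
    del df_chunk, df_out  # drop references before the next iteration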

I have tried using dask, but I get all kinds of errors with simple pandas methods.

import dask.dataframe as ddf
from dask import delayed

df_cols = pd.read_csv('file.csv')
df_cols = df_cols.iloc[:, np.r_[1:3, 8, 12, 23:25, 32, 42, 44, 46, 65:67, -5:0]]
cols_to_keep = df_cols.columns

PATH = '/Volume/Folder/Event/file.csv'
blocksize = 10000

df_list = []  # list to hold the batch dataframes

df_chunk = ddf.read_csv(PATH, blocksize=blocksize, usecols=cols_to_keep, parse_dates=['Time'])
print("summation download chunk time: ", time.clock() - t)

# Execute func1
df1 = func1(df_chunk)

# Execute func2
df2 = func2(df1)

# Append the chunk to list and merge all
df_list.append(df2)

delayed_results = [delayed(df2) for df_chunk in df_list]

The line that threw the error:

df1 = func1(df_chunk)

  name_unq = df['name'].dropna().unique().tolist()

AttributeError: 'Series' object has no attribute 'tolist'

I've passed it through numerous functions and it just continues to throw errors.

  • And you don't want to use a database? SQLite perhaps? – sammywemmy May 18 '20 at 03:33
  • del does not free memory right away. It waits for the garbage collector to kick in to free memory if that memory is not in use. You can use gc.collect() to force the GC. – Rambarun Komaljeet May 18 '20 at 03:34
  • If row order is not important, why not make each for iteration run in a thread? – Rambarun Komaljeet May 18 '20 at 03:37
  • I'll have a look at the `gc.collect()` function. Haven't heard of that before. Could you explain the second comment a little more? How could I run it in a thread? – jonboy May 18 '20 at 04:05
  • A memory profiler is the tool that helps to isolate the source of unexpected usage. – brainchild May 18 '20 at 05:06
  • @sammywemmy, hoping to use memory on my machine if possible. – jonboy May 19 '20 at 00:47
  • Can you please provide the profiler of your choice that measures how long the program stays at a certain line? That way, we can identify the problematic lines. – MacOS May 26 '20 at 09:49
  • There are no problematic lines per se. It's just that `df_chunk` doesn't get deleted, so the segmented data that gets appended continues to build. It's a steady decline in performance as the storage increases. Does that make sense? – jonboy May 26 '20 at 09:52
  • What kind of calculations are you doing to the data? Is this something that a csv read loop could handle? The csv module reads one line at a time and is very light. Essentially, no memory is used other than the variable that stores each line. – Bobby Ocean May 27 '20 at 04:27
  • I'm counting the number of scatter points in segments of a predefined grid. The first few iterations are fine. Halfway through, though, it starts to die. – jonboy May 27 '20 at 09:49

1 Answer


To process your file, use dask instead; it is intended precisely for dealing with big (actually, very big) files.

It also has a read_csv function, with an additional blocksize parameter to define the size of a single chunk.

The result of read_csv is conceptually a single (dask) DataFrame, composed of a sequence of partitions, which are actually pandasonic DataFrames.

Then you can use the map_partitions function to apply your function to each partition. Because this function (the one passed to map_partitions) operates on a single partition (a pandasonic DataFrame), you can use any code which you previously tested in a Pandas environment.

The advantage of this solution is that processing of individual partitions is divided among available cores, whereas Pandas uses only a single core.

So your loop should be reworked to:

  • read_csv (dask version),
  • map_partitions generating partial results from each partition,
  • concat - to concatenate these partial results (at this point the result is still a dask DataFrame),
  • convert it to a pandasonic DataFrame,
  • make further use of it in Pandas environment.

To get more details, read about dask.
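
As a rough sketch (PATH, cols_to_keep, func1 and func2 are taken from the question and may need adjusting), the reworked flow could look like this:

import dask.dataframe as dd

def process(pdf):
    # pdf is one partition, i.e. an ordinary pandas DataFrame,
    # so code already tested in Pandas works here unchanged
    return func2(func1(pdf))

# blocksize is in bytes (a string like "64MB" also works), not in rows
ddf = dd.read_csv(PATH, blocksize="64MB", usecols=cols_to_keep)

# Apply the pandasonic processing lazily to every partition
result = ddf.map_partitions(process)

# compute() concatenates the partial results and returns
# an ordinary pandas DataFrame for further use in Pandas
df = result.compute()

If dask cannot infer the output structure of your function automatically, map_partitions also accepts a meta argument describing the expected columns.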

  • Get all kinds of errors when using `dask` to `read_csv`. It's like it has no pandas functionality at all. Don't know what I'm doing wrong. – jonboy May 19 '20 at 02:47
  • Take a look at https://stackoverflow.com/questions/59082723/reading-all-csv-files-at-particular-folder-merge-them-and-find-the-maximum-val. Some time ago I answered a question originally concerning *pandas*, but I proposed a solution based just on *dask* and the OP was very glad. The type of problem is different (aggregation of data from multiple input files), but at least it will show you how to deal with *dask* DataFrames, *read_csv* (in the *dask* version) and *map_partitions*. – Valdi_Bo May 19 '20 at 04:01
  • Yeah, it's a little different with the multiple df's. I don't get how I can use map_partitions. Thanks for your help though. – jonboy May 20 '20 at 00:04