I've been dealing with larger and larger datasets. I love Python and pandas and don't want to move away from these tools. One of my dataframes takes 12 minutes to load; I want to speed this up, and using multiple processors seems like the most promising approach.
What is the fastest implementation for reading in tab-separated files that may be gzipped? I'm open to using Dask, but I just couldn't get it to work.
I couldn't adapt the Dask approach from this question: read process and concatenate pandas dataframe in parallel with dask. The example there uses a sample size that is too small for my rows, and I don't know how to generalize it.
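For reference, here's roughly the Dask variant I attempted (sketched from memory, on the same file as in the benchmark below). As far as I can tell, gzip isn't a splittable compression, so Dask needs blocksize=None and treats the whole .gz file as one partition, which may be why I saw no speedup:

import dask.dataframe as dd

# gzip can't be split into blocks, so the whole file becomes a single
# partition; blocksize=None is required for compressed inputs
ddf = dd.read_csv("./Data/counts/gt2500.counts.tsv.gz", sep="\t", compression="gzip", blocksize=None)
df = ddf.compute()  # collect into a regular pandas dataframe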
To build a faster TSV reader, I've tried the following method, adapted from this post: http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html
import subprocess
import multiprocessing
import numpy as np
import pandas as pd

def count_lines(path):
    # Count data lines, decompressing gzipped files on the fly;
    # running `wc -l` directly on a .gz file counts lines of the
    # compressed stream, not the actual records
    if path.endswith(".gz"):
        cmd = "gzip -cd {} | wc -l".format(path)
    else:
        cmd = "wc -l {}".format(path)
    return int(subprocess.check_output(cmd, shell=True).split()[0])

def _process_frame(df):
    # Placeholder worker: returns the chunk unchanged
    return df

def read_df_parallel(path, index_col=0, header=0, compression="infer", engine="c", n_jobs=-1):
    # Infer compression from the file extension
    if compression == "infer" and path.endswith(".gz"):
        compression = "gzip"
    # Resolve the number of workers
    if n_jobs == -1:
        n_jobs = multiprocessing.cpu_count()
    if n_jobs == 1:
        df = pd.read_table(path, sep="\t", index_col=np.arange(index_col + 1),
                           header=header, compression=compression, engine=engine)
    else:
        # Set up workers
        pool = multiprocessing.Pool(n_jobs)
        num_lines = count_lines(path)
        chunksize = num_lines // n_jobs
        reader = pd.read_table(path, sep="\t", index_col=np.arange(index_col + 1),
                               header=header, compression=compression, engine=engine,
                               chunksize=chunksize, iterator=True)
        # Send each parsed chunk to a worker, then reassemble in order
        df_list = list()
        for chunk in reader:
            df_tmp = pool.apply_async(_process_frame, [chunk])
            df_list.append(df_tmp)
        df = pd.concat(f.get() for f in df_list)
        pool.close()
        pool.join()
    return df
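Both code paths should return the same dataframe; a quick sanity check (sketch, using the same path as in the benchmark below) confirms the chunks are reassembled in order:

df_serial = read_df_parallel(path, n_jobs=1)
df_parallel = read_df_parallel(path, n_jobs=-1)
assert df_serial.equals(df_parallel)  # chunks come back in dispatch order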
Why is the parallel version slower?
What is the fastest implementation for reading a large gzipped (or not) table into a pandas dataframe?
Here are my timings:
%%time
path = "./Data/counts/gt2500.counts.tsv.gz"
%timeit read_df_parallel(path, n_jobs=1)
%timeit read_df_parallel(path, n_jobs=-1)
5.62 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # n_jobs=1
6.81 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # n_jobs=-1
CPU times: user 1min 30s, sys: 8.66 s, total: 1min 38s
Wall time: 1min 39s