
I've been dealing with larger and larger datasets. I love Python and pandas and do not want to move away from these tools. One of my DataFrames takes 12 minutes to load; I want to speed this up, and it seems like utilizing multiple processors would be the best way.

What is the fastest implementation for reading in tab-separated files that are possibly gzipped? I'm open to using Dask but I just couldn't get it to work.

I couldn't get the Dask approach from this question to work because its sample size isn't large enough for my rows (and I don't know how to generalize it): read process and concatenate pandas dataframe in parallel with dask
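
Here is roughly what I attempted with Dask (a minimal sketch, not the exact code from the linked question; the path is the same file used in the timing cell below):

import dask.dataframe as dd

path = "./Data/counts/gt2500.counts.tsv.gz"

# gzip is not a splittable format, so Dask cannot chunk the file:
# blocksize must be None and the whole read happens in a single task.
ddf = dd.read_csv(path, sep="\t", compression="gzip", blocksize=None)

# For an uncompressed .tsv, Dask can split the file and parse blocks in parallel, e.g.:
# ddf = dd.read_csv("./Data/counts/gt2500.counts.tsv", sep="\t", blocksize=64 * 1024 ** 2)

df = ddf.compute()  # collect into a single in-memory pandas DataFrame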

I've tried the following method to make a faster TSV reader, based on http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html:

import multiprocessing
import subprocess

import numpy as np
import pandas as pd

def count_lines(path):
    # Count lines via `wc -l` (note: for a .gz file this counts compressed lines)
    return int(subprocess.check_output('wc -l {}'.format(path), shell=True).split()[0])

def _process_frame(df):
    # Worker function: currently a no-op that returns the chunk unchanged
    return df

def read_df_parallel(path, index_col=0, header=0, compression="infer", engine="c", n_jobs=-1):
    # Infer compression from the file extension
    if compression == "infer":
        if path.endswith(".gz"):
            compression = "gzip"
    # Use all available cores when n_jobs == -1
    if n_jobs == -1:
        n_jobs = multiprocessing.cpu_count()
    if n_jobs == 1:
        # Serial read
        df = pd.read_table(path, sep="\t", index_col=np.arange(index_col+1), header=header, compression=compression, engine=engine)
    else:
        # Set up workers
        pool = multiprocessing.Pool(n_jobs)
        num_lines = count_lines(path)
        chunksize = num_lines // n_jobs
        reader = pd.read_table(path, sep="\t", index_col=np.arange(index_col+1), header=header, compression=compression, engine=engine, chunksize=chunksize, iterator=True)
        # Send each chunk to a worker, then collect and concatenate the results
        df_list = list()
        for chunk in reader:
            df_tmp = pool.apply_async(_process_frame, [chunk])
            df_list.append(df_tmp)
        df = pd.concat(f.get() for f in df_list)
        pool.close()
        pool.join()
    return df

Why is the parallel version slower?

What is the fastest implementation for reading in a large gzipped (or not) table into a pandas dataframe?

%%time
path = "./Data/counts/gt2500.counts.tsv.gz"
%timeit read_df_parallel(path, n_jobs=1)
%timeit read_df_parallel(path, n_jobs=-1)

5.62 s ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.81 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CPU times: user 1min 30s, sys: 8.66 s, total: 1min 38s
Wall time: 1min 39s
– O.rka
  • Probably because `multiprocessing` has a lot of overhead, and `pd.concat` is similarly an expensive operation. Just to start. – juanpa.arrivillaga Jun 23 '17 at 20:53
  • Would it be better if I used dask.delayed or joblib.delayed? – O.rka Jun 23 '17 at 20:56
  • Actually, looking closer, your parallelization pretty much does nothing. The function you pass to `apply_async` just does `return df`. So, the entire data-set is read serially inside `read_df_parallel`, passed as chunks to the `multiprocessing` stuff, then recombined. You'd be better off skipping the multiprocessing and doing `df_list = list(reader)`. Or better yet, as your results prove, simply using `df = pd.read_csv(...)` (see the first sketch after these comments). – juanpa.arrivillaga Jun 23 '17 at 21:00
  • 12 minutes to load sounds like a big data frame. Why not save yourself a headache and spin up an EC2 instance? – gold_cy Jun 23 '17 at 21:01
  • It looks like you're reading everything off the same hard drive. Generally, when reading in data your bottleneck is disk i/o, not cpu performance. Running things in parallel doesn't help i/o performance from a single hard drive, and may even hurt it, as you have different processes competing for reads at the same time. Add in the overhead for multiprocessing in general, and that would explain why you're seeing worse performance. – root Jun 23 '17 at 22:18
  • I'll try this: https://statcompute.wordpress.com/2015/12/27/import-csv-by-chunk-simultaneously-with-ipython-parallel/ – O.rka Jun 23 '17 at 22:49
  • I found this too which could be useful. http://www.wise.io/tech/paratext – O.rka Jun 26 '17 at 18:09
  • Try feather :). Super fast, just make sure there are numerical indices: https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/ – O.rka Aug 02 '17 at 20:58
  • you can do `reset_index` before saving and then `set_index` to get it back if you don't have numerical indices (see the Feather sketch after these comments) – O.rka Aug 02 '17 at 22:37
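
As suggested in the comments, the serial alternatives look roughly like this (a minimal sketch; the chunksize is arbitrary and the path is the one from the timing cell above):

import pandas as pd

path = "./Data/counts/gt2500.counts.tsv.gz"

# Equivalent to the "parallel" branch without the multiprocessing machinery:
# the chunks are still parsed serially, just concatenated directly.
reader = pd.read_csv(path, sep="\t", index_col=0, compression="gzip", chunksize=100000)
df = pd.concat(list(reader))

# Or, simpler still, a single call and let pandas infer gzip from the .gz suffix:
df = pd.read_csv(path, sep="\t", index_col=0, compression="infer")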
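
And a minimal sketch of the Feather round-trip from the last two comments, assuming `df` is the DataFrame loaded above (the output path is made up, and `pyarrow` or `feather-format` must be installed):

import pandas as pd

feather_path = "./Data/counts/gt2500.counts.feather"  # hypothetical output path

# Feather stores columns only, so move a labeled index into a column before writing...
df.reset_index().to_feather(feather_path)

# ...and restore it afterwards; reset_index() names an unnamed index "index".
df = pd.read_feather(feather_path).set_index("index")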

0 Answers