
I need to use dask to load multiple parquet files with identical schema into a single dataframe. This works when they are all in the same directory, but not when they're in separate directories.

For example:

import fastparquet
pfile = fastparquet.ParquetFile(['data/data1.parq', 'data/data2.parq'])

works just fine, but if I copy data2.parq to a different directory, the following does not work:

pfile = fastparquet.ParquetFile(['data/data1.parq', 'data2/data2.parq'])

The traceback I get is the following:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-b3d381f14edc> in <module>()
----> 1 pfile = fastparquet.ParquetFile(['data/data1.parq', 'data2/data2.parq'])

~/anaconda/envs/hv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     82         if isinstance(fn, (tuple, list)):
     83             basepath, fmd = metadata_from_many(fn, verify_schema=verify,
---> 84                                                open_with=open_with)
     85             self.fn = sep.join([basepath, '_metadata'])  # effective file
     86             self.fmd = fmd

~/anaconda/envs/hv/lib/python3.6/site-packages/fastparquet/util.py in metadata_from_many(file_list, verify_schema, open_with)
    164     else:
    165         raise ValueError("Merge requires all PaquetFile instances or none")
--> 166     basepath, file_list = analyse_paths(file_list, sep)
    167 
    168     if verify_schema:

~/anaconda/envs/hv/lib/python3.6/site-packages/fastparquet/util.py in analyse_paths(file_list, sep)
    221     if len({tuple([p.split('=')[0] for p in parts[l:-1]])
    222             for parts in path_parts_list}) > 1:
--> 223         raise ValueError('Partitioning directories do not agree')
    224     for path_parts in path_parts_list:
    225         for path_part in path_parts[l:-1]:

ValueError: Partitioning directories do not agree

I get the same error when using dask.dataframe.read_parquet, which I assume uses the same ParquetFile object.
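
For reference, the equivalent Dask call (using the same illustrative paths as above) looks roughly like this:

import dask.dataframe as dd

# Fails with the same "Partitioning directories do not agree" error
df = dd.read_parquet(['data/data1.parq', 'data2/data2.parq'])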

How can I load multiple files from different directories? Putting all the files I need to load into the same directory is not an option.

Tim Morton

3 Answers


This does work in fastparquet on master, provided you use either absolute paths or explicit relative paths:

pfile = fastparquet.ParquetFile(['./data/data1.parq', './data2/data2.parq'])

The need for the leading ./ should be considered a bug - see the issue.
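
As a sketch of the absolute-path variant (reusing the file names from the question, and assuming a fastparquet build that includes the fix):

import os
import fastparquet

paths = ['data/data1.parq', 'data2/data2.parq']
# Convert to absolute paths so the files need not share a parent directory
pfile = fastparquet.ParquetFile([os.path.abspath(p) for p in paths])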

mdurant

The Dask API documentation states:

To read from multiple files you can pass a globstring or a list of paths [...].

The following solution also allows for different columns in the individual parquet files, which is not possible with this answer. It will be parallelized, because it is a native Dask command.

import dask.dataframe as dd

# The listed files may live in different directories
files = ['temp/part.0.parquet', 'temp2/part.1.parquet']
df = dd.read_parquet(files)

df.compute()
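
The documentation quoted above also allows a globstring; a minimal sketch, assuming both directories happen to match a common pattern (the pattern here is illustrative):

import dask.dataframe as dd

# Matches temp/part.0.parquet and temp2/part.1.parquet in this example layout
df = dd.read_parquet('temp*/part.*.parquet')
df.compute()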
user2672299
    While this code may resolve the OP's issue, it is best to include an explanation as to how your code addresses the OP's issue. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform, that differentiates it from forums. You can edit to add additional info &/or to supplement your explanations with source documentation. – ysf Jun 17 '20 at 18:48

A workaround would be to read each chunk separately and pass to dask.dataframe.from_delayed. This doesn't do exactly the same metadata handling that read_parquet does (below 'index' should be the index), but otherwise should work.

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

@delayed
def load_chunk(pth):
    # Read one parquet file into a pandas DataFrame; @delayed defers execution
    return ParquetFile(pth).to_pandas()

files = ['temp/part.0.parquet', 'temp2/part.1.parquet']
df = dd.from_delayed([load_chunk(f) for f in files])

df.compute()
Out[38]: 
   index  a
0      0  1
1      1  2
0      2  3
1      3  4
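
If you need the 'index' column to become the actual index (rather than the repeated 0, 1 positions above), one option is a sketch along these lines, assuming the column is named 'index' as in this example:

# Hypothetical follow-up: promote the 'index' column to the dataframe index
df = dd.from_delayed([load_chunk(f) for f in files]).set_index('index')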
galath
chrisb