
This is a question related to this post.

I am experimenting with Dask and Parquet files. I loaded the New York parking violations data I downloaded here.

I read the data files, find the common columns, apply data types, and afterwards save everything as a Parquet collection:

from dask import dataframe as dd
from dask.diagnostics import ProgressBar
import numpy as np

base_url = 'origin/nyc-parking-tickets/'

fy14 = dd.read_csv(base_url + '*2014_.csv')
fy15 = dd.read_csv(base_url + '*2015.csv')
fy16 = dd.read_csv(base_url + '*2016.csv')
fy17 = dd.read_csv(base_url + '*2017.csv')

data = [fy14, fy15, fy16, fy17]
col_set = [set(d.columns) for d in data]
common_columns = list(set.intersection(*col_set))

# Set proper column types
dtype_tuples = [(x, np.str) for x in common_columns]
dtypes = dict(dtype_tuples)

floats = ['Feet From Curb', 'Issuer Code', 'Issuer Precinct', 'Law Section', 'Vehicle Year', 'Violation Precinct']
ints32 = ['Street Code1', 'Street Code2', 'Street Code3', 'Summons Number']
ints16 = ['Violation Code']

for item in floats: dtypes[item] = np.float32
for item in ints32: dtypes[item] = np.int32
for item in ints16: dtypes[item] = np.int16

# Read Data
data = dd.read_csv(base_url + '*.csv', dtype=dtypes, usecols=common_columns) # usecols not in Dask documentation, but from pandas

# Write data as parquet
target_url = 'target/nyc-parking-tickets-pq/'
with ProgressBar():
    data.to_parquet(target_url)

When I attempt to reload the data

data2 = dd.read_parquet(target_url, engine='pyarrow')

I get a ValueError, namely that some of the partitions have a different schema. Looking at the output, I can see that in one partition the 'Violation Legal Code' column is interpreted as null, presumably because the data is too sparse for type inference.
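
One can confirm this per part file with something like the following (a rough check; it assumes pyarrow is installed and Dask's default part.N.parquet file naming):

import glob
import pyarrow.parquet as pq

# Print the type recorded for the sparse column in every written part;
# in one of the parts it shows up as null rather than string
for path in sorted(glob.glob(target_url + 'part.*.parquet')):
    print(path, pq.read_schema(path).field('Violation Legal Code').type)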

In the post with the original question, two solutions are suggested: the first is to enter dummy values, the other is to supply column types when loading the data. I would like to do the latter, but I am stuck. With dd.read_csv I can pass the dtype argument, for which I simply pass the dtypes dictionary defined above; dd.read_parquet, however, does not accept that keyword. The documentation seems to suggest that categories takes over that role, but even when passing categories=dtypes I still get the same error.
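
For completeness, the variant with categories looks roughly like this and fails in the same way:

# raises the same schema-mismatch ValueError as the plain call above
data2 = dd.read_parquet(target_url, engine='pyarrow', categories=dtypes)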

How can I pass type specifications in dask.dataframe.read_parquet?

AJK

2 Answers


You cannot pass dtypes to read_parquet because Parquet files know their own dtypes (with CSV it is ambiguous). Dask DataFrame expects that all files of a dataset have the same schema; as of 2019-03-26, there is no support for loading data with mixed schemas.

That being said, you could do this yourself using something like Dask Delayed: do whatever manipulations you need on a file-by-file basis, and then convert those results into a Dask DataFrame with dd.from_delayed. More information about that is here.
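
A rough sketch of that approach (untested; it assumes pandas can read each part file and that you want the problematic column as strings):

import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def load_part(path):
    df = pd.read_parquet(path)
    # force the sparse column to one consistent dtype in every part
    df['Violation Legal Code'] = df['Violation Legal Code'].astype(str)
    return df

parts = sorted(glob.glob('target/nyc-parking-tickets-pq/part.*.parquet'))
data = dd.from_delayed([load_part(p) for p in parts])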

MRocklin
  • Thank you very much. So the .to_parquet method would not impart that information into the file metadata? – AJK Mar 27 '19 at 13:54
  • When I run `data = dd.read_parquet(target_url)` I do get an error: `Scheme in [...]\part.1.parquet was different`. I did not run `compute` or `head` to actively load the file, so for some reason parquet did not save all its parts with the same schema. But I cannot do any manipulations on the data partitions, as the exception prevents that. Or do you mean I should just run something like `delayed(dd.read_parquet)`? – AJK Mar 29 '19 at 10:02
  • Would be helpful to see an actual example. – Jonathan Nov 04 '20 at 22:07

It seems the problem was with the parquet engine. When I changed the code to

data.to_parquet(target_url, engine='fastparquet')

and

data2 = dd.read_parquet(target_url, engine='fastparquet')

the writing and loading worked fine.

AJK