I am currently processing a bunch of CSV files and transforming them into Parquet, which I use with Hive by querying the files directly. I would like to switch my data processing over to Dask. The data I am reading has optional columns, some of which are Boolean types. I know pandas does not support optional bool types at this time, but is there any way to specify to either fastparquet or PyArrow what type I would like a field to be? I am fine with the data being a float64 in my DataFrame, but I can't have it as such in my Parquet store, because the existing files already use an optional Boolean type.
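For illustration, here is a minimal sketch of the dtype promotion I mean (the column name and file contents are made up):

```python
import io
import pandas as pd

# A CSV with an optional 0/1 flag column; the missing value forces
# pandas to read the column as float64 rather than bool.
csv = io.StringIO("id,flag\n1,1\n2,\n3,0\n")
df = pd.read_csv(csv)
print(df['flag'].dtype)  # float64
```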
1 Answer
You should try using the fastparquet engine with the following keyword argument: `object_encoding={'bool_col': 'bool'}`
Also, pandas now allows boolean columns with NaNs as an extension type, though it is not yet the default. That should work directly.
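For reference, a minimal sketch of that extension type (available since pandas 1.0); note the dtype string is `'boolean'`, not `'bool'`:

```python
import pandas as pd

# Nullable boolean extension dtype: holds True/False/<NA> without
# falling back to float64 or object.
s = pd.Series([True, False, None], dtype="boolean")
print(s.dtype)  # boolean (BooleanDtype)
```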
Example:

```python
import pandas as pd
import fastparquet as fp

# Object column of ints with None for the missing value.
df = pd.DataFrame({'a': [0, 1, None]}, dtype=object)

# Write the column to Parquet as an optional boolean.
fp.write('out.parq', df, object_encoding={'a': 'bool'})
# The same keyword with the column cast to float64 (NaN for the null).
fp.write('out.parq', df.astype(float), object_encoding={'a': 'bool'})
```
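Since the question is about Dask: `dd.to_parquet` forwards extra keyword arguments to the engine, so the same keyword can be passed through it. A minimal sketch, assuming an object column of bool/None (the output directory name is made up):

```python
import pandas as pd
import dask.dataframe as dd

# Object column holding True/False/None.
pdf = pd.DataFrame({'a': [True, None, False]})
ddf = dd.from_pandas(pdf, npartitions=1)

# Extra kwargs such as object_encoding are passed on to fastparquet.
ddf.to_parquet('out_dir', engine='fastparquet',
               object_encoding={'a': 'bool'})
```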

– mdurant
- Thank you, will give this a shot. – Eumcoz Jul 26 '19 at 19:40
- Hey, I'm running into an issue with this solution: I am getting `TypeError: expected list of bytes` when I add that argument to `to_parquet`. `testing` is a `float64` in `t_df`, but is actually an optional bool column. The call being made is: `dd.to_parquet(t_df, location, engine='fastparquet', partition_on=['day', 'id'], compression='snappy', write_index=False, object_encoding={'testing': 'bool'})`. Any clue what would be wrong? – Eumcoz Jul 29 '19 at 17:30
- As my example shows, it seems to work with a float column or with object columns of int/None (whether the null value is NaN or None). – mdurant Jul 29 '19 at 19:20
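Following up on that last comment, one hypothetical way around the `TypeError` would be to convert the float64 column to an object column of bool/None before writing (an untested sketch; `testing` is the column from the comment above):

```python
import pandas as pd

# Map 1.0/0.0/NaN to True/False/None so fastparquet sees an object
# column it can encode as an optional boolean.
df = pd.DataFrame({'testing': [1.0, float('nan'), 0.0]})
df['testing'] = df['testing'].map(
    lambda v: None if pd.isna(v) else bool(v))
```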