I am currently processing a bunch of CSV files and transforming them into Parquet, which I use with Hive by querying the files directly. I would like to switch my data processing over to Dask. The data I am reading has optional columns, some of which are Boolean types. I know pandas does not support optional bool types at this time, but is there any way to specify to either fastparquet or PyArrow what type I would like a field to be? I am fine with the data being a float64 in my DataFrame, but I can't have it as such in my Parquet store, because the existing files already use an optional Boolean type.
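For illustration, here is a minimal sketch of the dtype promotion I mean (the column name and file contents are made up):

```python
import io
import pandas as pd

# A CSV with an optional 0/1 flag column; the missing value forces
# pandas to read the column as float64 rather than bool.
csv = io.StringIO("id,flag\n1,1\n2,\n3,0\n")
df = pd.read_csv(csv)
print(df['flag'].dtype)  # float64
```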
1 Answer
You should try using the fastparquet engine with the following keyword argument: `object_encoding={'bool_col': 'bool'}`
Also, pandas now allows boolean columns with NaNs as an extension type, though it is not yet the default. That should work directly.
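For reference, a minimal sketch of that extension type (available since pandas 1.0); note the dtype string is `'boolean'`, not `'bool'`:

```python
import pandas as pd

# Nullable boolean extension dtype: holds True/False/<NA> without
# falling back to float64 or object.
s = pd.Series([True, False, None], dtype="boolean")
print(s.dtype)  # boolean (BooleanDtype)
```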
Example:

```python
import pandas as pd
import fastparquet as fp

# Object column of ints with None for the missing value.
df = pd.DataFrame({'a': [0, 1, None]}, dtype=object)

# Write the column to Parquet as an optional boolean.
fp.write('out.parq', df, object_encoding={'a': 'bool'})
# The same keyword with the column cast to float64 (NaN for the null).
fp.write('out.parq', df.astype(float), object_encoding={'a': 'bool'})
```
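Since the question is about Dask: `dd.to_parquet` forwards extra keyword arguments to the engine, so the same keyword can be passed through it. A minimal sketch, assuming an object column of bool/None (the output directory name is made up):

```python
import pandas as pd
import dask.dataframe as dd

# Object column holding True/False/None.
pdf = pd.DataFrame({'a': [True, None, False]})
ddf = dd.from_pandas(pdf, npartitions=1)

# Extra kwargs such as object_encoding are passed on to fastparquet.
ddf.to_parquet('out_dir', engine='fastparquet',
               object_encoding={'a': 'bool'})
```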

– mdurant
- Thank you, will give this a shot. – Eumcoz Jul 26 '19 at 19:40
- Hey, I'm running into an issue with this solution: I am getting `TypeError: expected list of bytes` when I add that argument to `to_parquet`. `testing` is a `float64` in `t_df`, but is actually an optional bool column. The call being made is: `dd.to_parquet(t_df, location, engine='fastparquet', partition_on=['day', 'id'], compression='snappy', write_index=False, object_encoding={'testing': 'bool'})`. Any clue what would be wrong? – Eumcoz Jul 29 '19 at 17:30
- As my example shows, it seems to work with a float column or with object columns of int/None (whether the null value is NaN or None). – mdurant Jul 29 '19 at 19:20
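Following up on that last comment, one hypothetical way around the `TypeError` would be to convert the float64 column to an object column of bool/None before writing (an untested sketch; `testing` is the column from the comment above):

```python
import pandas as pd

# Map 1.0/0.0/NaN to True/False/None so fastparquet sees an object
# column it can encode as an optional boolean.
df = pd.DataFrame({'testing': [1.0, float('nan'), 0.0]})
df['testing'] = df['testing'].map(
    lambda v: None if pd.isna(v) else bool(v))
```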