
This mapping works when calling head on the first 100 rows:

ddf['val'] = ddf['myid'].map(val['val'], meta=pd.Series(float))
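
For context, a minimal sketch of the setup this assumes: ddf is a Dask DataFrame with an integer myid column read from parquet, and val is a plain pandas DataFrame indexed by myid. Only the names ddf, val, and myid come from the question; the paths and sample data are hypothetical:

import pandas as pd
import dask.dataframe as dd

# Hypothetical input: a directory of parquet files containing a 'myid' column
ddf = dd.read_parquet('input_parquet/')

# Hypothetical lookup table: pandas DataFrame indexed by the same ids as ddf['myid']
val = pd.DataFrame({'val': [0.1, 0.2, 0.3]},
                   index=pd.Index([1, 2, 3], name='myid', dtype='int64'))

# The mapping from the question: look up each 'myid' in val['val'];
# meta could equally be given as ('val', 'float64')
ddf['val'] = ddf['myid'].map(val['val'], meta=pd.Series(dtype=float))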

But when I try to save to parquet:

ddf.to_parquet('myfile.parquet', 
               compression='snappy', 
               write_index=False,
               compute_kwargs={'scheduler':'threads'}
              )

I am getting an error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

But when I check my index (after converting to a pandas Series), it is unique: val.index.duplicated().any() is False. The index also contains exactly the same set of values as the dataframe column it is being mapped onto, myid. There are no nulls, NaNs, or Nones in the index, and its dtype is int64.
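
For reference, the checks described above look roughly like this (names as in the question; computing the Dask column to pandas for the set comparison is an assumption about how it was done):

# Index sanity checks on the pandas lookup table
assert not val.index.duplicated().any()     # index is unique
assert not val.index.isnull().any()         # no nulls/NaNs/Nones in the index
assert val.index.dtype == 'int64'

# The ids being mapped match the lookup index exactly
myids = ddf['myid'].unique().compute()
assert set(myids) == set(val.index)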

Update: curiously, if I load each parquet file for the original ddf one at a time, this does not error. If I load more than one at a time, it errors out.
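
To make that concrete, this is roughly the difference (paths are hypothetical; reading one file typically yields a single partition, reading the directory yields several):

# Works: a single underlying parquet file, typically one partition
ddf = dd.read_parquet('input_parquet/part.0.parquet')

# Fails with InvalidIndexError at to_parquet time: several files, several partitions
ddf = dd.read_parquet('input_parquet/')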


1 Answer


This could be a bug in the fastparquet engine. I re-saved the underlying dataframes with pyarrow and called to_parquet with engine='pyarrow', and things are working now:

ddf.to_parquet('myfile.parquet', 
               engine='pyarrow',
               compression='snappy', 
               write_index=False,
               compute_kwargs={'scheduler':'threads'}
              )
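
As a quick follow-up check (not part of the original answer), the written dataset can be read back with the same engine to confirm the mapped column survived the round trip:

# Read the output back with pyarrow and inspect the mapped column
result = dd.read_parquet('myfile.parquet', engine='pyarrow')
print(result['val'].head())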
    If you can figure out the bug, feel free to post to dask or fastparquet, depending on where the bug lies. – mdurant Jul 10 '20 at 18:17
  • @mdurant Not sure where the bug ultimately lies, but created a toy example that was having issues with pyarrow as well. So, I created the issue under dask for the bug: https://github.com/dask/dask/issues/6394. – scottlittle Jul 10 '20 at 22:44