
This mapping works when calling head on the first 100 rows:

ddf['val'] = ddf['myid'].map(val['val'], meta=pd.Series(float))
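
For context, a minimal sketch of the setup this assumes: ddf is a Dask DataFrame with an integer myid column read from parquet, and val is a plain pandas DataFrame indexed by myid. Only the names ddf, val, and myid come from the question; the paths and sample data are hypothetical:

import pandas as pd
import dask.dataframe as dd

# Hypothetical input: a directory of parquet files containing a 'myid' column
ddf = dd.read_parquet('input_parquet/')

# Hypothetical lookup table: pandas DataFrame indexed by the same ids as ddf['myid']
val = pd.DataFrame({'val': [0.1, 0.2, 0.3]},
                   index=pd.Index([1, 2, 3], name='myid', dtype='int64'))

# The mapping from the question: look up each 'myid' in val['val'];
# meta could equally be given as ('val', 'float64')
ddf['val'] = ddf['myid'].map(val['val'], meta=pd.Series(dtype=float))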

But when I try to save to parquet:

ddf.to_parquet('myfile.parquet', 
               compression='snappy', 
               write_index=False,
               compute_kwargs={'scheduler':'threads'}
              )

I am getting an error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

But when I check my index (after converting to a pandas Series), it is unique: val.index.duplicated().any() is False. The index also contains exactly the same set of values as the dataframe column it is being mapped onto, myid. There are no nulls, NaNs, or Nones in the index, and its dtype is int64.
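
For reference, the checks described above look roughly like this (names as in the question; computing the Dask column to pandas for the set comparison is an assumption about how it was done):

# Index sanity checks on the pandas lookup table
assert not val.index.duplicated().any()     # index is unique
assert not val.index.isnull().any()         # no nulls/NaNs/Nones in the index
assert val.index.dtype == 'int64'

# The ids being mapped match the lookup index exactly
myids = ddf['myid'].unique().compute()
assert set(myids) == set(val.index)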

Update: curiously, if I load each parquet file for the original ddf one at a time, this does not error. If I load more than one at a time, it errors out.
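
To make that concrete, this is roughly the difference (paths are hypothetical; reading one file typically yields a single partition, reading the directory yields several):

# Works: a single underlying parquet file, typically one partition
ddf = dd.read_parquet('input_parquet/part.0.parquet')

# Fails with InvalidIndexError at to_parquet time: several files, several partitions
ddf = dd.read_parquet('input_parquet/')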


1 Answer


This could be a bug in the fastparquet engine. I re-saved the underlying dataframes with pyarrow and called to_parquet with engine='pyarrow', and things are working now:

ddf.to_parquet('myfile.parquet', 
               engine='pyarrow',
               compression='snappy', 
               write_index=False,
               compute_kwargs={'scheduler':'threads'}
              )
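
As a quick follow-up check (not part of the original answer), the written dataset can be read back with the same engine to confirm the mapped column survived the round trip:

# Read the output back with pyarrow and inspect the mapped column
result = dd.read_parquet('myfile.parquet', engine='pyarrow')
print(result['val'].head())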
    If you can figure out the bug, feel free to post to dask or fastparquet, depending on where the bug lies. – mdurant Jul 10 '20 at 18:17
  • @mdurant Not sure where the bug ultimately lies, but created a toy example that was having issues with pyarrow as well. So, I created the issue under dask for the bug: https://github.com/dask/dask/issues/6394. – scottlittle Jul 10 '20 at 22:44