
I have quite a problem with my dataset:

The (future) dataset is a pandas DataFrame that I loaded from a pickle file; the DataFrame itself behaves correctly. My code is:

from datasets import Dataset

dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/my_dataset", private=True)

Because I thought it was pandas' fault, I also tried:

dataset = Dataset.from_dict(df_sentences.to_dict(orient='list'))
dataset.push_to_hub("username/my_dataset", private=True)

and also loading it from a file.

The error I get is:

ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: string

My dataset is composed of 4 string columns and one int column, with around 3600 rows.
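
For reference, this is roughly how I checked the columns (the pickle file name here is just a placeholder):

import pandas as pd

df = pd.read_pickle("my_dataset.pkl")  # placeholder file name

# string columns typically show up as 'object' in pandas, ints as 'int64'
print(df.dtypes)

# no missing values either
print(df.isna().sum())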

Tsadoq

1 Answer


Without a reproducible sample it is hard to test, but one option is to convert the string columns to the string[pyarrow] dtype:

dtypes = {
    'column_a': 'string[pyarrow]',
    'col_b': 'string[pyarrow]',
    # ... remaining string columns
}

df_converted = df.astype(dtypes)
# proceed with the push
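
A minimal end-to-end sketch of that approach, converting every object column instead of listing them by hand (the repository name is the one from the question; preserve_index=False simply avoids pushing the pandas index as an extra column):

from datasets import Dataset

# convert all pandas 'object' columns to the Arrow-backed string dtype
str_cols = df.select_dtypes(include='object').columns
df_converted = df.astype({col: 'string[pyarrow]' for col in str_cols})

dataset = Dataset.from_pandas(df_converted, preserve_index=False)
dataset.push_to_hub("username/my_dataset", private=True)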

If possible, I would also upgrade to the latest versions, especially of pyarrow and pandas.
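
For example, something along these lines (assuming the Hugging Face datasets library is also installed, since push_to_hub comes from it):

pip install --upgrade pandas pyarrow datasets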

SultanOrazbayev
  • I had this problem with BigQuery; it was indeed the column types. – razimbres Jan 24 '23 at 12:47
  • @razimbres What was the problem in your case? I tried forcing the dtype in pandas, and I also checked whether there were any NaNs or similar, but found none. – Tsadoq Jan 24 '23 at 13:13
  • In my case the problem was the date. I applied 'pd.to_datetime(df.Date)' and it messed up the pyarrow config. When I removed that, it worked. – razimbres Jan 24 '23 at 13:22