I recently downloaded a dataset from HuggingFace.

I used datasets.load_dataset(), which gives me a DatasetDict whose splits are Dataset objects backed by Apache Arrow tables. I'm having trouble exporting the data to a DataFrame so that I can work with it in pandas.

The structure of the dataset object is this:

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
})

The output of dataset['train'].features is:

{'review_id': Value(dtype='string', id=None),
 'product_id': Value(dtype='string', id=None),
 'reviewer_id': Value(dtype='string', id=None),
 'stars': Value(dtype='int32', id=None),
 'review_body': Value(dtype='string', id=None),
 'review_title': Value(dtype='string', id=None),
 'language': Value(dtype='string', id=None),
 'product_category': Value(dtype='string', id=None)}

I would like to export the train, test, and validation splits into three different DataFrames.

Thank you!

1 Answer

You can use the to_pandas() method that the HuggingFace datasets library provides on each Dataset. Call it once per split:

df_train = dataset['train'].to_pandas()
df_test = dataset['test'].to_pandas()
df_val = dataset['validation'].to_pandas()

P. Shroff