How to convert pandas data frame to Huggingface Dataset grouped by column value?

Question

I have the following data frame df

import pandas as pd
from datasets import Dataset

data = [[1, 'Jack', 'A'], [1, 'Jamie', 'A'], [1, 'Mo', 'B'], [1, 'Tammy', 'A'], [2, 'JJ', 'A'], [2, 'Perry', 'C']]
df = pd.DataFrame(data, columns=['id', 'name', 'class'])
> df
   id   name class
0   1   Jack     A
1   1  Jamie     A
2   1     Mo     B
3   1  Tammy     A
4   2     JJ     A
5   2  Perry     C

I would like to convert it to a Dataset object that has 2 rows, one per id. The desired output is

> myDataset
Dataset({
    features: ['id', 'name', 'class'],
    num_rows: 2
})

where

> myDataset[0:2]
{'id': ['1', '2'], 'name': [['Jack', 'Jamie', 'Mo', 'Tammy'],['JJ', 'Perry']], 'class': [['A', 'A', 'B', 'A'], ['A', 'C']]}

Based on the documentation here, I tried the following, but that gave me a Dataset with 6 rows instead of one with 2 rows grouped by the id column:

myDataset = Dataset.from_pandas(df) 
> myDataset
Dataset({
    features: ['id', 'name', 'class'],
    num_rows: 6
})
> myDataset[0:2]
{'id': [1, 1], 'name': ['Jack', 'Jamie'], 'class': ['A', 'A']}

score 0 · Accepted Answer · answered May 07 '23 at 07:06

You can group the original DataFrame by id and aggregate the remaining columns into lists before converting it:

myDataset = Dataset.from_pandas(df.groupby('id', as_index=False).agg(list))

Ynjxsjmh