9

I have a dataframe with two columns (review and sentiment). I am using PyTorch and the torchtext library to preprocess the data. Is it possible to use a dataframe as the data source in torchtext? I am looking for something similar to the following, but reading from a dataframe rather than from a file path:

    data.TabularDataset.splits(path='./data')

I have already performed some operations on the data (cleaning, converting to the required format), and the final data is in a dataframe.

If not torchtext, what other package would you suggest for preprocessing text data stored in a dataframe? I could not find anything online. Any help would be great.

Newbie

2 Answers

15

You can adapt the Dataset and Example classes from torchtext.data:

    # Note: in torchtext 0.9 to 0.11 these classes live in torchtext.legacy.data;
    # they were removed in torchtext 0.12.
    from torchtext.data import Field, Dataset, Example
    import pandas as pd


    class DataFrameDataset(Dataset):
        """Class for using pandas DataFrames as a datasource"""

        def __init__(self, examples, fields, filter_pred=None):
            """
            Create a dataset from a pandas dataframe of examples and Fields
            Arguments:
                examples pd.DataFrame: DataFrame of examples
                fields {str: Field}: The Fields to use in this tuple. The
                    string is a field name, and the Field is the associated field.
                filter_pred (callable or None): use only examples for which
                    filter_pred(example) is true, or use all examples if None.
                    Default is None
            """
            # Convert each row (a pandas Series) into an Example
            self.examples = examples.apply(SeriesExample.fromSeries, args=(fields,), axis=1).tolist()
            if filter_pred is not None:
                # filter() returns an iterator in Python 3, so materialize it
                self.examples = list(filter(filter_pred, self.examples))
            self.fields = dict(fields)
            # Unpack field tuples
            for n, f in list(self.fields.items()):
                if isinstance(n, tuple):
                    self.fields.update(zip(n, f))
                    del self.fields[n]


    class SeriesExample(Example):
        """Class to convert a pandas Series to an Example"""

        @classmethod
        def fromSeries(cls, data, fields):
            return cls.fromdict(data.to_dict(), fields)

        @classmethod
        def fromdict(cls, data, fields):
            ex = cls()

            for key, field in fields.items():
                if key not in data:
                    raise ValueError("Specified key {} was not found in "
                                     "the input data".format(key))
                if field is not None:
                    # Apply the Field's preprocessing (e.g. tokenization)
                    setattr(ex, key, field.preprocess(data[key]))
                else:
                    # No Field given: keep the raw value
                    setattr(ex, key, data[key])
            # Return the Example once all fields have been processed
            return ex

Then define the fields using torchtext.data fields. For example:

    import torch
    from torchtext import data

    TEXT = data.Field(tokenize='spacy')
    LABEL = data.LabelField(dtype=torch.float)
    # NOTE: `train` is not defined in this answer (see the comments below)
    TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
    LABEL.build_vocab(train)
    # Map each dataframe column name to the Field that processes it
    fields = { 'sentiment' : LABEL, 'review' : TEXT }

Then load the dataframes:

    train_ds = DataFrameDataset(train_df, fields)
    valid_ds = DataFrameDataset(valid_df, fields)
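
To make the role of `fields` more concrete, here is a minimal sketch of the same row-to-example conversion written without torchtext or pandas. Everything in it is an assumption made for illustration: `rows` stands in for what `DataFrame.to_dict("records")` would produce, the lambda stands in for a Field's `preprocess` method, and `example_from_dict` is a hypothetical helper that only mirrors the `fromdict` logic above. It is not the torchtext implementation.

```python
from types import SimpleNamespace

# Hypothetical rows, shaped like the output of DataFrame.to_dict("records").
rows = [
    {"review": "A great movie", "sentiment": 1},
    {"review": "Dull and far too long", "sentiment": 0},
]

# Stand-ins for torchtext Fields: each column name maps to a preprocessing
# callable, or to None to copy the raw value through unchanged.
fields = {
    "review": lambda text: text.lower().split(),
    "sentiment": None,
}


def example_from_dict(row, fields):
    """Build one example with an attribute per field name (hypothetical helper)."""
    example = SimpleNamespace()
    for key, preprocess in fields.items():
        if key not in row:
            raise ValueError(f"Specified key {key} was not found in the input data")
        value = row[key]
        # Apply the preprocessing step if one was given, else keep the raw value.
        setattr(example, key, preprocess(value) if preprocess is not None else value)
    return example


examples = [example_from_dict(row, fields) for row in rows]
print(examples[0].review)     # ['a', 'great', 'movie']
print(examples[1].sentiment)  # 0
```

The keys of `fields` must match the dataframe column names exactly, which is why a missing column raises a `ValueError` rather than being skipped silently.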
stackoverflowuser2010
Geoffrey Negiar
  • I have tried implementing this, but it is not clear what "fields" should consist of or how it is constructed. In the question's case there are two "keys" in the dataframe: review and sentiment. Any further elaboration would be highly appreciated – NicolaiF Jan 15 '19 at 12:34
  • Figured it out: it should be a dictionary in which each key is a series name and each value says what to do with that series: `fields = { 'sentiment' : LABEL, 'review' : TEXT }`, where LABEL and TEXT are torchtext data fields such as: `TEXT = data.Field(tokenize='spacy')`, `LABEL = data.LabelField(dtype=torch.float)`, `TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")`, `LABEL.build_vocab(train)` – NicolaiF Jan 16 '19 at 09:42
  • +1 because this implementation follows the original implementation logic and style https://pytorch.org/text/_modules/torchtext/data/example.html#Example.fromlist – Jason Angel Jun 21 '20 at 19:02
  • @NicolaiF, I just edited the answer to reflect your comment and make it easier for the readers – Geoffrey Negiar Jun 24 '20 at 13:13
  • @GeoffreyNegiar: The last statement `return ex` doesn't seem indented properly. It's not clear to me at what indentation level it should be. – stackoverflowuser2010 Jul 08 '20 at 19:46
  • @NicolaiF: What does the variable 'train' refer to in the lines `TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")` and `LABEL.build_vocab(train)`? – John Hawkins Oct 10 '20 at 01:48
  • @stackoverflowuser2010 I think it is: the Example is returned once all the fields are processed. – Geoffrey Negiar Oct 10 '20 at 08:32
  • @JohnHawkins I'm not sure I remember what it was... Possibly train_df. – Geoffrey Negiar Oct 10 '20 at 08:33
  • How can I sort the values inside the fields object? When I run `LABEL.vocab.stoi` I receive back `defaultdict(None, {0: 0, 2: 1, 1: 2})` while it should return `defaultdict(None, {0: 0, 1: 1, 2: 2})` – NikSp Feb 01 '22 at 12:10
0

Thanks Geoffrey.

From looking at the source code for torchtext.data.field:

https://pytorch.org/text/_modules/torchtext/data/field.html

It looks like the 'train' parameter needs to be either a Dataset or some iterable source of text data. But given that we haven't created a dataset at this point, I am guessing you passed in just the column of text from the dataframe.
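
As a rough illustration of why a plain iterable of tokenized texts is enough to build a vocabulary, here is a minimal standard-library sketch. It is not torchtext's implementation: `build_vocab` below is a hypothetical stand-alone function, `tokenized_reviews` is made-up data, and the special tokens and alphabetical tie-breaking are choices made for this sketch only.

```python
from collections import Counter

# Hypothetical tokenized reviews, e.g. a text column after tokenization.
tokenized_reviews = [
    ["a", "great", "movie"],
    ["a", "dull", "movie"],
    ["a", "classic"],
]


def build_vocab(token_lists, max_size=None, specials=("<unk>", "<pad>")):
    """Count tokens and assign indices, most frequent first (sketch only)."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_size is not None:
        ordered = ordered[:max_size]
    # Special tokens take the first indices, then the counted tokens.
    itos = list(specials) + [token for token, _ in ordered]
    return {token: index for index, token in enumerate(itos)}


stoi = build_vocab(tokenized_reviews, max_size=25000)
print(stoi["a"])      # 2
print(stoi["movie"])  # 3
```

Because indices are assigned in frequency order rather than in sorted-key order, a mapping built this way will not generally preserve the numeric order of its keys.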

John Hawkins