I'm currently loading in my data with one single dataset class. Within the dataset, I split the train, test, and validation data separately. For example:

class Data():
    def __init__(self):
        self.load()

    def load(self):
        with open(file=file_name, mode='r') as f:
            self.data = f.readlines()

        self.train = self.data[:checkpoint]
        self.valid = self.data[checkpoint:halfway]
        self.test = self.data[halfway:]

Many of the details have been omitted for the sake of readability. Basically, I read in one big dataset and make the splits manually.

My question is: how should I override the __len__ method when my train, valid, and test data all have different lengths?

I want to keep the split data in one single class and also create a separate DataLoader for each split, so something like:

def __len__(self):
    return len(self.train)

wouldn't be appropriate for self.test and self.valid.

Perhaps I'm fundamentally misunderstanding the Dataloader, but how should I approach this issue? Thanks in advance.

Sean

1 Answer

I think the simplest way to get the length of each split is to call len() on the split itself:

# Number of training points
len(self.train)

# Number of testing points
len(self.test)

# Number of validation points
len(self.valid)

Alternatively, to get the split lengths from outside the class, for a particular instance of your object:

data = Data()
print(len(data.train))
print(len(data.test))
print(len(data.valid))

__len__ defines what len() returns for an object, so you decide how its elements are counted. I would therefore have it return the size of the whole dataset, and use the calls above to get the counts per split:

def __len__(self):
    return len(self.data)
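
If each split also needs to report its own length to a DataLoader, one option is to give every split a small wrapper object with its own __len__ and __getitem__. The following is only a sketch under assumptions: SplitView and the split() method are hypothetical names that are not part of the question's code, and an in-memory list stands in for the file that the original class reads.

```python
class SplitView:
    """Hypothetical wrapper exposing one split as a sized, indexable object."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        # Length of this split only, not of the whole dataset.
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


class Data:
    """Holds the full dataset and its three splits, as in the question."""

    SPLITS = ("train", "valid", "test")

    def __init__(self, lines, checkpoint, halfway):
        # Assumption: lines are passed in directly instead of read from a file.
        self.data = list(lines)
        self.train = self.data[:checkpoint]
        self.valid = self.data[checkpoint:halfway]
        self.test = self.data[halfway:]

    def __len__(self):
        # Size of the whole dataset, as suggested in this answer.
        return len(self.data)

    def split(self, name):
        # Hypothetical helper: return a per-split view by name.
        if name not in self.SPLITS:
            raise ValueError(f"unknown split: {name!r}")
        return SplitView(getattr(self, name))


data = Data([f"line {i}" for i in range(10)], checkpoint=6, halfway=8)
train_view = data.split("train")
valid_view = data.split("valid")
test_view = data.split("test")
print(len(data), len(train_view), len(valid_view), len(test_view))
```

With PyTorch installed, each view could then presumably be passed to its own loader, for example torch.utils.data.DataLoader(train_view, batch_size=2), since a map-style dataset only needs __getitem__ and __len__.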
Giorgos Myrianthous
  • Wouldn't this cause issues when creating my Dataloader objects for each setting? If I define `__len__` in the way that you proposed, then I could simply do `return len(self.data)` rather than adding the three, couldn't I? Perhaps I need to look into it in more depth, but I've never seen an explicit call to the `__len__` method when declaring Dataloader objects. – Sean Dec 08 '19 at 22:31
  • @Seankala I didn't see that you initialise `self.data` too. I've updated my answer. – Giorgos Myrianthous Dec 08 '19 at 22:37