
I am currently working on my master's thesis on the use of deep learning models in recommender systems. I am using the ml-20m dataset, but for testing and debugging purposes I am scaling down to ml-100k. I have implemented a restricted Boltzmann machine (RBM) almost from scratch in PyTorch.

The architecture of the network is as follows:

The visible layer is an N x 10 matrix of one-hot encoded movie ratings, where N is the number of movies in the dataset. Each user's ratings form one training example, and there are 10 possible rating values for every movie.
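
For clarity, here is a minimal sketch of how a single user's ratings are turned into such an N x 10 matrix. The helper below is illustrative, not my actual preprocessing code; it assumes a half-star rating scale mapped to indices 0-9:

import torch

def encode_user(ratings: dict, n_movies: int) -> torch.Tensor:
    # illustrative encoding: rating r in {0.5, 1.0, ..., 5.0} -> one-hot index int(2*r) - 1;
    # movies the user has not rated stay as all-zero rows
    v = torch.zeros(n_movies, 10)
    for movie_idx, rating in ratings.items():
        v[movie_idx, int(rating * 2) - 1] = 1.0
    return v

# e.g. a user who rated movie 0 with 4.0 stars and movie 2 with 2.5 stars
v = encode_user({0: 4.0, 2: 2.5}, n_movies=5)   # shape (5, 10)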

The hidden layer consists of M binary hidden units.

The activation function of the hidden units is a sigmoid, while the activation of the visible units is a 10-way softmax.
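
The `ratings_softmax` helper used in `backward_pass` below is, in essence, a softmax over the 10 rating values of each movie. A minimal sketch of that idea (assuming it simply normalises the last dimension):

import torch

def ratings_softmax(a: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # turn the 10 rating activations of each movie into a probability distribution
    return torch.softmax(a, dim=dim)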

The loss function I am using is the reconstruction error; in my case, RMSE.
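
The error is computed by a `batch_error` method that I have not included here; conceptually it compares the expected rating of the reconstruction with the true rating over the movies the user actually rated, roughly along these lines (a sketch under that assumption, not my exact implementation):

import torch

def reconstruction_error(v0: torch.Tensor, vk: torch.Tensor):
    # sketch: v0 is the (batch, N, 10) one-hot input, vk the reconstructed probabilities
    rated = v0.sum(dim=2) > 0                        # (batch, N) mask of rated movies
    true_idx = v0.argmax(dim=2).float()              # index of the true rating
    pred_idx = (vk * torch.arange(10.0)).sum(dim=2)  # expected rating index under vk
    diff = (pred_idx - true_idx)[rated]
    se = (diff ** 2).sum().item()                    # summed squared error
    ae = diff.abs().sum().item()                     # summed absolute error
    n = rated.sum().item()                           # number of rated movies
    return se, ae, n                                 # RMSE = sqrt(se / n), MAE = ae / n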

The problem is that my RBM is incapable of overfitting a single batch of 10 users' ratings. The larger the batch size, the higher the value of the loss function at which it plateaus. I have no problem overfitting a single training example.

Any ideas as to why this might happen? I have included the essential functions used in the training process below. Thanks in advance!

Here are the forward pass and backward pass implementations (all of the snippets below are methods of my RBM class):

# module-level imports used by the snippets below
from typing import Tuple

import math
import torch


def forward_pass(
    self,
    v: torch.Tensor,
    activation=torch.sigmoid,
    sampler=torch.bernoulli,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # v: (batch, N, 10) one-hot ratings; self.w: (N, 10, M); self.h: (M,) hidden biases
    a = torch.mm(v.flatten(-2), self.w.flatten(end_dim=1))
    a = self.h + a
    ph = activation(a)          # hidden unit probabilities, (batch, M)
    return ph, sampler(ph)      # probabilities and sampled binary hidden states


def backward_pass(
    self,
    h: torch.Tensor,
    activation=ratings_softmax,
) -> torch.Tensor:
    # h: (batch, M) hidden states; self.v: (N, 10) visible biases
    a = torch.matmul(self.w, h.t())          # (N, 10, batch)
    pv = self.v.unsqueeze(2) + a
    pv = activation(pv.permute(2, 0, 1))     # (batch, N, 10) rating probabilities
    return pv
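
To make the tensor shapes explicit: with B users per batch, N movies, 10 rating values and M hidden units, the parameters are w: (N, 10, M), h: (M,) and v: (N, 10). A standalone shape check of the two passes above, with illustrative sizes for ml-100k (M = 100 is just an example):

import torch

B, N, K, M = 10, 1682, 10, 100   # illustrative sizes: batch, movies, ratings, hidden units
v = torch.zeros(B, N, K)                                    # one-hot ratings
w = torch.zeros(N, K, M)                                    # weights
h_bias, v_bias = torch.zeros(M), torch.zeros(N, K)          # biases

ph = torch.sigmoid(h_bias + torch.mm(v.flatten(-2), w.flatten(end_dim=1)))                     # (B, M)
pv = torch.softmax((v_bias.unsqueeze(2) + torch.matmul(w, ph.t())).permute(2, 0, 1), dim=-1)   # (B, N, K)
assert ph.shape == (B, M) and pv.shape == (B, N, K)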

The implementation of the Gibbs sampler:

def gibbs_sample(
    self, input: torch.Tensor, t: int = 1
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:

    ph0, h0 = self.forward_pass(input)
    hk = phk = h0

    # do Gibbs sampling for t steps
    for i in range(t):
        vk = self.backward_pass(hk)
        phk, hk = self.forward_pass(vk)

    # reset reconstructions of unrated movies (all-zero rows) back to the input
    vk[input.sum(dim=2) == 0] = input[input.sum(dim=2) == 0]

    return input, ph0, vk, phk

And finally, the apply_gradient method, which amounts to one training step:


def apply_gradient(
    self, minibatch: torch.Tensor, t: int = 1, decay=lambda x: x
) -> Tuple[float, float]:

    v0 = minibatch

    v0, ph0, vt, pht = self.gibbs_sample(v0, t)

    # bias deltas: data minus reconstruction, averaged over the batch
    hb_delta = (ph0 - pht).sum(dim=0) / len(minibatch)
    vb_delta = (v0 - vt).sum(dim=0) / len(minibatch)

    # weight delta: outer product of the two averaged bias deltas
    w_delta = torch.matmul(vb_delta.unsqueeze(2), hb_delta.unsqueeze(0))

    # apply learning rate decay
    self.alpha = decay(self.learning_rate)

    # update the parameters of the model
    self.v += vb_delta * self.alpha
    self.h += hb_delta * self.alpha
    self.w += w_delta * self.alpha

    se, ae, n = self.batch_error(minibatch)
    rmse = math.sqrt(se / n)
    mae = ae / n

    return rmse, mae
