I am currently working on my master's thesis on the use of deep learning models in recommender systems. I am using the ml-20m dataset, but for testing and debugging purposes I am scaling down to ml-100k. I have implemented the RBM almost from scratch in PyTorch.
The architecture of the network is as follows:
The visible layer is an N x 10 matrix of one-hot-encoded movie ratings, where N is the number of movies in the dataset. Each user's ratings form one training example, and there are 10 possible ratings for every movie.
The hidden layer consists of M binary hidden units.
The activation function of the hidden units is a sigmoid, while the activation of the visible units is a 10-way softmax.
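To make the input format concrete, this is roughly how a single user's ratings end up as an N x 10 one-hot matrix (a simplified sketch; the mapping from raw ratings to the 10 levels is illustrative, not my exact preprocessing):

    import torch

    def one_hot_user_ratings(user_ratings, num_movies, num_levels=10):
        # user_ratings: {movie_index: rating_level}, rating_level in 1..num_levels
        # (how raw star ratings map to the 10 levels is illustrative here)
        v = torch.zeros(num_movies, num_levels)
        for movie, level in user_ratings.items():
            v[movie, level - 1] = 1.0  # one-hot along the rating dimension
        return v  # rows of unrated movies stay all zeros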
The loss function I am using is the reconstruction error, in my case RMSE.
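Conceptually, the error is computed only over the movies a user has actually rated, by comparing the expected rating under the reconstructed softmax with the true rating; something along these lines (a sketch, not my exact batch_error implementation, and the mapping back to rating levels is illustrative):

    def reconstruction_rmse(v0, vt):
        # v0: (batch, N, 10) one-hot targets, vt: (batch, N, 10) softmax reconstructions
        rated = v0.sum(dim=2) > 0                      # mask of (user, movie) pairs with a rating
        levels = torch.arange(1, 11, dtype=v0.dtype)   # the 10 rating levels
        target = (v0 * levels).sum(dim=2)[rated]       # true rating level
        pred = (vt * levels).sum(dim=2)[rated]         # expected level under the softmax
        return torch.sqrt(((pred - target) ** 2).mean()).item()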
The problem is that my RBM is incapable of overfitting a single batch of 10 users' ratings. The larger the batch size, the higher the value at which the loss plateaus. I have no problem overfitting a single training example.
Any ideas as to why this might happen? I will supply the essential functions used in the training process below. Thanks in advance!
Here are the forward pass and backward pass implementations:
    def forward_pass(
        self,
        v: torch.Tensor,
        activation=torch.sigmoid,
        sampler=torch.bernoulli,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # v: (batch, N, 10) visible units; self.w: (N, 10, M) weights; self.h: (M,) hidden bias
        a = torch.mm(v.flatten(-2), self.w.flatten(end_dim=1))  # (batch, M) pre-activations
        a = self.h + a
        ph = activation(a)       # hidden unit probabilities
        return ph, sampler(ph)   # probabilities and sampled binary hidden states
    def backward_pass(
        self,
        h: torch.Tensor,
        activation=ratings_softmax,
    ) -> torch.Tensor:
        # h: (batch, M) hidden states; self.v: (N, 10) visible bias
        a = torch.matmul(self.w, h.t())        # (N, 10, batch)
        pv = self.v.unsqueeze(2) + a
        pv = activation(pv.permute(2, 0, 1))   # (batch, N, 10), softmax over the 10 ratings
        return pv
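ratings_softmax just normalizes over the 10 possible ratings of each movie; it is essentially the following (paraphrased, not copied verbatim):

    def ratings_softmax(pv: torch.Tensor) -> torch.Tensor:
        # softmax over the last dimension, i.e. the 10 possible ratings per movie
        return torch.softmax(pv, dim=-1)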
The implementation of the Gibbs sampler:
    def gibbs_sample(
        self, input: torch.Tensor, t: int = 1
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        ph0, h0 = self.forward_pass(input)
        hk = phk = h0
        # do Gibbs sampling for t steps
        for i in range(t):
            vk = self.backward_pass(hk)
            phk, hk = self.forward_pass(vk)
            # clamp reconstructions of unrated movies (all-zero rows) back to the input
            vk[input.sum(dim=2) == 0] = input[input.sum(dim=2) == 0]
        return input, ph0, vk, phk
And finally, the apply_gradient method, which amounts to one training step:
    def apply_gradient(
        self, minibatch: torch.Tensor, t: int = 1, decay=lambda x: x
    ) -> Tuple[float, float]:
        v0 = minibatch
        v0, ph0, vt, pht = self.gibbs_sample(v0, t)
        # batch-averaged differences between data and model statistics
        hb_delta = (ph0 - pht).sum(dim=0) / len(minibatch)
        vb_delta = (v0 - vt).sum(dim=0) / len(minibatch)
        w_delta = torch.matmul(vb_delta.unsqueeze(2), hb_delta.unsqueeze(0))
        # apply learning rate decay
        self.alpha = decay(self.learning_rate)
        # update the parameters of the model
        self.v += vb_delta * self.alpha
        self.h += hb_delta * self.alpha
        self.w += w_delta * self.alpha
        se, ae, n = self.batch_error(minibatch)
        rmse = math.sqrt(se / n)
        mae = ae / n
        return rmse, mae
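For completeness, this is roughly how I call it when trying to overfit a single batch (simplified; the RBM constructor and batch-building helper below are placeholders, not my actual names):

    # simplified overfitting loop; RBM(...) and build_one_hot_batch(...) are hypothetical placeholders
    rbm = RBM(num_movies=1682, num_hidden=100)
    batch = build_one_hot_batch(first_10_users)   # (10, N, 10) tensor of one-hot ratings
    for step in range(1000):
        rmse, mae = rbm.apply_gradient(batch, t=1)
        if step % 100 == 0:
            print(step, rmse, mae)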