
I'm using torch.autograd.Function to create a custom loss function, and I get the following error when running the training loop:

RuntimeError: A view was created in no_grad mode and its base or another view of its base has been modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked).
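
To make the message concrete, here is a minimal standalone snippet (toy tensors, nothing to do with my actual model) that raises the same error:

import torch

x = torch.randn(4, 1, 8, requires_grad=True)
y = x * 2                     # non-leaf tensor tracked by autograd (the "base")

with torch.no_grad():
    v = y.squeeze(1)          # view of y created while grad mode is disabled

y.mul_(2)                     # in-place modification of the base with grad mode enabled

v.sum()                       # using the stale view raises the RuntimeError above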

Loss function

class MaskedHuberLoss(torch.autograd.Function):

    @staticmethod
    def forward(ctx, outputs, targets, masks):
        ''' Computes the loss over a batch of neural net outputs and targets.
        Params:
            outputs: an NxM tensor containing N vectors of values over buckets,
                output by the neural net
            targets: an NxM tensor containing N vectors of actual values over
                buckets, produced by @{data_generation_call}
            masks: an NxM tensor containing N mask vectors generated with
                @{bucket_conversion.get_possible_bucket_mask}
        Return the sum of Huber loss applied elementwise on `outputs` and `targets`,
        masked so that only valid buckets are included'''
        # 0.0 reshape 
        outputs = outputs.squeeze(1)
        targets = targets.squeeze(1)
        masks = masks.squeeze(1)

        batch_size = outputs.size(0)
        feature_size = outputs.size(1)

        # 1.0 zero out the outputs/target so that the error does not depend on these
        outputs.mul_(masks)
        targets.mul_(masks)
        
        loss = smoothL1LossForward(outputs, targets)
        
        # 2.0 if the batch size has changed, create new storage for the sum, otherwise reuse
        mask_placeholder = torch.zeros_like(masks).to(device)
        mask_sum = torch.FloatTensor(batch_size).fill_(0).to(device)
        mask_multiplier = mask_sum.clone().fill_(0).view(-1, 1).to(device)
        
        print("mask_placeholder",mask_placeholder.shape)
        print("masks",masks.shape)

        # 3.0 compute mask sum for each batch
        mask_placeholder.copy_(masks)
        mask_sum = mask_placeholder.sum(dim=1, keepdim=True)
        

        # 3.1 mask multiplier - note that mask is 1 for impossible features
        mask_multiplier.fill_(feature_size)
        mask_multiplier.sub_(mask_sum)
        mask_multiplier.div_(feature_size)
        
        # 4.0 multiply to get a new loss
        # loss is not really computed batch-wise correctly,
        # but that does not really matter now since gradients are correct
        loss_multiplier = (batch_size * feature_size) / (batch_size * feature_size - mask_sum.sum() )
        new_loss = loss_multiplier * loss

        ctx.save_for_backward(outputs, targets, mask_multiplier)
        
        return new_loss

    @staticmethod
    def backward(ctx, grad_out):
        ''' Computes the gradient of the loss function @{forward} with
        arguments `outputs`, `targets`, and `mask`.
        Must be called after a @{forward} call with the same arguments.
        Params:
            outputs: an NxM tensor containing N vectors of values over buckets,
                output by the neural net
            targets: an NxM tensor containing N vectors of actual values over
                buckets, produced by @{data_generation_call}
            masks: an NxM tensor containing N mask vectors generated with
                @{bucket_conversion.get_possible_bucket_mask}
        Return the gradient of @{forward} applied to the arguments'''
        outputs, targets, mask_multiplier = ctx.saved_tensors
        dloss_doutput = smoothL1LossGrad(outputs, targets)
        
        # we use the multiplier computed with the mask during forward call
        dloss_doutput.div_(mask_multiplier.expand_as(dloss_doutput))
        
        return dloss_doutput, None, None
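
For context, `smoothL1LossForward`, `smoothL1LossGrad`, and the `loss_function` used in the training code are not shown in the post. A minimal sketch of what they are presumably like (a standard beta=1 Smooth L1 / Huber sum and its elementwise gradient, with the custom Function bound through `.apply`):

# Assumed definitions, not part of the original post
import torch

def smoothL1LossForward(outputs, targets):
    # sum of elementwise Smooth L1 (Huber, beta=1) terms
    diff = outputs - targets
    abs_diff = diff.abs()
    return torch.where(abs_diff < 1, 0.5 * diff * diff, abs_diff - 0.5).sum()

def smoothL1LossGrad(outputs, targets):
    # elementwise d(loss)/d(outputs): linear inside the quadratic zone, +/-1 outside
    diff = outputs - targets
    return torch.where(diff.abs() < 1, diff, diff.sign())

loss_function = MaskedHuberLoss.apply  # custom autograd Functions are called via .apply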

Training code

train_losses = []
val_losses = []

for epoch in range(config['epochs']):  # loop over the dataset multiple times
    
    # Training
    train_loss = []
    current_lr = optimizer.param_groups[0]['lr']

    # Flag model as training. 
    baseline_model.train()

    print(f"Training epoch {epoch+1}...")
    print(f"Current LR: {current_lr}")

    for i, (inputs, targets, masks) in enumerate(tqdm(train_dataloader)):
        # Transfer data from cpu to gpu
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Reset the gradient
        optimizer.zero_grad()

        # Predict
        y_pred = baseline_model(inputs)

        # Calculate loss
        loss = loss_function(y_pred, targets, masks)
        print(loss)

        # Compute gradient
        loss.backward()
        
        # Update parameters
        optimizer.step()

        # Log stuff
        train_loss.append(loss)
        
    avg_train_loss = torch.stack(train_loss).mean().item()
    train_losses.append(avg_train_loss)

    print(f"Epoch {epoch+1} train loss: {avg_train_loss:.4f}")
    
    # Validation
    baseline_model.eval()
    with torch.no_grad(): # No gradient is required during validation
        print(f"Validating epoch {epoch+1}")
        val_loss = []
        for i, (inputs, y_true, masks) in enumerate(tqdm(val_dataloader)):
            # Transfer data from cpu to gpu
            inputs = inputs.to(device)
            y_true = y_true.to(device)
            
            # Predict
            y_pred = baseline_model(inputs)

            # Calculate loss
            loss = loss_function(y_pred, y_true, masks)

            # Log stuff
            val_loss.append(loss)
        
        avg_val_loss = torch.stack(val_loss).mean().item()
        val_losses.append(avg_val_loss)
        print(f"Epoch {epoch+1} val loss: {avg_val_loss:.4f}")

        # LR adjustment with scheduler
        scheduler.step(avg_val_loss)

        # Save checkpoint if val_loss is the best we got
        best_val_loss = np.inf if epoch == 0 else min(val_losses[:-1])
        if avg_val_loss < best_val_loss:
            # Save whatever you want
            state = {
                'epoch': epoch,
                'model': baseline_model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'scheduler': scheduler.state_dict(),
                'train_loss': avg_train_loss,
                'val_loss': avg_val_loss,
                'best_val_loss': best_val_loss,
            }

1 Answer

I think you need to clone your inputs before doing any in-place operations inside your loss function. Have you tried that? (A short standalone sketch after the code below shows why this avoids the error.)

class MaskedHuberLoss(torch.autograd.Function):

    @staticmethod
    def forward(ctx, outputs, targets, masks):
        ''' Computes the loss over a batch of neural net outputs and targets.
        Params:
            outputs: an NxM tensor containing N vectors of values over buckets,
                output by the neural net
            targets: an NxM tensor containing N vectors of actual values over
                buckets, produced by @{data_generation_call}
            masks: an NxM tensor containing N mask vectors generated with
                @{bucket_conversion.get_possible_bucket_mask}
        Return the sum of Huber loss applied elementwise on `outputs` and `targets`,
        masked so that only valid buckets are included'''
        # 0.0 reshape 
        # clone
        outputs = outputs.clone().squeeze(1)
        targets = targets.clone().squeeze(1)
        masks = masks.clone().squeeze(1)

        batch_size = outputs.size(0)
        feature_size = outputs.size(1)

        # 1.0 zero out the outputs/target so that the error does not depend on these
        outputs.mul_(masks)
        targets.mul_(masks)
        
        loss = smoothL1LossForward(outputs, targets)
        
        # 2.0 if the batch size has changed, create new storage for the sum, otherwise reuse
        mask_placeholder = torch.zeros_like(masks).to(device)
        mask_sum = torch.FloatTensor(batch_size).fill_(0).to(device)
        mask_multiplier = mask_sum.clone().fill_(0).view(-1, 1).to(device)
        
        print("mask_placeholder",mask_placeholder.shape)
        print("masks",masks.shape)

        # 3.0 compute mask sum for each batch
        mask_placeholder.copy_(masks)
        mask_sum = mask_placeholder.sum(dim=1, keepdim=True)
        

        # 3.1 mask multiplier - note that mask is 1 for impossible features
        mask_multiplier.fill_(feature_size)
        mask_multiplier.sub_(mask_sum)
        mask_multiplier.div_(feature_size)
        
        # 4.0 multiply to get a new loss
        # loss is not really computed batch-wise correctly,
        # but that does not really matter now since gradients are correct
        loss_multiplier = (batch_size * feature_size) / (batch_size * feature_size - mask_sum.sum() )
        new_loss = loss_multiplier * loss

        ctx.save_for_backward(outputs, targets, mask_multiplier)
        
        return new_loss
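
For illustration, here is a small standalone sketch (toy tensors, not the asker's model) of why the clone helps: the clone is a fresh tensor rather than a view of the Function's input, so the in-place operations never touch the tensor that autograd recorded.

import torch

x = torch.randn(4, 1, 8, requires_grad=True)
y = x * 2                      # stands in for the network output

with torch.no_grad():
    out = y.clone().squeeze(1) # view of a fresh clone, not a view of y
    out.mul_(0.0)              # in-place edits only touch the clone

y.sum().backward()             # fine: y and its autograd graph were never mutated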