
I'm a newbie to DQNs and am trying to understand the code. I am using the snippet below for epsilon-greedy action selection, but I am not sure how it works:

 
    if sample > eps_threshold:
        with torch.no_grad():
            # max(1) returns the largest column value of each row; the second
            # element of the result is the index at which that max was found,
            # so we pick the action with the larger expected reward.
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        return torch.tensor([[random.randrange(n_actions)]],
                            device=device, dtype=torch.long)

Could you please let me know what the indices in `max(1)[1]` are, and what `view(1, 1)` does and what its arguments mean? Also, why has `with torch.no_grad():` been used?

John Smith

1 Answer


When you train a model, torch has to store all the tensors involved in computing the output in a graph, so that it can then make a backward pass during training; this is computationally expensive. After selecting the action you don't have to train the network, because your only goal here is to pick one using the current weights, so it's just better to use torch.no_grad(). Note that without that part the code would still work the same way, maybe just a bit slower.
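Here is a minimal sketch of the effect; note that the plain Linear layer and the sizes (4 inputs, 2 actions) are just placeholders for the real policy_net, which isn't shown in the question:

    import torch

    # Illustrative stand-in for policy_net (sizes are assumptions).
    policy_net = torch.nn.Linear(4, 2)
    state = torch.randn(1, 4)

    out = policy_net(state)
    print(out.requires_grad)  # True: torch builds a graph for backprop

    with torch.no_grad():
        out = policy_net(state)
    print(out.requires_grad)  # False: no graph is built, saving time and memory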

About the max(1)[1] part, I'm not really sure how the inputs and outputs are shaped, considering that there's only a small portion of code here, but I guess that the model takes batches of states as input and outputs a Q-value for each action. For each of these outputs you then have to take the action that gives you the highest value, so you basically need a max at each row, and that's done by specifying 1 as the axis (or dim, as torch calls it), which represents the columns: at every row you take the max of the corresponding columns, which are the actions in this case. max(1) actually returns two tensors, the max values and the indices at which they were found, so the [1] selects the indices, i.e. the argmax actions; .view(1, 1) then just reshapes that result into a 1x1 tensor so it matches the shape of the random-action branch.
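To make the indexing concrete, here is a small example with made-up Q-values (a batch of 1 state and 3 actions, purely for illustration):

    import torch

    # One row per state in the batch, one column per action.
    q_values = torch.tensor([[0.1, 0.9, 0.3]])

    # max(1) reduces over dim=1 (the columns, i.e. the actions) and returns
    # a pair: the max values and the indices where they were found.
    values, indices = q_values.max(1)
    print(values)   # tensor([0.9000])
    print(indices)  # tensor([1])  -> action 1 has the highest Q-value

    # So max(1)[1] is the argmax action; view(1, 1) reshapes it from
    # shape (1,) to shape (1, 1), matching the torch.tensor([[...]])
    # produced by the random-action branch.
    action = q_values.max(1)[1].view(1, 1)
    print(action)   # tensor([[1]])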

simocasci