How can one trace the CPU-side memory allocated for the autograd graph that the forward pass builds? For instance, trying to use tracemalloc (the tensors live on the GPU, but the graph bookkeeping should be host memory):
import tracemalloc
import torch
import torch.nn as nn

rnn = nn.RNNCell(100, 100).to('cuda')
x = torch.ones((1000, 100), device='cuda')
tracemalloc.start(25)
while True:
    print(tracemalloc.get_traced_memory())
    x = rnn(x)  # x keeps its grad_fn, so the graph grows every iteration
The printed memory should keep increasing, since the graph grows by one step per loop iteration, but the value reported by
tracemalloc.get_traced_memory()
stays constant after the third iteration. What is going on?
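For context, here is a self-contained sanity check I ran (my own illustration, independent of PyTorch) of what tracemalloc can and cannot see: it only records allocations routed through Python's own allocator, so memory that a C/C++ extension obtains directly via malloc or new is invisible to it. This sketch contrasts a Python-level allocation with a raw libc malloc made through ctypes:

```python
import ctypes
import tracemalloc

tracemalloc.start()
base, _ = tracemalloc.get_traced_memory()

# A Python-level allocation goes through Python's allocator and is traced.
buf = bytearray(10_000_000)
after_python, _ = tracemalloc.get_traced_memory()

# A raw C-level allocation bypasses Python's allocator entirely,
# so tracemalloc never sees it (analogous to allocations made
# inside a compiled extension).
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
p = libc.malloc(10_000_000)
after_c, _ = tracemalloc.get_traced_memory()
libc.free(ctypes.c_void_p(p))

print(after_python - base)   # roughly 10 MB: the bytearray is traced
print(after_c - after_python)  # near zero: the malloc is not traced
```

If the autograd graph's nodes are allocated on the C++ side in the same way, that would explain why get_traced_memory() stops moving even as the graph keeps growing.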