I'm trying to implement a Q-learning-based shortest path algorithm. However, for the same origin and destination, I sometimes don't get the same path as a classic shortest path algorithm. Here is how I've modeled the problem (a simplified sketch follows the list):
- Environment: a directed weighted graph G = (V, E)
- State: the current vertex in the graph
- Action: moving to a successor vertex of the current vertex
- Reward: the weight of the edge to the successor vertex
- Episode: the process of reaching the target destination from a specific origin
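To make this modeling concrete, here is a minimal, self-contained sketch of the kind of tabular Q-learning I mean (the graph and names are illustrative, not the exact code from my notebook; in this sketch the reward is the negated edge weight, so that maximizing return corresponds to minimizing path cost):

```python
import random
from collections import defaultdict

# Toy directed weighted graph as an adjacency dict: graph[u][v] = weight of edge (u, v).
graph = {
    'A': {'B': 1, 'C': 4},
    'B': {'C': 2, 'D': 5},
    'C': {'D': 1},
    'D': {},
}

origin, target = 'A', 'D'
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
episodes = 10_000

# Q[state][action]: estimated return of moving from `state` to successor vertex `action`.
Q = defaultdict(lambda: defaultdict(float))

for _ in range(episodes):
    state = origin
    for _ in range(50):                          # cap episode length to avoid endless wandering
        successors = list(graph[state])
        if state == target or not successors:
            break
        # epsilon-greedy choice among the successor vertices
        if random.random() < epsilon:
            action = random.choice(successors)
        else:
            action = max(successors, key=lambda a: Q[state][a])
        # negated edge weight, so shorter (cheaper) paths give higher return
        reward = -graph[state][action]
        next_state = action
        best_next = 0.0 if next_state == target else max(Q[next_state].values(), default=0.0)
        # standard Q-learning update
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state

# Extract the greedy path from the learned Q-table.
path, state = [origin], origin
while state != target and graph[state] and len(path) <= len(graph):
    state = max(graph[state], key=lambda a: Q[state][a])
    path.append(state)
print(path)
```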
I've already tried larger numbers of episodes (such as 1,000,000) and different values for the learning rate and discount factor, but it still doesn't seem to converge. Here is a link to my code: https://colab.research.google.com/drive/1Z84t5_W5wxkX7eXnWp8CdxqhLXMFYzf4?usp=sharing
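For reference, this is roughly how I compare the learned greedy path against the classic result (this assumes the variables `graph`, `origin`, `target`, and `path` from the sketch above, and uses networkx's Dijkstra implementation just for illustration):

```python
import networkx as nx

# Build a networkx DiGraph from the same adjacency dict used in the sketch above.
G = nx.DiGraph()
for u, nbrs in graph.items():
    for v, w in nbrs.items():
        G.add_edge(u, v, weight=w)

classic_path = nx.dijkstra_path(G, origin, target, weight='weight')
print('Dijkstra path:  ', classic_path)
print('Q-learning path:', path)   # greedy path extracted from the Q-table above
```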
Does anyone have any idea what I'm doing wrong, or what I should do to avoid this issue?