I'm trying to implement a Q-learning-based shortest path algorithm. However, for the same origin and destination, I sometimes don't get the same path as a classic shortest path algorithm. Here is how I've modeled the problem (a simplified sketch follows the list):
- Environment: a directed weighted graph G = (V, E)
- State: the current vertex in the graph
- Action: moving to a successor vertex of the current vertex
- Reward: the weight of the edge to the successor vertex
- Episode: the process of reaching the target destination from a specific origin
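To make this modeling concrete, here is a minimal, self-contained sketch of the kind of tabular Q-learning I mean (the graph and names are illustrative, not the exact code from my notebook; in this sketch the reward is the negated edge weight, so that maximizing return corresponds to minimizing path cost):

```python
import random
from collections import defaultdict

# Toy directed weighted graph as an adjacency dict: graph[u][v] = weight of edge (u, v).
graph = {
    'A': {'B': 1, 'C': 4},
    'B': {'C': 2, 'D': 5},
    'C': {'D': 1},
    'D': {},
}

origin, target = 'A', 'D'
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
episodes = 10_000

# Q[state][action]: estimated return of moving from `state` to successor vertex `action`.
Q = defaultdict(lambda: defaultdict(float))

for _ in range(episodes):
    state = origin
    for _ in range(50):                          # cap episode length to avoid endless wandering
        successors = list(graph[state])
        if state == target or not successors:
            break
        # epsilon-greedy choice among the successor vertices
        if random.random() < epsilon:
            action = random.choice(successors)
        else:
            action = max(successors, key=lambda a: Q[state][a])
        # negated edge weight, so shorter (cheaper) paths give higher return
        reward = -graph[state][action]
        next_state = action
        best_next = 0.0 if next_state == target else max(Q[next_state].values(), default=0.0)
        # standard Q-learning update
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state

# Extract the greedy path from the learned Q-table.
path, state = [origin], origin
while state != target and graph[state] and len(path) <= len(graph):
    state = max(graph[state], key=lambda a: Q[state][a])
    path.append(state)
print(path)
```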
I've already tried larger numbers of episodes (such as 1,000,000) and different values for the learning rate and discount factor, but it still doesn't seem to converge. Here is a link to my code: https://colab.research.google.com/drive/1Z84t5_W5wxkX7eXnWp8CdxqhLXMFYzf4?usp=sharing
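For reference, this is roughly how I compare the learned greedy path against the classic result (this assumes the variables `graph`, `origin`, `target`, and `path` from the sketch above, and uses networkx's Dijkstra implementation just for illustration):

```python
import networkx as nx

# Build a networkx DiGraph from the same adjacency dict used in the sketch above.
G = nx.DiGraph()
for u, nbrs in graph.items():
    for v, w in nbrs.items():
        G.add_edge(u, v, weight=w)

classic_path = nx.dijkstra_path(G, origin, target, weight='weight')
print('Dijkstra path:  ', classic_path)
print('Q-learning path:', path)   # greedy path extracted from the Q-table above
```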
Does anyone have any idea what I'm doing wrong, or what I should do to avoid this issue?