I am trying to implement a "Lazy-MDP" agent in my RL algorithm. My reference for this is [Lazy-MDP](https://arxiv.org/pdf/2203.08542.pdf#:~:text=A%20lazy-MDP%20is%20a%20tupleM%2B%3D%20%28M%2C%C2%AFa%2C%C2%AF%CF%80%2C%20%CE%B7%29%2C%20whereM%3D,action%20that%20defers%20decisionmaking%20to%20the%20default%20policy%CF%80%E2%88%88%C2%AF%E2%88%86S). However, I am using a PPO implementation with an actor-critic policy, so I have an estimate of the state value V(s) from the critic and can sample actions from my actor. To decide when to choose the lazy action, I have to calculate the lazy gap from the paper, but that formula is defined in terms of state-action value estimates Q(s, a), which my critic does not give me. Thanks in advance to anyone who can help me further.
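To make the setup concrete, here is a minimal sketch of what I mean (not my actual code; the class and function names are made up, and the form of the gap in `lazy_gap_from_q` is just my reading of the paper, i.e. "best action vs. default policy" measured in Q-values):

```python
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """Hypothetical PPO-style actor-critic: policy logits plus a scalar V(s) head."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits for pi(a|s)
        self.value_head = nn.Linear(hidden, 1)           # V(s), NOT Q(s, a)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)


def lazy_gap_from_q(q_values: torch.Tensor, default_probs: torch.Tensor) -> torch.Tensor:
    """My reading of the lazy gap: value of the best action minus the expected
    value of the default policy pi_bar, both measured with Q(s, .).
    This is the quantity I cannot evaluate, because I only have V(s)."""
    return q_values.max(dim=-1).values - (default_probs * q_values).sum(dim=-1)


if __name__ == "__main__":
    model = ActorCritic(obs_dim=4, n_actions=3)
    obs = torch.randn(1, 4)
    logits, v = model(obs)                                # actor + critic at time t
    action = torch.distributions.Categorical(logits=logits).sample()
    print("sampled action:", action.item(), "V(s):", v.item())
    # Q(s, a) per action is not available here, so lazy_gap_from_q cannot be called
    # without extra machinery (e.g. a separate Q-head or a one-step lookahead).
```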
Since I need to make this decision at time t (in state s_t), I cannot use any information from t+1. I have tried deriving the gap in terms of V(s), but so far I have been unable to, since every derivation I find relies on t+1 information.
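For reference, the only relation I know of that connects the critic's V(s) to the Q(s, a) values the gap needs is the standard one-step identity, and it is exactly where the t+1 dependence comes in:

```latex
% Standard one-step identities (not specific to Lazy-MDPs):
% recovering Q from V seems to require the reward and the next state.
Q^{\pi}(s_t, a_t) = \mathbb{E}\!\left[ r_t + \gamma\, V^{\pi}(s_{t+1}) \mid s_t, a_t \right],
\qquad
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t).
```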