
I'm new to reinforcement learning, and I would like to process audio signals using this technique. As a first exercise, I built a basic step signal that I wish to flatten, to get my hands on OpenAI Gym and reinforcement learning in general.

To do so, I am using the GoalEnv provided by OpenAI Gym, since I know what the target is: the flat signal. Here is an image showing the input and desired signals:

image taken from https://imgur.com/pgdlTWK

The step function calls _set_action, which performs achieved_signal = convolution(input_signal, low_pass_filter) - offset; low_pass_filter takes a cutoff frequency as input as well. The cutoff frequency and the offset are the parameters that act on the observation to produce the output signal. The reward function returns the frame-to-frame L2 norm between the achieved signal and the desired signal, negated to penalize a large norm.
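Boiled down, one environment step amounts to the following (a distilled paraphrase of the step, _set_action and compute_reward methods shown below):

achieved_signal = butter_lowpass_filter(signal, cutoff, sample_rate / 2) - offset  # filter, then subtract offset
reward = -np.linalg.norm(desired_signal - achieved_signal)                         # negated L2 norm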

Following is the environment I created:

import numpy as np
import gym
from gym import spaces
from scipy import signal

def butter_lowpass(cutoff, nyq_freq, order=4):
    normal_cutoff = float(cutoff) / nyq_freq
    b, a = signal.butter(order, normal_cutoff, btype='lowpass')
    return b, a

def butter_lowpass_filter(data, cutoff_freq, nyq_freq, order=4):
    b, a = butter_lowpass(cutoff_freq, nyq_freq, order=order)
    y = signal.filtfilt(b, a, data)
    return y

class StepSignal(gym.GoalEnv):

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignal, self).__init__()

        self.initial_signal = input_signal
        self.signal = self.initial_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        self.distance_threshold = 10e-1

        max_offset = abs(max( max(self.desired_signal) , max(self.signal))
                 - min( min(self.desired_signal) , min(self.signal)) )

        self.action_space = spaces.Box(low=np.array([10e-4, -max_offset]),
                                       high=np.array([self.sample_rate/2 - 0.1, max_offset]),
                                       dtype=np.float16)

        obs = self._get_obs()
        self.observation_space = spaces.Dict(dict(
            desired_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            achieved_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            observation=spaces.Box(-np.inf, np.inf, shape=obs['observation'].shape, dtype='float32'),
        ))

    def step(self, action):
        range = self.action_space.high - self.action_space.low
        action = range / 2 * (action + 1)
        self._set_action(action)
        obs = self._get_obs()
        done = False

        info = {
                'is_success': self._is_success(obs['achieved_goal'], self.desired_signal),
               }
        reward = -self.compute_reward(obs['achieved_goal'],self.desired_signal)
        return obs, reward, done, info

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self._get_obs()


    def _set_action(self, actions):
        actions = np.clip(actions,a_max=self.action_space.high,a_min=self.action_space.low)
        cutoff = actions[0]
        offset = actions[1]
        print(cutoff, offset)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset

    def _get_obs(self):
        obs = self.signal
        achieved_goal = self.signal
        return {
        'observation': obs.copy(),
        'achieved_goal': achieved_goal.copy(),
        'desired_goal': self.desired_signal.copy(),
        }

    def compute_reward(self, goal_achieved, goal_desired):
        d = np.linalg.norm(goal_desired-goal_achieved)
        return d


    def _is_success(self, achieved_goal, desired_goal):
        d = self.compute_reward(achieved_goal, desired_goal)
        return (d < self.distance_threshold).astype(np.float32)
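
As a quick sanity check independent of the wrapper below, the environment can also be instantiated directly and stepped by hand. A minimal sketch, assuming numpy, scipy and gym are imported as above and using toy signals:

sample_rate = 30
x = np.linspace(0, 20, 20 * sample_rate)
in_signal = 0.5 * (np.sign(x - 5) + 9)            # step signal
desired = 3 * np.ones_like(x)                     # flat target

env = StepSignal(in_signal, sample_rate, desired)
obs = env.reset()
obs, reward, done, info = env.step(np.array([0.0, 0.0]))   # agent-style action in [-1, 1]
print(reward, info['is_success'])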

The environment can then be instantiated into a variable and flattened through the FlattenDictWrapper, as advised here: https://openai.com/blog/ingredients-for-robotics-research/ (end of the page).

length = 20
sample_rate = 30 # 30 Hz
in_signal_length = 20*sample_rate # 20sec signal
x = np.linspace(0, length, in_signal_length)

# Desired output
y = 3*np.ones(in_signal_length)
# Step signal
in_signal = 0.5*(np.sign(x-5)+9)

env = gym.make('stepsignal-v0', input_signal=in_signal, sample_rate=sample_rate, desired_signal=y)
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=['observation','desired_goal'])
env.reset()
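
Note that gym.make('stepsignal-v0', ...) only works if the environment has been registered beforehand. A minimal registration sketch, assuming StepSignal is importable from a module called step_signal_env (the module name and max_episode_steps value are placeholders):

from gym.envs.registration import register

register(
    id='stepsignal-v0',
    entry_point='step_signal_env:StepSignal',   # hypothetical module path
    max_episode_steps=100,                      # placeholder episode length
)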

The agent is a DDPG agent from keras-rl, since the actions can take any value in the continuous action_space described in the environment. I wonder why the actor and critic nets need an input with an additional dimension, as in input_shape=(1,) + env.observation_space.shape.

from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, Concatenate
from keras.optimizers import Adam
from rl.agents import DDPGAgent
from rl.memory import SequentialMemory
from rl.random import OrnsteinUhlenbeckProcess
from rl.policy import BoltzmannQPolicy

nb_actions = env.action_space.shape[0]

# Building Actor agent (Policy-net)
actor = Sequential()
actor.add(Flatten(input_shape=(1,) + env.observation_space.shape, name='flatten'))
actor.add(Dense(128))
actor.add(Activation('relu'))
actor.add(Dense(64))
actor.add(Activation('relu'))
actor.add(Dense(nb_actions))
actor.add(Activation('linear'))
actor.summary()

# Building Critic net (Q-net)
action_input = Input(shape=(nb_actions,), name='action_input')
observation_input = Input(shape=(1,) + env.observation_space.shape, name='observation_input')
flattened_observation = Flatten()(observation_input)
x = Concatenate()([action_input, flattened_observation])
x = Dense(128)(x)
x = Activation('relu')(x)
x = Dense(64)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(inputs=[action_input, observation_input], outputs=x)
critic.summary()

# Building Keras agent
memory = SequentialMemory(limit=2000, window_length=1)
policy = BoltzmannQPolicy()
random_process = OrnsteinUhlenbeckProcess(size=nb_actions, theta=0.6, mu=0, sigma=0.3)
agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic, critic_action_input=action_input,
                  memory=memory, nb_steps_warmup_critic=2000, nb_steps_warmup_actor=10000,
                  random_process=random_process, gamma=.99, target_model_update=1e-3)
agent.compile(Adam(lr=1e-3, clipnorm=1.), metrics=['mae'])

Finally, the agent is trained:

import pickle

filename = 'mem20k_heaviside_flattening'
hist = agent.fit(env, nb_steps=10, visualize=False, verbose=2, nb_max_episode_steps=5)
with open('./history_dqn_test_' + filename + '.pickle', 'wb') as handle:
    pickle.dump(hist.history, handle, protocol=pickle.HIGHEST_PROTOCOL)
    agent.save_weights('h5f_files/dqn_{}_weights.h5f'.format(filename), overwrite=True)

Now here is the catch: the agent always seems to be stuck in the same neighborhood of output values across all episodes, for a given instance of my env:

image taken from https://imgur.com/kaKhZNF

The cumulative reward is negative, since I only allowed the agent to receive negative rewards; I took this from https://github.com/openai/gym/blob/master/gym/envs/robotics/fetch_env.py, which is part of OpenAI's code, as an example. Across one episode, I would expect a varying set of actions converging towards a (cutoff_final, offset_final) pair that brings my input step signal close to my flat desired signal, which is clearly not the case. In addition, I thought I should get different actions across successive episodes.

1 Answer


I wonder why the actor and critic nets need an input with an additional dimension, in input_shape=(1,) + env.observation_space.shape

I think the GoalEnv is designed with HER (Hindsight Experience Replay) in mind, since it uses the "sub-spaces" inside the observation_space to learn from sparse reward signals (there is a paper on the OpenAI website that explains how HER works). I haven't looked at the implementation, but my guess is that an additional input is needed because HER also processes the "goal" parameter.

Since it seems you are not using HER (which works with any off-policy algorithm, including DQN, DDPG, etc.), you should handcraft an informative reward function (i.e. rewards that are not binary, e.g. 1 if the objective is achieved, 0 otherwise) and use the base Env class. The reward should be calculated inside the step method; since rewards in MDPs are functions like r(s, a, s'), you will probably have all the information you need there. Hope it helps.
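
For concreteness, here is a minimal sketch of that suggestion: a plain gym.Env with a dense reward computed inside step, reusing butter_lowpass_filter from the question. The action bounds and the rescaling from [-1, 1] are illustrative, not prescriptive:

import gym
import numpy as np
from gym import spaces

class StepSignalEnv(gym.Env):
    """Plain gym.Env variant: dense (non-binary) reward computed directly in step()."""

    def __init__(self, input_signal, sample_rate, desired_signal):
        super().__init__()
        self.initial_signal = input_signal
        self.signal = input_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=input_signal.shape, dtype=np.float32)

    def step(self, action):
        # Map the agent's [-1, 1] output to a physical cutoff and offset (illustrative bounds).
        cutoff = 1e-3 + (action[0] + 1) / 2 * (self.sample_rate / 2 - 0.1)
        offset = float(action[1]) * 2.0
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate / 2) - offset
        reward = -np.linalg.norm(self.desired_signal - self.signal)   # dense, informative reward
        return self.signal.copy(), reward, False, {}

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self.signal.copy()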

João Pedro