
I'm struggling to figure out how to approach this, so I hope someone here can offer some guidance.

Scenario - I have a 10-character string, let's call it the DNA, made up of the following characters:

F
-
+
[
]
X

For example, DNA = ['F', 'F', '+', '+', '-', '[', 'X', '-', ']', '-']

Now these DNA strings get converted to physical representations from which I can get a fitness or reward value. So an RL flowchart for this scenario would look like this:

P.S. The maximum fitness is not known or specified.

Step 1: Get random DNA string

Step 2: Compute fitness

Step 3: Get another random DNA string

Step 4: Compute fitness

Step 5: Compute gradient and see which way is up

Step 6: Train an ML algorithm to generate better and better DNA strings until the fitness no longer increases

For clarity's sake, the best DNA string, i.e. the one that will return the highest fitness, for my purposes right now is:
['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']
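
To make this concrete, here is a stand-in fitness I can test against (the real fitness would come from evaluating the physical representation, so this match-counting version is purely hypothetical), together with the naive random-search version of steps 1-4:

    import random

    ALPHABET = ['F', '-', '+', '[', ']', 'X']
    TARGET = ['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']

    def fitness(dna):
        # Stand-in black-box fitness: count positions matching the
        # known-best string. The real fitness comes from the physical
        # representation of the DNA.
        return sum(a == b for a, b in zip(dna, TARGET))

    def random_dna(length=10):
        # Steps 1 and 3: draw a uniformly random DNA string.
        return [random.choice(ALPHABET) for _ in range(length)]

    # Steps 1-4 as a naive random-search baseline: keep the best string seen.
    best_dna, best_fit = None, float('-inf')
    for _ in range(1000):
        dna = random_dna()
        fit = fitness(dna)          # steps 2 and 4: compute fitness
        if fit > best_fit:
            best_dna, best_fit = dna, fit

    print(best_fit, best_dna)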

How can I train a ML algorithm to learn this and output this DNA string?

I'm trying to wrap my brain around policy gradient methods, but what will the input to the ML algorithm be? There are no states like in the OpenAI Gym examples.
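
For concreteness, the only framing I can picture is to treat the whole thing as a single-step episode: the "state" is a constant dummy input and the policy is just a learned distribution over strings. A minimal REINFORCE-style sketch in plain numpy, reusing the stand-in fitness above (I'm not sure this is the right framing):

    import numpy as np

    ALPHABET = ['F', '-', '+', '[', ']', 'X']
    TARGET = ['F', 'X', 'X', 'X', 'X', 'F', 'X', 'X', 'X', 'X']
    L, K = 10, len(ALPHABET)

    def fitness(dna):
        # Stand-in fitness as above: matches against the known-best string.
        return sum(a == b for a, b in zip(dna, TARGET))

    rng = np.random.default_rng(0)
    logits = np.zeros((L, K))   # policy parameters: one categorical per position
    baseline, lr = 0.0, 0.1

    for _ in range(2000):
        # Softmax over each position's logits gives the sampling distribution.
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        actions = np.array([rng.choice(K, p=probs[i]) for i in range(L)])
        reward = fitness([ALPHABET[a] for a in actions])
        baseline += 0.05 * (reward - baseline)      # running-mean baseline
        # REINFORCE: grad of log pi(a) for a categorical is one_hot(a) - probs.
        grad = -probs
        grad[np.arange(L), actions] += 1.0
        logits += lr * (reward - baseline) * grad

    print([ALPHABET[i] for i in logits.argmax(axis=1)])   # greedy string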

EDIT: Final goal - an algorithm that learns to generate DNA strings with higher fitness values. This has to happen without any human supervision, i.e. NOT supervised learning but reinforcement learning.

Akin to a GA that will evolve better and better DNA strings.

Izak Joubert
  • What is the end goal of the experiment? Do you want to generate DNA strings with better fitness values? – nsidn98 Jul 25 '19 at 10:33
  • Yes. The network must come to understand the relationship between the DNA and the fitness and then generate optimal DNA strings based on the fitness function. – Izak Joubert Jul 25 '19 at 10:38
  • As far as I have understood your problem, there is no input. You just need a black box to generate DNA strings which have a higher fitness value. Please correct me if I am wrong. The problem becomes interesting if you want to train a model which learns to mutate a given string into another string with a higher fitness value. Please add more information about the exact problem you want to solve. – nsidn98 Jul 25 '19 at 12:05
  • @nsidn98 Better? – Izak Joubert Jul 26 '19 at 07:04

1 Answer


Assuming that the problem is to mutate a given string into another string that has a higher fitness value, the Markov Decision Process can be modeled as follows (a minimal environment sketch follows the list):

  • Initial State: A random DNA string.
  • Action: Mutate into another string which is similar to the original one but (ideally) with a higher fitness value.
  • State: The strings generated by the agent.
  • Done Signal: When more than 5 (can set to any value) characters are changed in the original random string at the start of the episode.
  • Reward: fitness(next_state) - fitness(state) + similarity(state, next_state), OR simply fitness(next_state) - fitness(state)
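
A minimal sketch of this MDP as an environment (the fitness function is the black box you already have; here the action is read as "pick a position and a replacement character", which is one possible encoding):

    import random

    ALPHABET = ['F', '-', '+', '[', ']', 'X']

    class DNAEnv:
        # Mutation MDP: the state is the current string; an action replaces
        # one character; the episode ends once more than max_changes
        # positions differ from the string the episode started with.

        def __init__(self, fitness, length=10, max_changes=5):
            self.fitness = fitness          # your black-box fitness function
            self.length = length
            self.max_changes = max_changes

        def reset(self):
            self.start = [random.choice(ALPHABET) for _ in range(self.length)]
            self.state = list(self.start)
            return tuple(self.state)

        def step(self, action):
            pos, char = action              # (position, replacement character)
            prev = list(self.state)
            self.state[pos] = char
            # Reward: fitness improvement; a similarity(state, next_state)
            # bonus could be added here as well.
            reward = self.fitness(self.state) - self.fitness(prev)
            changed = sum(a != b for a, b in zip(self.state, self.start))
            done = changed > self.max_changes
            return tuple(self.state), reward, done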

You could start with Q-learning with discrete actions of dimension 10, each with 6 choices: (F, -, +, [, ], X).
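
With that environment, a tabular epsilon-greedy Q-learning loop over the 10 x 6 = 60 (position, character) actions could look like the sketch below. This is only a starting point: the state space has 6^10 strings, so a function approximator (e.g. a DQN) may be needed in practice.

    from collections import defaultdict
    import random

    ALPHABET = ['F', '-', '+', '[', ']', 'X']
    ACTIONS = [(pos, ch) for pos in range(10) for ch in ALPHABET]   # 60 actions

    def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
        Q = defaultdict(float)          # Q[(state, action)], zero-initialized
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < eps:
                    action = random.choice(ACTIONS)                      # explore
                else:
                    action = max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit
                next_state, reward, done = env.step(action)
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
                Q[(state, action)] += alpha * (reward + gamma * best_next
                                               - Q[(state, action)])
                state = next_state
        return Q

    # Usage, with the DNAEnv and a black-box fitness as sketched above:
    # Q = q_learning(DNAEnv(fitness))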

nsidn98
  • I'd thought about making each character position in the 10-character string a 'state' to implement Q-learning. I will try that, thanks. – Izak Joubert Jul 26 '19 at 09:43