
Why specifically is it used?

I know it increases variation which may help explore the problem space, but how much does it increase the probability of finding the optimal solution/configuration in time? And does it do anything else advantageous?

And does it necessarily always help, or are there instances in which it would increase the time taken to find the optimal solution?

  • Say the goal is `x + y`, and an individual is an `x, y` pair; with crossover you may get a very good child from a pair of parents such that in the first `x >> y` and in the second `y >> x`. Without crossover, there is no interchange of genome between the offspring of the first individual and the offspring of the second, which means that the two family branches would have to improve the smaller value independently of one another, via random mutations. – Patrick Trentin Feb 15 '18 at 15:35
  • 1
    Crossover does not increase the probability of finding the optimal solution per se, but it can help increasing the speed at which the search converges at a potential optimal solution. This of course depends on the particular application, on the fitness function being used and the genetic algorithm being employed. – Patrick Trentin Feb 15 '18 at 15:39
  • 1
    The end goal in GA is to bring the population to convergence. Crossover is a convergence operation, it drives the population towards a local min/max. (Mutation on the other side is a divergence operation its used to discover a better minimum/maximum space). – Panos Kalatzantonakis Feb 16 '18 at 11:23
  • Very good question. And apparently, still open to debate. In my experiments, crossover is an unnecessary complication. But I advise people to just try it with/without crossover and judge for themselves. – Ray Feb 17 '18 at 06:42
  • The likelihood of finding the optimal solution is unchanged, because genetic algorithms are by construction infinite heuristic procedures which don't provide any guarantees, although they work well in practice. I disagree with @PanosKal. that crossover is a convergence operator: it's the selection step that makes the population converge towards the minima/maxima; crossover is a redistribution operator of genome among individuals that takes jumps within the genome abstraction. Mutation jumps are random and can cover the whole space; crossover jumps are directed and limited to the local space. – Patrick Trentin Feb 18 '18 at 08:20
  • Well, the selection of the parameters depends on the task at hand. Some problems have fitness landscapes that come straight from hell. Anyway, @PatrickTrentin, setting aside my experience in the field, you disagree not only with me but with the bibliography of many, many researchers. – Panos Kalatzantonakis Feb 18 '18 at 11:09
  • @PanosKal. with all due respect, the only operation that consistently decreases the diversity in the genetic pool is selection, so that's the only operation that makes the overall population converge. You are welcome to cite your bibliography; that won't change this simple mathematical fact. – Patrick Trentin Feb 18 '18 at 12:10
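The `x + y` example from the comments above can be sketched in a few lines of Python. This is a hypothetical toy (the names and values are illustrative, not from the thread): one parent is strong in `x`, the other in `y`, and a single crossover combines both strengths at once, while mutation alone would have to improve each weak gene step by step.

```python
# Toy illustration: individuals are (x, y) pairs, fitness is x + y.
parent_a = (9, 1)   # x >> y
parent_b = (2, 8)   # y >> x

def crossover(p1, p2):
    # Single-point crossover on a 2-gene genome: each child takes x from
    # one parent and y from the other.
    return (p1[0], p2[1]), (p2[0], p1[1])

def fitness(ind):
    return ind[0] + ind[1]

child1, child2 = crossover(parent_a, parent_b)
print(fitness(parent_a), fitness(parent_b))  # 10 10
print(fitness(child1), fitness(child2))      # 17 3
```

One child inherits both good genes and jumps well past either parent; its sibling inherits both bad genes and is quickly discarded by selection.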

1 Answer


As Patrick Trentin said, crossover improves the speed of convergence, because it allows combining good genes that have already been found in the population.

But for neuro-evolution, crossover faces the "permutation problem", also known as "the competing conventions problem". When two parents are permutations of the same network, then, except in rare cases, their offspring will have a lower fitness, because the same part of the network is copied into two different locations, and so the offspring loses viable genes for one of those two locations.

For example, take the networks A,B,C,D and D,C,B,A, which are permutations of the same network. Picking each gene from either parent, the offspring can be:

A,B,C,D (copy of parent 1)           
D,C,B,A (copy of parent 2)

A,C,B,D OK
A,B,C,A
A,B,B,A 
A,B,B,D
A,C,B,A
A,C,C,A
A,C,C,D

D,B,C,A OK
D,C,B,D
D,B,B,A 
D,B,B,D
D,B,C,D
D,C,C,A
D,C,C,D

So, for this example, 2/16 of the offspring are copies of the parents, 2/16 are combinations without duplicates, and 12/16 have duplicated genes.
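The 16 offspring listed above can be enumerated programmatically, which makes the 2/16, 2/16, 12/16 breakdown easy to verify (a small sketch; the gene-per-gene choice corresponds to uniform crossover):

```python
from itertools import product

p1 = ["A", "B", "C", "D"]
p2 = ["D", "C", "B", "A"]  # a permutation of the same network

# Every way of taking each gene from either parent: 2^4 = 16 offspring.
offspring = []
for mask in product([0, 1], repeat=4):
    child = [p1[i] if bit == 0 else p2[i] for i, bit in enumerate(mask)]
    offspring.append(child)

copies = sum(child in (p1, p2) for child in offspring)
no_dupes = sum(len(set(child)) == 4 for child in offspring)  # all 4 genes kept
duplicated = len(offspring) - no_dupes

# copies of a parent, valid recombinations, children with duplicated genes
print(copies, no_dupes - copies, duplicated)  # 2 2 12
```

The count of children with duplicated genes only grows with the genome size, which is why this effect dominates for real networks.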

The permutation problem occurs because networks that are permutations of one another have the same fitness. So, even for an elitist GA, if one is selected as a parent, the other will also often be selected as a parent.

The permutations may also be only partial. In this case, the result is better than for complete permutations, but the offspring will still, in many cases, have a lower fitness than the parents.

To avoid the permutation problem, I have heard about similarity-based crossover, which computes the similarity of neurons and their connected synapses, and does the crossing-over between the most similar neurons instead of a crossing-over based on the locus.
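The matching step of that idea can be sketched as follows. This is a minimal illustration of the principle, not the paper's exact algorithm: hidden neurons are compared by the cosine similarity of their incoming-weight vectors, and each neuron of one parent is greedily paired with its most similar unused neuron in the other parent, so that crossover can swap matched neurons instead of swapping by position.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each parent: hidden-layer weight matrix of shape (n_hidden, n_inputs).
wa = rng.normal(size=(3, 4))
wb = wa[[2, 0, 1]]  # parent B is a permutation of parent A's neurons

def match_neurons(wa, wb):
    # Cosine similarity between every pair of neurons (rows).
    sim = (wa @ wb.T) / (
        np.linalg.norm(wa, axis=1)[:, None] * np.linalg.norm(wb, axis=1)[None, :]
    )
    pairing, used = [], set()
    # Greedy: handle the most confident rows first, pick the best unused match.
    for i in np.argsort(-sim.max(axis=1)):
        j = max((j for j in range(len(wb)) if j not in used),
                key=lambda j: sim[i, j])
        pairing.append((int(i), j))
        used.add(j)
    return sorted(pairing)

print(match_neurons(wa, wb))  # [(0, 1), (1, 2), (2, 0)]
```

Here the matching recovers the permutation exactly, so a crossover on matched pairs would no longer duplicate genes the way the locus-based crossover above does.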

When evolving the topology of the networks, some NEAT specialists think the permutation problem is part of a broader problem: the "variable length genome problem". NEAT seems to avoid this problem by speciation of the networks: when two networks are too different in topology and weights, they aren't allowed to mate. So the NEAT algorithm seems to consider permuted networks as too different, and doesn't allow them to mate.

A website about NEAT also says:

However, in another sense, one could say that the variable length genome problem can never be "solved" because it is inherent in any system that generates different constructions that solve the same problem. For example, both a bird and a bat represent solutions to the problem of flight, yet they are not compatible since they are different conventions of doing the same thing. The same situation can happen in NEAT, where very different structures might arise that do the same thing. Of course, such structures will not mate, avoiding the serious consequence of damaged offspring. Still, it can be said that since disparate representations can exist simultaneously, incompatible genomes are still present and therefore the problem is not "solved." Ultimately, it is subjective whether or not the problem has been solved. It depends on what you would consider a solution. However, it is at least correct to say, "the problem of variable length genomes is avoided."

Edit: To answer your comment.

You may be right about similarity-based crossover; I'm not sure it totally avoids the permutation problem.

About the ultimate goal of crossover, without considering the permutation problem, I'm not sure it is useful for the evolution of neural networks, but my thought is: if we divide a neural network into several parts, each part contributes to the fitness, so two networks with a high fitness may have different good parts. Combining these parts should create an even better network. Some offspring will of course inherit the bad parts, but other offspring will inherit the good parts.

Like Ray suggested, it could be useful to experiment with the evolution of neural networks with and without crossover. As there is randomness in the evolution, the idea is to run a large number of tests and compute the average evolution speed.
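Such an experiment harness could look like the sketch below. All names and parameters are illustrative; it compares a GA with and without single-point crossover on OneMax (maximize the number of 1-bits), averaging the best final fitness over many independently seeded runs, as suggested.

```python
import random

def run_ga(use_crossover, n_bits=32, pop_size=20, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fitness = lambda ind: sum(ind)  # OneMax: count of 1-bits
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            if use_crossover:
                cut = rng.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]  # single-point crossover
            else:
                child = list(p1)             # clone a parent
            child[rng.randrange(n_bits)] ^= 1  # one-bit mutation
            children.append(child)
        pop = children
    return max(map(fitness, pop))

def average_best(use_crossover, runs=20):
    # Average over independently seeded runs to smooth out the randomness.
    return sum(run_ga(use_crossover, seed=s) for s in range(runs)) / runs

print(average_best(True), average_best(False))
```

For a real comparison you would record the fitness per generation (not just the final best) to compare convergence speed, and use the same seeds for both variants so the comparison is paired.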

About evolving something other than a neural network, I found a paper that says an algorithm using crossover outperforms a mutation-only algorithm for solving the all-pairs shortest path problem (APSP).

Edit 2:

Even if the permutation problem seems to apply only to some particular problems like neuro-evolution, I don't think we can say the same about crossover in general, because maybe we are missing something about the problems that don't seem to be suitable for crossover.

I found a free version of the paper about similarity-based crossover for neuro-evolution, and it shows that:

  • an algorithm using a naive crossover performs worse than a mutation-only algorithm;

  • using similarity-based crossover, it performs better than a mutation-only algorithm in all tested cases;

  • the NEAT algorithm sometimes performs better than a mutation-only algorithm.

Crossover is complex and I think there is a lack of studies comparing it with mutation-only algorithms, maybe because its usefulness depends heavily:

  • on its engineering for the particular problem at hand (such as the permutation problem), and so on the type of crossover used (similarity-based, single-point, uniform, edge recombination, etc.);

  • and on the mating algorithm. For example, this paper shows that a gendered genetic algorithm strongly outperforms a non-gendered genetic algorithm for solving the TSP. For two other problems the algorithm doesn't outperform as strongly, but it is still better than the non-gendered GA. In this experiment, males are selected on their fitness, and females are selected on their ability to produce good offspring. Unfortunately, this study doesn't compare the results with a mutation-only algorithm.

Bertrand125
  • I edited my answer to talk about the ultimate goal of crossover. – Bertrand125 Feb 18 '18 at 06:35
  • 1
    I think the most important part of this answer is where you highlight the need of engineering crossover around the specific genome being used, **i.e.** as you say, in the case of neural networks one should use a [similarity based crossover](https://www.sciencedirect.com/science/article/pii/S0925231213006929). Here the problem is not the crossover operator, but the fact that it is applied blindly.. so the real question is: is there a better way to encode neural networks within a GA, so they don't break the structure of NN? -- +1 – Patrick Trentin Feb 18 '18 at 08:03
  • 1
    I made a second edit to my answer, because my text is too long for a comment. – Bertrand125 Feb 18 '18 at 23:04