Genetic algorithm for feature selection

Question

I am working on a heart disease prediction system. I am using the "Cleveland Heart Disease Dataset", which contains 13 attributes:

Sex
Chest Pain Type
Fasting Blood Sugar
Restecg – resting electrocardiographic results
Exang – exercise induced angina
Slope - the slope of the peak exercise ST segment
CA – number of major vessels colored by fluoroscopy
Thal
Trest Blood Pressure
Serum Cholesterol
Thalach – maximum heart rate achieved
Oldpeak – ST depression induced by exercise relative to rest
Age

I found a paper where the authors applied a genetic algorithm for this purpose and selected the following attributes:

Type - Chest Pain Type
Rbp - Resting blood pressure
Eia - Exercise induced angina
Oldpk - Old peak
Vsl - No. of vessels colored
Thal - Maximum heart rate achieved

However, they didn't mention the criteria they used to find the fittest attributes (the fitness function). Since I am new to this concept, I don't have any idea how to perform the task. Can anyone help me?

I would email the authors and ask them what fitness function they used. There are many ways to combine features for fitness in a GA, and the fitness function you use may well affect the features you end up selecting. — Timothy Jones, Jan 10 '12 at 06:33
[This question may be useful for you](http://stackoverflow.com/questions/7992862/genetic-algorithms-fitness-function-for-feature-selection-algorithm) — Timothy Jones, Jan 10 '12 at 06:36

helios · Accepted Answer · 2012-01-10T13:56:23.810

Defining the population and its representation

The candidates (the population of the GA) are the different subsets of attributes. Each subset may or may not be a good set of indicators for heart disease.

So I understand you have data with measurements for each attribute, plus an indicator of whether the measured person has heart disease or not.

You can easily represent a subset of attributes using one bit per attribute. So 1000000000000 is the subset with only the first attribute, 1100000000000 is the subset with only the first two, and so on.
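As a minimal sketch of this representation: the attribute names and their order below are assumptions based on the list in the question, and `decode` is a hypothetical helper, not part of any library.

```python
# Attribute order is an assumption; any fixed order works as long as it never changes.
ATTRIBUTES = [
    "sex", "chest_pain_type", "fasting_blood_sugar", "restecg", "exang",
    "slope", "ca", "thal", "trest_blood_pressure", "serum_cholesterol",
    "thalach", "oldpeak", "age",
]


def decode(chromosome):
    """Return the names of the attributes whose bit is '1'."""
    if len(chromosome) != len(ATTRIBUTES):
        raise ValueError("expected one bit per attribute")
    return [name for bit, name in zip(chromosome, ATTRIBUTES) if bit == "1"]


print(decode("1000000000000"))  # ['sex']
print(decode("1100000000000"))  # ['sex', 'chest_pain_type']
```

Crossover and mutation can then work directly on the bit string, and `decode` is only needed when a candidate has to be scored.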

Find the fitness function

How do you tell whether a candidate (a subset of attributes) is a good or bad indicator of heart disease? I would say it's good if it's directly correlated with the disease: patients with high numbers on those indicators have the disease, and patients with low numbers don't.

TODO: find a correlation measure... :) (I'll edit the answer)

A subset with more indicators than necessary is bad. So a subset has to score worse if one of its attributes is NOT correlated.

TODO: find a way to introduce this.
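One possible way to introduce this, offered only as an assumption and not as part of the method described here, is to subtract a small penalty for each selected attribute from whatever raw score the subset gets. The names `penalized_fitness` and `penalty_per_attribute` are hypothetical, and the penalty value is arbitrary.

```python
def penalized_fitness(raw_score, chromosome, penalty_per_attribute=0.01):
    """Lower the raw score by a fixed amount for each selected attribute.

    raw_score: score of the subset before the penalty (for example, between 0 and 1).
    chromosome: bit string with one character per attribute.
    """
    n_selected = chromosome.count("1")
    return raw_score - penalty_per_attribute * n_selected


# Two subsets with the same raw score: the smaller one ends up fitter.
small = penalized_fitness(0.80, "1100000000000")
large = penalized_fitness(0.80, "1111110000000")
print(small > large)  # True
```

With a penalty like this, an attribute that does not raise the raw score can only make the subset score worse, which is the behaviour asked for above.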

Two directions

Also, I would take the two directions into account. For example, an attribute can be related to heart disease when it has a low number. So I would use 26 bits: two bits for each indicator, one using the attribute value and the other using its negative.
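A sketch of how such a 26-bit candidate could be decoded. The bit layout (even positions mean "use the value", odd positions mean "use the negated value"), the attribute names, and the helper `decode_signed` are all assumptions made for illustration.

```python
ATTRIBUTES = [
    "sex", "chest_pain_type", "fasting_blood_sugar", "restecg", "exang",
    "slope", "ca", "thal", "trest_blood_pressure", "serum_cholesterol",
    "thalach", "oldpeak", "age",
]


def decode_signed(chromosome):
    """Map a 26-bit string to (attribute, direction) pairs.

    Bit 2*i selects attribute i as is (+1); bit 2*i + 1 selects its negation (-1).
    """
    if len(chromosome) != 2 * len(ATTRIBUTES):
        raise ValueError("expected two bits per attribute")
    selected = []
    for i, name in enumerate(ATTRIBUTES):
        if chromosome[2 * i] == "1":
            selected.append((name, +1))
        if chromosome[2 * i + 1] == "1":
            selected.append((name, -1))
    return selected


chromosome = "10" + "01" + "00" * 11
print(decode_signed(chromosome))  # [('sex', 1), ('chest_pain_type', -1)]
```

Multiplying each patient's value by the direction before ranking makes "low value means disease" look the same as "high value means disease" to the scoring step.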

Finding a fitness measure

With the statistical data you can tell whether an arbitrary set of attributes is good for detecting heart disease or not.

Each patient will be first, second, and so on according to each attribute. Take blood pressure, for example: the patient with the lowest pressure will be first, and the one with the highest pressure will be last.

So if blood pressure is highly related, those with high values will have the disease while those with low pressure will not.

So a good score for a set of attributes is how many correct diagnoses you could make based on the data you have. If you have attributes A and B, their score as good indicators will increase with the number of patients with high numbers and heart disease (related), and will decrease with the number of patients with low numbers and heart disease (unrelated or contradictory).

For a single attribute

I can order patients based on that attribute. Then I can see which of them have the disease. If those with higher numbers (to the right of the ordering) have the disease, then it's related. Otherwise it's not.

If I obtain:

ND ND ND ND ND D D D D D D

ND = no disease
D = disease

It's very strongly related.

So the score, for me, will be how ordered the ND/D values are after ordering the patients by their value on that attribute.
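The ordering step for one attribute could be sketched like this. The readings and diagnoses are made up for illustration, and `labels_in_attribute_order` is a hypothetical helper name.

```python
def labels_in_attribute_order(values, has_disease):
    """Sort patients by one attribute and return their ND/D labels in that order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return ["D" if has_disease[i] else "ND" for i in order]


# Made-up blood pressure readings and diagnoses, one entry per patient.
pressure = [150, 110, 165, 120, 140, 115]
disease = [True, False, True, False, True, False]

print(" ".join(labels_in_attribute_order(pressure, disease)))
# ND ND ND D D D
```

In this toy data every patient with the disease has a higher reading than every patient without it, so the labels come out perfectly ordered.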

For a set of attributes

Of course, you have to give a score to a set of attributes (let's say, the first three attributes of the list). So I would first order the patients by each one of them:

Ordered by -> Attr1, Attr2, Attr3

Patient1       1st    3rd    10th
Patient2       2nd    11th   2nd
Patient3       6th    1st    3rd

And then sum the positions for each patient:

Ordered by -> Attr1, Attr2, Attr3

Patient1       1st    3rd    10th  ->  1 + 3 + 10 = 14
Patient2       2nd    11th   2nd   ->  2 + 11 + 2 = 15
Patient3       6th    1st    3rd   ->  6 + 1 + 3  = 10

And then order the patients by that sum.

P3, P1, P2
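The same arithmetic as a short sketch, using the positions from the table. The dictionary layout and variable names are just one assumed way to hold the data.

```python
# Positions copied from the table: patient -> position under Attr1, Attr2, Attr3.
positions = {
    "P1": [1, 3, 10],
    "P2": [2, 11, 2],
    "P3": [6, 1, 3],
}

# Sum the positions for each patient, then order the patients by that sum.
rank_sums = {patient: sum(ranks) for patient, ranks in positions.items()}
ordering = sorted(rank_sums, key=rank_sums.get)

print(rank_sums)  # {'P1': 14, 'P2': 15, 'P3': 10}
print(ordering)   # ['P3', 'P1', 'P2']
```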

Then if their disease status is highly ordered (those with the disease are on the right), the score is high.

For example:

ND ND D -> only patient 2 has disease, highly correlated
D D ND -> patients 3 and 1 have disease, doesn't seem correlated (in fact, it seems contradictory)

So the last part of defining a scoring method is finding a way to say whether a sequence of bits is ordered or not:

ND ND ND ND D D D D D D -> high score
D ND D ND D ND D ND D ND -> low score
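One candidate measure, given here as an assumption since no specific measure is fixed above, is the fraction of (ND, D) pairs in which the ND patient comes before the D patient. This is the same quantity as a normalized Mann-Whitney U statistic. The function name `orderedness` is hypothetical.

```python
def orderedness(labels):
    """Fraction of (ND, D) pairs in which the ND patient comes before the D patient.

    1.0 means every ND precedes every D, 0.0 means the reverse,
    and values near 0.5 mean there is no clear order.
    """
    nd_seen = 0     # ND labels encountered so far
    concordant = 0  # (ND, D) pairs with the ND on the left
    for label in labels:
        if label == "ND":
            nd_seen += 1
        else:
            concordant += nd_seen
    n_nd = nd_seen
    n_d = len(labels) - n_nd
    if n_nd == 0 or n_d == 0:
        return 0.5  # no pairs to compare; treating this as uninformative is an arbitrary choice
    return concordant / (n_nd * n_d)


print(orderedness("ND ND ND ND D D D D D D".split()))   # 1.0
print(orderedness("D ND D ND D ND D ND D ND".split()))  # 0.4
```

A score near 0 is also informative: it means the attribute is related in the opposite direction, which is the case the negated bits are meant to capture.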

Hope it helps! :)

Oops, also, for boolean attributes like sex you could use 0/1. For non-scalar values like chest-pain type, well, maybe you can make different boolean attributes like ¿has-pain-1? ¿has-pain-2? and so on. — helios, Jan 10 '12 at 13:58
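A sketch of that split into boolean attributes. The four category codes are an assumption about how chest-pain type is coded in the data, and `one_hot` is a hypothetical helper.

```python
def one_hot(value, categories):
    """Return one 0/1 attribute per category; exactly one of them is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Assumed coding: chest-pain type takes one of four integer codes.
PAIN_TYPES = [1, 2, 3, 4]

print(one_hot(3, PAIN_TYPES))  # [0, 0, 1, 0]
```

Each resulting 0/1 column can then get its own bit (or pair of bits) in the candidate, like any other attribute.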

score 0 · Answer 2 · answered Jan 10 '12 at 08:03

Since you are the researcher, you should really be able to say what you are trying to achieve. The "fitness" is how closely a solution matches what you are trying to achieve, e.g. "fitness" in this case could be a function which most closely matches the prediction.

Peter Lawrey

Thank you so much for the reply. That is what I am trying to figure out. I don't have an idea of how to develop a fitness function. If you give me an idea, that will be helpful for me to proceed. – darsha Jan 10 '12 at 09:06
To define a fitness function, you need to be able to describe what it is you are looking for. Only you can do that. – Peter Lawrey Jan 10 '12 at 09:07

score 0 · Answer 3 · edited May 23 '17 at 11:59

To find out what fitness function the other authors used, you can always email them.

There are many ways to combine features for fitness in a GA, and the fitness function you use will affect the features you end up selecting. So if you want to achieve the same combination of features as another group of authors, I'd just ask them. Most scientists are very helpful to others interested in their work.

In my experience, sometimes you might not get a reply - so don't feel bad about asking again if you have to. Depending on the rules at their institution, they may even have code you can use, but you won't know until you ask.

However, if you just want some way to reduce the number of features in a set, the answer on this question might be helpful.
