What Machine Learning Algorithm would be appropriate for this scenario?

Question

I have a PHP/MySQL application that stores symptoms and the appropriate drug. Which machine learning algorithm should I use to predict the drug for a given set of symptoms? Also, what should the format of the training set be?

You might get better answers on http://stats.stackexchange.com/ — Eamon Nerbonne, Feb 21 '13 at 14:09
I took the liberty of removing the MySQL tag because this has nothing to do with databases. — Gordon Linoff, Feb 21 '13 at 14:31
Are you asking about the best algorithm or which software to use? — Gordon Linoff, Feb 21 '13 at 14:31
I asked "What machine learning algorithm should I use". So, it's the best algorithm I'm asking, not software! — tank, Feb 21 '13 at 15:20
Definitely don't use PHP and MySQL, though. Neither is good at *number crunching*. — Has QUIT--Anony-Mousse, Feb 21 '13 at 16:33
It depends. You should check how algorithms such as SVM, HMM and Logistic Regression perform on your data set and compare their accuracies. — kamaci, Feb 23 '13 at 20:27

score 2 · Answer 1 · answered Feb 21 '13 at 14:11

In ML there is no "best solution" for a scenario like this; the real question is almost always "does this method and data satisfy my needs?" So start with a simple ML technique (e.g. decision trees); if that doesn't work, try something more sophisticated. If that still doesn't work, try changing the data, and so on.

score 2 · Answer 2 · answered Feb 21 '13 at 20:28

Your data will end up looking like this:

row_id  symptom_x  symptom_y  degree_of_symptom_z  ...  best_drug
1       false      true       0.8                  ...  drug_x
2       true       null       0.0                  ...  drug_q

And you will use a statistical classifier to learn the best drug based on the symptoms. Then you will feed it new symptoms and it will indicate the best drug.
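As a rough illustration of that training-set format, the sketch below (in Python, since other answers here advise against doing the number crunching in PHP) turns rows like the ones in the table above into numeric vectors. The column names come from the table; the `encode` helper and the idea of appending a "was this symptom recorded?" flag for each column are assumptions made for this example, not something the answer prescribes.

```python
# Hypothetical rows mirroring the table above; None stands for "null" (unknown).
ROWS = [
    {"symptom_x": False, "symptom_y": True, "degree_of_symptom_z": 0.8, "best_drug": "drug_x"},
    {"symptom_x": True, "symptom_y": None, "degree_of_symptom_z": 0.0, "best_drug": "drug_q"},
]
FEATURES = ["symptom_x", "symptom_y", "degree_of_symptom_z"]


def encode(row):
    """Return the feature values followed by one 'known' flag per feature.

    Unknown values become 0.0, and the matching flag is 0.0 as well, so a
    classifier can still tell "not recorded" apart from "recorded as false".
    """
    values, known = [], []
    for name in FEATURES:
        raw = row.get(name)
        known.append(0.0 if raw is None else 1.0)
        values.append(0.0 if raw is None else float(raw))
    return values + known


# X holds one numeric vector per row, y the drug label to be predicted.
X = [encode(r) for r in ROWS]
y = [r["best_drug"] for r in ROWS]

for vector, label in zip(X, y):
    print(vector, "->", label)
```

Most classifier libraries expect this shape: a list of equal-length numeric vectors plus a parallel list of labels.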

There will probably be lots of available symptoms, so the algo needs to be able to handle many columns.

I would start with Support Vector Machine, and also try Logistic Regression.

Check out RapidMiner.

edited Oct 08 '13 at 06:50

answered Feb 21 '13 at 20:28

Neil McGuigan


It is probably safer to assume there is no data for unknown symptoms than to assume they're 'false' (meaning the person is not showing the symptom). With so many unknown variables, a regression may not be the optimal choice here. – Pedro Cordeiro Mar 05 '13 at 11:49
Logistic Regression is for classification, not regression, despite the name. – Neil McGuigan Jun 29 '13 at 08:23

score 1 · Answer 3 · answered Feb 21 '13 at 14:21

I think your best bet is to identify a solid library that integrates well in your environment.

In general:

Good data helps almost always: i.e. preprocess your data to extract features ("summaries") that you think would be useful to a human too.
Avoid useless features: prefer few good features over many tricky ones that might help slightly.
Be aware that there is unlikely to be a magic black box: you'll need to tune your algorithm. Most ML algorithms have several so-called "hyperparameters" that affect how the algorithm works, e.g. learning rate, smoothing, window size, etc.
Since it's not a black box, find some Machine Learning introduction and get at least a basic understanding of how and why these techniques work. It's easy to get complete nonsense from an ML algorithm, so it's important to have at least some idea of how these things work so you can set up your problem appropriately.
Try something really simple first, like nearest-neighbor (you'll need a distance metric). It's possibly enough.
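To make the nearest-neighbor suggestion concrete, here is a minimal sketch in Python. It assumes each record is simply the set of symptoms a patient showed, and it uses Jaccard distance as the distance metric; both the metric and the sample data are assumptions chosen for illustration, and another metric may suit real data better.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0.0 means identical symptom sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_neighbor(train, query):
    """train is a list of (symptom_set, drug) pairs; return the closest drug."""
    _, drug = min(train, key=lambda pair: jaccard_distance(pair[0], query))
    return drug


# Hypothetical training records.
TRAIN = [
    ({"fever", "cough"}, "drug_a"),
    ({"headache", "nausea"}, "drug_b"),
    ({"fever", "sore_throat"}, "drug_a"),
]

print(nearest_neighbor(TRAIN, {"cough", "fever", "fatigue"}))
```

There is nothing to train here: the whole "model" is the stored data plus the distance function, which is why the choice of metric matters so much.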

Though I haven't used one recently, I believe SVMs are still likely to be your best bet if NN isn't good enough. They're not the hip new thing, but they're usually pretty good without too much tuning. But it's almost always better to use a well-tuned weak algorithm (i.e. one with docs you understand and an implementation where you can try lots of hyperparameter variations) than a poorly-tuned strong algorithm, especially if you don't really know what you're doing.
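One way to "try lots of hyperparameter variations" is to score each setting with leave-one-out cross-validation. The Python sketch below does this for the `k` of a k-nearest-neighbors classifier with Hamming distance; the data is made up and the function names are hypothetical, so treat it as an outline of the procedure, not a recommended configuration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two equal-length symptom vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training vectors closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: hamming(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(xs, ys, k):
    """Hold out each sample in turn, predict it from the rest, count hits."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)


# Hypothetical data: four yes/no symptoms per patient, two drugs.
XS = [(1, 1, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0),
      (0, 0, 1, 1), (0, 0, 0, 1), (0, 1, 1, 1)]
YS = ["drug_a", "drug_a", "drug_a", "drug_b", "drug_b", "drug_b"]

scores = {k: leave_one_out_accuracy(XS, YS, k) for k in (1, 3, 5)}
for k, accuracy in scores.items():
    print(f"k={k}: accuracy={accuracy:.2f}")
```

On this toy data, a `k` close to the size of the data set does badly, because the vote is then dominated by whichever class is more common among the remaining samples.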

In other words: keep it simple, and make sure you use lots of common sense in the feature selection phase.

score 0 · Answer 4 · answered Feb 21 '13 at 14:23

Seeing that you will probably have a lot of unknown variables for this problem, I'd suggest approaching it using Bayesian networks.

That would be just a guess based on that brief description and previous experience with medical diagnosis software (such as WebMD and others).

Bayesian networks tend to have higher "precision" when dealing with lots of unknown variables than most other ML algorithms (neural networks, for example, tend to need more accurate data in order to make an accurate regression - and therefore make accurate suggestions).

You'd need to do some research on overfitting prevention, smoothing and other issues you might encounter.
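A full Bayesian network needs a structure relating the symptoms to each other, which the question does not give. As a stand-in, the Python sketch below uses naive Bayes, the simplest Bayesian network (the drug is the only parent of every symptom). It skips unknown symptoms instead of treating them as false, and it applies Laplace smoothing. The data, the function names and the `alpha` parameter are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """rows: list of dicts mapping symptom -> True / False / None (unknown)."""
    class_counts = Counter(labels)
    # seen[drug][symptom] = [times recorded False, times recorded True]
    seen = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for row, drug in zip(rows, labels):
        for symptom, value in row.items():
            if value is not None:  # unknown values add no counts
                seen[drug][symptom][int(value)] += 1
    return class_counts, seen


def predict(model, row, alpha=1.0):
    """Return the drug with the highest smoothed log-posterior."""
    class_counts, seen = model
    total = sum(class_counts.values())
    best_drug, best_score = None, -math.inf
    for drug, count in class_counts.items():
        score = math.log(count / total)
        for symptom, value in row.items():
            if value is None:
                continue  # an unknown symptom contributes no evidence
            counts = seen[drug][symptom]
            # Laplace smoothing keeps unseen combinations from scoring zero.
            score += math.log((counts[int(value)] + alpha) / (sum(counts) + 2 * alpha))
        if score > best_score:
            best_drug, best_score = drug, score
    return best_drug


# Hypothetical records with deliberately missing entries.
ROWS = [
    {"fever": True, "cough": True, "rash": None},
    {"fever": True, "cough": None, "rash": False},
    {"fever": False, "cough": None, "rash": True},
    {"fever": None, "cough": False, "rash": True},
]
LABELS = ["drug_a", "drug_a", "drug_b", "drug_b"]

MODEL = train_naive_bayes(ROWS, LABELS)
print(predict(MODEL, {"fever": True, "cough": None, "rash": None}))
```

The smoothing term `alpha` is one of the knobs that would need tuning on real data.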

Again, this is not a definitive answer. You did not provide any detailed data for me to make a guess based on anything more than assumptions. I'd strongly suggest researching more deeply before deciding.

Can you point a reference for your Bayesian networks precision assumption? — kamaci, Feb 23 '13 at 20:39
I'm speaking from personal knowledge, I don't really have a reference besides ai-class.org (Stanford's class about AI. They also have a Machine Learning class at ml-class.org). — Pedro Cordeiro, Feb 23 '13 at 21:44

score 0 · Answer 5 · answered Feb 21 '13 at 14:33

You will need to try hundreds of algorithms, preprocessings etc. yourself.

There is no general "best algorithm" for anything.

In particular not for data-driven things, when others don't have your data.

So, try out a number of things to see what works for you, because what works for others need not work for you, and the other way round.

Also, experience and expertise are a must in order to get good results.

Has QUIT--Anony-Mousse


Can we say "there is no free lunch" for these kinds of cases? – kamaci Feb 23 '13 at 20:37

score 0 · Answer 6 · answered Mar 05 '13 at 22:12

This is a classification problem: you have labelled data that you want to use to train a model.

As you are going to have some errors, you should decide whether to minimise your false positives or your false negatives, and balance your algorithm to achieve that.
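To see which kind of error a model is making, the false positives and false negatives can be counted per drug and summarised as precision and recall. The Python sketch below does that on made-up predictions; the function name and the data are assumptions for illustration.

```python
def precision_recall(actual, predicted, drug):
    """Treat `drug` as the positive class and count the two error types."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == drug and p == drug for a, p in pairs)
    false_pos = sum(a != drug and p == drug for a, p in pairs)  # wrongly suggested
    false_neg = sum(a == drug and p != drug for a, p in pairs)  # wrongly withheld
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical test-set labels and model outputs.
ACTUAL = ["drug_a", "drug_a", "drug_b", "drug_b", "drug_a"]
PREDICTED = ["drug_a", "drug_b", "drug_b", "drug_a", "drug_a"]

for name in ("drug_a", "drug_b"):
    precision, recall = precision_recall(ACTUAL, PREDICTED, name)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Low precision for a drug means it is suggested too often (false positives); low recall means it is missed when it should have been suggested (false negatives).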

You can use a simple decision tree and see how it performs, using a test set such as some real prescriptions from doctors.

Note that your prescription might need more than one drug or none.

One problem you should consider is that taking some drugs rules out taking others, and that the patient may have allergies. For that reason I suggest having a look at http://en.wikipedia.org/wiki/Association_rule_learning and Prolog.
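Independently of the learning method, such constraints can be applied as a rule-based filter on top of whatever ranking a classifier produces. The Python sketch below is one hedged way to do that. It is not association rule learning or Prolog, and the interaction table, drug names and function name are all hypothetical.

```python
# Hypothetical table of drug pairs that must not be combined.
INTERACTIONS = {
    frozenset({"drug_a", "drug_c"}),
    frozenset({"drug_b", "drug_d"}),
}


def first_safe_drug(ranked, current_drugs, allergies):
    """Return the best-ranked drug that passes both checks, or None.

    ranked        -- candidate drugs, best first (e.g. from a classifier)
    current_drugs -- drugs the patient already takes
    allergies     -- drugs the patient is allergic to
    """
    for drug in ranked:
        if drug in allergies:
            continue
        if any(frozenset({drug, other}) in INTERACTIONS for other in current_drugs):
            continue
        return drug
    return None  # nothing safe to suggest; leave the decision to a human


print(first_safe_drug(["drug_a", "drug_b"], ["drug_c"], set()))
```

Returning `None` also covers the case mentioned above where no drug should be prescribed at all.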

score 0 · Answer 7 · answered Oct 03 '18 at 15:44

Try k-nearest neighbors; I think this is a classification problem. Your prescription might need more than one drug, and another problem is that the model might not always be accurate, because it will be asked to decide on cases it was not trained on. You need a very detailed dataset.

The example below is based on ml-idea (machine learning idea), GitHub: ML-Idea.

There is no perfect algorithm, so prepare your data carefully: good data counts.

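```php
<?php
// KNearestNeighbors comes from the ML-Idea library; load it as that library documents.

// Symptom codes:
//   1 = 'Symptom 1'
//   2 = 'Symptom 2'
//   3 = 'Symptom 3'
//   4 = 'Symptom 4'

// Each sample is a pair of symptom codes; each label is the drug given for it.
$samples = [[1, 3], [1, 4], [2, 4], [3, 1], [4, 1], [4, 2]];
$labels = ['drug a', 'drug a', 'drug x', 'drug x', 'drug a', 'drug x'];

$classifier = new KNearestNeighbors(6, true);
$classifier->train($samples, $labels);

// Predict the drug for a patient showing symptoms 2 and 1.
$data = $classifier->predict([2, 1]);

echo "<pre>";
print_r($data);
echo "</pre>";
```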
