
I am a computer science student and I have to choose the topic of my future research work. I really want to solve scientific problems in chemistry (or maybe biology) using computers. I also have a strong interest in machine learning.

I have been searching the internet for a while and have found some references on this kind of problem. But, unfortunately, that material is not enough for me.

So I am interested in the community's recommendations of particular resources that present the application of an ML technique to a problem in chemistry--e.g., a journal article or a good book describing typical (or new) problems in chemistry being solved "in silico".

mr_borsch

1 Answer


I should think that chemistry, as much as any domain, would have the richest supply of problems particularly suited to ML. The class of problems I have in mind is QSAR (quantitative structure-activity relationships), both for naturally occurring compounds and prospectively, e.g., in drug design.

Perhaps have a look at AZOrange--an entire ML library built for the sole purpose of solving chemistry problems with ML techniques. In particular, AZOrange is a re-implementation of the highly regarded, GUI-driven ML library Orange, specifically for the solution of QSAR problems.
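
To give a flavor of what this looks like in practice, here is a minimal sketch of a QSAR-style classification workflow in plain Orange's scripting layer (written against the Orange 3 API; AZOrange itself wraps an older Orange release and has its own interface, so treat this as illustrative only). The file name descriptors.tab is a hypothetical tab-separated table of molecular descriptors with an activity class column as the target:

```python
# Minimal sketch of a QSAR-style workflow in Orange's scripting layer.
# "descriptors.tab" is a hypothetical file: rows are compounds, columns
# are molecular descriptors, with an activity class marked as the target.
import Orange

data = Orange.data.Table("descriptors.tab")
learner = Orange.classification.RandomForestLearner()

# 5-fold cross-validation; CA() reports classification accuracy per learner.
results = Orange.evaluation.CrossValidation(data, [learner], k=5)
print("accuracy:", Orange.evaluation.CA(results))
```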

In addition, here are two particularly good articles--both published within the last year, and in both, ML is at the heart (each link goes to the article's page on the Journal of Cheminformatics site and includes the full text):

AZOrange-High performance open source machine learning for QSAR modeling in a graphical programming environment.

2D-Qsar for 450 types of amino acid induction peptides with a novel substructure pair descriptor having wider scope

It seems to me that the general nature of QSAR problems makes them ideal for study by ML:

  • a highly non-linear relationship between the explanatory variables (i.e., "features") and the response variable (i.e., "class labels" or "regression estimates")--see the sketch after this list

  • at least for the larger molecules, the structure-activity relationships are sufficiently complex that they are several generations away from solution by analytical means, so any hope of accurately predicting these relationships rests on empirical techniques

  • oceans of training data pairing some form of instrument-produced data (e.g., protein structure determined by X-ray crystallography) with laboratory data recording the chemical behavior of that protein (e.g., reaction kinetics)
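
To make the first point concrete, here is a small sketch (scikit-learn, with random stand-in data since no real descriptor set is at hand here) of an empirical, non-linear learner recovering a deliberately non-linear feature-response relationship:

```python
# Sketch: a non-linear empirical learner on synthetic "descriptor" data.
# The features are random stand-ins for real molecular descriptors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # 500 "compounds", 20 "descriptors"
y = np.sin(X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=500)  # non-linear response

model = RandomForestRegressor(n_estimators=200, random_state=0)
print("cross-validated R^2:",
      cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```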


So here are a couple of suggestions for interesting and current areas of research at the ML-chemistry interface:

QSAR prediction applying current "best practices"; for instance, the technique that won the Netflix Prize (awarded September 2009) was not based on a state-of-the-art ML algorithm; instead it used kNN. The interesting aspects of the winning technique are:

  • the data imputation technique--the technique for re-generating the data rows having one or more features missing; the particular technique for solving this sparsity problem is usually referred to as Positive Maximum Margin Matrix Factorization (or Non-Negative Maximum Margin Matrix Factorization). Perhaps there are interesting QSAR problems which were deemed insoluble by ML techniques because of poor data quality, in particular sparsity. Armed with PMMMF, these might be good problems to revisit (a toy imputation sketch follows this list).

  • algorithm combination--the rubric of post-processing techniques that involve combining the results of two or more classifiers was generally known to ML practitioners prior to the Netflix Prize, but in fact these techniques were rarely used. The most widely used of these techniques are AdaBoost, Gradient Boosting, and Bagging (bootstrap aggregation). I wonder if there are some QSAR problems for which the state-of-the-art ML techniques have not quite provided the resolution or prediction accuracy required by the problem context; if so, it would certainly be interesting to know whether those results could be improved by combining classifiers. Aside from their often dramatic improvement in prediction accuracy, an additional advantage of these techniques is that many of them are very simple to implement. For instance, boosting (the idea behind AdaBoost) works like this: train your classifier for some number of epochs and look at the results; identify those data points in your training data that the classifier consistently predicted incorrectly; apply a higher weight to those training instances (i.e., penalize your classifier more heavily for an incorrect prediction on them) and re-train your classifier with this re-weighted data set. (Bagging, by contrast, trains each classifier on a bootstrap re-sample of the training data and averages the predictions.) See the scikit-learn sketch at the end of this answer.
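
On the imputation point, here is a toy sketch of the underlying idea--fill in the missing entries of a sparse matrix by fitting a non-negative low-rank factorization to the observed entries only. This is plain masked gradient descent in numpy, not the max-margin formulation itself, so treat it purely as an illustration:

```python
# Toy sketch: impute missing entries via non-negative low-rank completion.
# Fit X ~= W @ H on the observed entries only, then read off the rest.
import numpy as np

rng = np.random.default_rng(0)
n, m, rank = 100, 30, 5
X_true = rng.random((n, rank)) @ rng.random((rank, m))   # ground truth
mask = rng.random((n, m)) < 0.7                          # ~70% observed

W, H = rng.random((n, rank)), rng.random((rank, m))
lr = 0.01
for _ in range(2000):
    R = mask * (W @ H - X_true)       # residual on observed entries only
    W -= lr * R @ H.T
    H -= lr * W.T @ R
    W, H = np.clip(W, 0, None), np.clip(H, 0, None)      # non-negativity

X_filled = W @ H                      # imputed values for the missing entries
print("mean abs error on held-out entries:",
      np.abs((X_filled - X_true)[~mask]).mean())
```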

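And on the combination point, all three combiners named above ship with scikit-learn, so a first comparison is only a few lines (synthetic stand-in data again):

```python
# Sketch: compare AdaBoost, Bagging, and Gradient Boosting by
# cross-validated accuracy on a synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

for model in (AdaBoostClassifier(n_estimators=100, random_state=0),
              BaggingClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(n_estimators=100, random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
```
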
doug