I should think that chemistry, as much as any domain, has a rich supply of problems particularly suited to ML. The class of problems I have in mind is QSAR (quantitative structure-activity relationships), both for naturally occurring compounds and prospectively, e.g., in drug design.
Perhaps have a look at AZOrange--an entire ML library built for the sole purpose of solving chemistry problems with ML techniques. In particular, AZOrange is a re-implementation of the highly regarded GUI-driven ML library Orange, specifically for solving QSAR problems.
In addition, here are two particularly good articles--both published within the last year, and in both ML is at the heart (each link goes to the article's page on the Journal of Cheminformatics site, which includes the full text):
AZOrange-High performance open source machine learning for QSAR modeling in a graphical programming environment.
2D-Qsar for 450 types of amino acid induction peptides with a novel substructure pair descriptor having wider scope
It seems to me that the general nature of QSAR problems makes them ideal for study by ML (a minimal pipeline sketch follows the list):
a highly non-linear relationship between the explanatory variables (i.e., "features") and the response variable (i.e., "class labels" or "regression estimates")
at least for the larger molecules, the structure-activity relationships are sufficiently complex that they are several generations from solution by analytical means, so any accurate prediction of these relationships can realistically only come from empirical techniques
oceans of training data pairing some form of instrument-produced data (e.g., protein structures determined by X-ray crystallography) with laboratory data recording the chemical behavior of those proteins (e.g., reaction kinetics)
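To make that concrete, here is a minimal sketch of such a pipeline. Everything in it is an illustration rather than part of the work cited above: it assumes RDKit and scikit-learn are installed, and the SMILES strings, descriptor choices, and activity values are made up.

```python
# Minimal QSAR sketch: physicochemical descriptors -> non-linear regressor.
# Assumptions: RDKit and scikit-learn installed; molecules and activity
# values below are illustrative, not a real QSAR data set.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

def featurize(smiles):
    """Turn a SMILES string into a small vector of molecular descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol),
            Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol),
            Descriptors.NumRotatableBonds(mol)]

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
activity = [0.3, 1.2, 2.4, 0.9]  # hypothetical response values

X = np.array([featurize(s) for s in smiles])
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, activity)
print(model.predict(X[:1]))  # predicted activity for the first molecule
```

A random forest is a reasonable first choice here precisely because of the first point above: it captures highly non-linear feature-response relationships without requiring any analytical model of the underlying chemistry.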
So here are a couple of suggestions for interesting and current areas of research at the ML-chemistry interface:
QSAR prediction applying current "best practices"; for instance, the technique that won the Netflix Prize (awarded September 2009) was not based on a state-of-the-art ML algorithm; instead, it used kNN. The interesting aspects of the winning technique are:
the data imputation technique--the method for reconstructing data rows with one or more missing features; the particular technique for solving this sparsity problem is usually referred to as Positive Maximum Margin Matrix Factorization (or Non-Negative Maximum Margin Matrix Factorization). Perhaps there are interesting QSAR problems that were deemed insoluble by ML techniques because of poor data quality, in particular sparsity. Armed with PMMMF, these might be good problems to revisit (see the imputation sketch after this list).
algorithm combination--the family of post-processing techniques that combine the results of two or more classifiers was generally known to ML practitioners prior to the Netflix Prize, but in fact these techniques were rarely used. The most widely used of them are AdaBoost, Gradient Boosting, and Bagging (bootstrap aggregation). I wonder if there are some QSAR problems for which the state-of-the-art ML techniques have not quite provided the resolution or prediction accuracy required by the problem context; if so, it would certainly be interesting to know whether those results could be improved by combining classifiers. Aside from their often dramatic improvement in prediction accuracy, an additional advantage of these techniques is that many of them are very simple to implement. For instance, boosting works like this: train your classifier for some number of epochs and look at the results; identify the data points in your training set that your classifier resolved most poorly--i.e., the points it consistently predicted incorrectly over many epochs; apply a higher weight to those training instances (i.e., penalize your classifier more heavily for an incorrect prediction on them) and re-train your classifier with this "new" data set. Bagging is simpler still: train several copies of your classifier, each on a bootstrap resample of the training data, and combine their predictions by majority vote or averaging (a sketch of both appears below).