24

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.

For people with experience in these fields: what exactly draws the line between them?

petezurich
Boris Yeltz

4 Answers

27

This is just the view of one person (formally trained in ML); others might see things quite differently.

Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.

Of the terms you mentioned, "Machine Learning" is the one most used by academic departments to describe their curricula and research programs, as well as the term most used in academic journals and conference proceedings. ML is clearly the least context-dependent of the three.

Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications often are, but that's not a formal requirement. In addition, the term Data Mining usually seems to refer to applying some process flow to big data (i.e., > 2 GB) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.

So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an Infrastructure-Algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval, but it's only one source, and IR doesn't depend on ML. For instance, a particular IR project might be storage and rapid retrieval of fully-indexed data responsive to a user's search query, the crux of which is optimizing performance of the data flow, i.e., the round-trip from query to delivering the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps), which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc.) on the variables (columns).
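To make the "IR without ML" case concrete, here is a minimal sketch of an inverted index answering boolean AND queries. Nothing in it is learned or predicted; the work is all in the lookup structure. The function names `build_index` and `search`, the whitespace tokenizer, and the three toy documents are assumptions made up for this illustration, not part of any particular IR system.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Start from the postings of the first token, then intersect the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical toy corpus, only for illustration.
docs = {
    1: "machine learning generalizes from training data",
    2: "data mining discovers hidden patterns in data",
    3: "information retrieval finds documents matching a query",
}
index = build_index(docs)

print(sorted(search(index, "data")))         # [1, 2]
print(sorted(search(index, "hidden data")))  # [2]
```

In a real system the effort would go into making this round-trip fast at scale (index layout, caching, sharding), which is the performance concern described above rather than a prediction problem.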

Lastly, consider the Netflix Prize. This competition was directed solely to Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: accuracy of the predictions returned by the algorithm. Imagine if the 'Netflix Prize' were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately assess the algorithm's performance in the actual commercial setting--so, for instance, overall execution speed (how quickly the recommendations are delivered to the user) would probably be considered along with accuracy.
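A rough sketch of that difference in evaluation: the first function scores accuracy only, using root-mean-square error as the accuracy measure, while the second also reports how long each prediction takes. The names `rmse` and `evaluate`, the constant-guess predictor, and the sample ratings are hypothetical and chosen only for illustration.

```python
import math
import time


def rmse(predicted, actual):
    """Root-mean-square error between two equally long rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def evaluate(predict, inputs, actual):
    """Score a predictor on accuracy and on average time per prediction."""
    start = time.perf_counter()
    predicted = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return {
        "rmse": rmse(predicted, actual),
        "seconds_per_prediction": elapsed / len(inputs),
    }


# Hypothetical predictor: always guess a rating of 4.
report = evaluate(lambda user_movie: 4.0, ["a", "b", "c"], [3.0, 5.0, 4.0])
print(report["rmse"])  # sqrt((1 + 1 + 0) / 3), roughly 0.816
```

An accuracy-only contest would rank entries on `rmse` alone; the broader, process-oriented view would also look at `seconds_per_prediction` (and likely other operational measures).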

The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.

Michael Plazzer
doug
  • 1
    (+1) I also like the distinction being made by Radford Neal: "Many machine learning problems have a large number of variables — maybe 10,000, or 100,000, or more (eg, genes, pixels). Data mining applications often involve very large numbers of cases — sometimes millions." ([sta414](http://www.utstat.utoronto.ca/~radford/sta414/), [week1](http://www.utstat.utoronto.ca/~radford/sta414/week1a.pdf)). – chl Aug 18 '11 at 12:45
  • Data mining also suffers from being a total buzzword. Today, computing the mean value of a "big data" data set is already considered "data mining" by some, unfortunately. – Has QUIT--Anony-Mousse Mar 09 '12 at 07:14
  • 2
    He wears a tie to work huh. That gives me a very good idea of what he might be doing :-) – smartnut007 Mar 21 '12 at 01:13
20

I'd try to draw the line as follows:

Information retrieval is about finding something that already is part of your data, as fast as possible.

Machine learning is a set of techniques for generalizing existing knowledge to new data, as accurately as possible.

Data mining is primarily about discovering something hidden in your data that you did not know before, as "new" as possible.

They intersect and often use techniques of one another. DM and IR both use index structures to accelerate processes. DM uses a lot of ML techniques; for example, a pattern in the data set that is useful for generalization might itself be new knowledge.

They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
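As a rough illustration of those three intentions on one toy data set (a sketch only: the baskets, labels, and function names are made up for this example, and 1-nearest-neighbour with Jaccard similarity and simple pair counting are just stand-ins for "an ML method" and "a DM method"):

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets and a label for each one.
baskets = {
    "b1": {"bread", "butter", "milk"},
    "b2": {"bread", "butter"},
    "b3": {"beer", "chips"},
    "b4": {"bread", "butter", "beer"},
}
labels = {"b1": "grocery", "b2": "grocery", "b3": "snack", "b4": "grocery"}


def retrieve(item):
    """Find data: which existing baskets contain this item?"""
    return sorted(b for b, items in baskets.items() if item in items)


def jaccard(a, b):
    """Similarity of two non-empty item sets: shared items over all items."""
    return len(a & b) / len(a | b)


def classify(new_items):
    """Generalize to new data: label an unseen basket by its nearest neighbour."""
    nearest = max(baskets, key=lambda b: jaccard(baskets[b], new_items))
    return labels[nearest]


def most_common_pair():
    """Find a new property of existing data: the most frequent item pair."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts.most_common(1)[0]


print(retrieve("beer"))             # ['b3', 'b4']
print(classify({"chips", "soda"}))  # snack
print(most_common_pair())           # (('bread', 'butter'), 3)
```

The same data serves all three; only the question being asked changes.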

Has QUIT--Anony-Mousse
  • I do not agree with your view on machine learning. Your view is more focused on supervised learning (for which your statement would be correct). Unsupervised learning, however, is about finding patterns that one does not know about, hence with **no prior existing knowledge**. – basickarl Aug 11 '15 at 15:34
  • Unsupervised learning is an oxymoron. Unsupervised methods are DM, not ML. They don't learn, how could they, so don't squeeze them into the learning view at all. – Has QUIT--Anony-Mousse Aug 11 '15 at 19:30
  • I believe you are referring to storage, that unsupervised methods do not **remember** after they have executed. I do agree, the terminology is flawed in AI, but as it currently stands, unsupervised is under machine learning, so I do not agree with your post still. Also DM does not necessarily use unsupervised learning methods (although it mostly does) so saying unsupervised learning is equal to DM is indeed very wrong. – basickarl Aug 11 '15 at 19:42
  • Define "learning" if we want to get anywhere here. To me, "learning" is the generalization from training data. I don't see this happen e.g. in clustering - there is no training data. – Has QUIT--Anony-Mousse Aug 11 '15 at 19:59
  • Personally I use the English meaning of the word, "The acquisition of knowledge or skills through study, experience, or being taught.". Supervised referring to being taught via learning data and unsupervised via study/experience therefore it learns. So I guess our different views arise from the interpretation of the word learning. – basickarl Aug 11 '15 at 21:53
  • To some extent. But also because I find that the ML point of view just fails to understand most unsupervised methods, because of the obsession with optimizing a particular quality criterion. Instead of telling people it is the "same, but different", it would help people a lot to see it is an orthogonal approach: **discovery instead of learning**. – Has QUIT--Anony-Mousse Aug 11 '15 at 22:14
4

You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.

I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, and ML is somewhere in between.

Community
dimatura
0

Data mining is about discovering hidden patterns or unknown knowledge, which can be used for decision making by people.

Machine learning is about learning a model to classify new objects.

Razan Paul