By "bulk processed" I mean a static data set of facts (e.g. a CSV) processed all at once to extract knowledge. By "online" I mean one that uses a live backing store: facts are added as they happen ("X buys Y") and queries run against this live data ("what would you recommend to a person who is looking at Y right now?").
I had (mis)used the term real-time, but I don't mean that results must arrive within a fixed time bound. ('''Edit: replaced real-time with online above''')
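To make the distinction concrete, here is a minimal sketch (hypothetical names, not any real package's API) of a toy item-to-item co-occurrence recommender. The "online" path ingests one fact at a time and can be queried between facts; "bulk" is the same logic driven from a static file all at once:

```python
from collections import defaultdict

class OnlineRecommender:
    """Toy co-occurrence recommender, only to illustrate online vs. bulk."""

    def __init__(self):
        self.baskets = defaultdict(set)  # user -> items bought so far
        # item -> {other item -> times bought together}
        self.cooccur = defaultdict(lambda: defaultdict(int))

    def add_fact(self, user, item):
        """Online: ingest a single fact ("X buys Y") as it happens."""
        for other in self.baskets[user]:
            self.cooccur[item][other] += 1
            self.cooccur[other][item] += 1
        self.baskets[user].add(item)

    def recommend(self, item, n=3):
        """Query the live data: items most often bought alongside `item`."""
        ranked = sorted(self.cooccur[item].items(), key=lambda kv: -kv[1])
        return [other for other, _ in ranked[:n]]

def bulk_build(rows):
    """Bulk: the same facts, but read from a static data set (e.g. CSV
    rows of (user, item)) and processed in one pass."""
    rec = OnlineRecommender()
    for user, item in rows:
        rec.add_fact(user, item)
    return rec

# Online usage: queries interleave with incoming facts.
rec = OnlineRecommender()
rec.add_fact("alice", "book")
rec.add_fact("alice", "lamp")
rec.add_fact("bob", "book")
print(rec.recommend("book"))  # answered against whatever has arrived so far
```

The algorithm is identical in both cases; the difference I am asking about is purely operational: whether the package can keep such state live and answer queries against it, or only rebuilds it from a snapshot on each run.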
I have in mind a recommendation engine that uses live data. However, all the online resources (such as SO questions) I have encountered make no distinction between real-time and bulk-processing data-mining packages, so I had to research each package individually:
- Carrot2, which reads from Lucene/Solr and other live data sets (online)
- Knime, which does scheduled execution on static files (bulk)
- Mahout, which runs on Hadoop (and, in future, on the Pregel-based Giraph) (online?)
- a commercial package that integrates with Cassandra (online?)
What are the online data-mining packages?
Is there a reason the literature makes no distinction between online and bulk-processing packages? Or is all practical data mining actually bulk in nature?