By "bulk processed" I mean a static data set of facts (e.g. a CSV) processed all at once to extract knowledge. By "online" I mean one that uses a live backing store: facts are added as they happen ("X buys Y") and queries run against this live data ("what would you recommend to a person who is looking at Y right now?").
I had (mis)used the term real-time, but I don't mean that results must arrive within a fixed time bound. ('''Edit: replaced real-time with online above''')
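To make the distinction concrete, here is a minimal sketch (hypothetical names, not any real package's API) of a toy item-to-item co-occurrence recommender. The "online" path ingests one fact at a time and can be queried between facts; "bulk" is the same logic driven from a static file all at once:

```python
from collections import defaultdict

class OnlineRecommender:
    """Toy co-occurrence recommender, only to illustrate online vs. bulk."""

    def __init__(self):
        self.baskets = defaultdict(set)  # user -> items bought so far
        # item -> {other item -> times bought together}
        self.cooccur = defaultdict(lambda: defaultdict(int))

    def add_fact(self, user, item):
        """Online: ingest a single fact ("X buys Y") as it happens."""
        for other in self.baskets[user]:
            self.cooccur[item][other] += 1
            self.cooccur[other][item] += 1
        self.baskets[user].add(item)

    def recommend(self, item, n=3):
        """Query the live data: items most often bought alongside `item`."""
        ranked = sorted(self.cooccur[item].items(), key=lambda kv: -kv[1])
        return [other for other, _ in ranked[:n]]

def bulk_build(rows):
    """Bulk: the same facts, but read from a static data set (e.g. CSV
    rows of (user, item)) and processed in one pass."""
    rec = OnlineRecommender()
    for user, item in rows:
        rec.add_fact(user, item)
    return rec

# Online usage: queries interleave with incoming facts.
rec = OnlineRecommender()
rec.add_fact("alice", "book")
rec.add_fact("alice", "lamp")
rec.add_fact("bob", "book")
print(rec.recommend("book"))  # answered against whatever has arrived so far
```

The algorithm is identical in both cases; the difference I am asking about is purely operational: whether the package can keep such state live and answer queries against it, or only rebuilds it from a snapshot on each run.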
I have in mind a recommendation engine that uses live data. However, all the online resources (such as SO questions) I have encountered make no distinction between real-time and bulk-processing data-mining packages, so I had to research each package individually:
- Carrot2, which reads from Lucene/Solr and other live data sets (online)
- Knime, which does scheduled execution on static files (bulk)
- Mahout, which runs on Hadoop (and, in future, on the Pregel-based Giraph) (online?)
- a commercial package that integrates with Cassandra (online?)
What are the online data-mining packages?
Is there a reason the literature makes no distinction between online and bulk-processing packages? Or is all practical data mining actually bulk in nature?