I work on a project that sits on a large-ish pile of raw data, aggregates from which are used to power a public-facing informational site (some simple aggregates like various totals and top-tens of totals, and some somewhat-more-complicated aggregates). At present we update it once every few months, which involves adding new data, and possibly updating or deleting existing records, and re-running all the aggregation off-line, after which new aggregates are deployed to production.
We're interested in ramping up the frequency of updates, such that re-aggregating everything from scratch isn't practical, so we'd like to do rolling aggregation that updates the existing aggregates to reflect new, changed, or deleted records.
CouchDB's MapReduce implementation offers roughly the facility that I'm looking for: it stores the intermediate state of MapReduce tasks in a big B-tree, where the output of maps is at the leaves, and reduce operations gradually join branches together. New, updated, or deleted records cause subtrees to be marked as dirty and recomputed, but only the relevant portions of the reduce tree need to be touched, and intermediate results from non-dirty subtrees can be re-used as is.
For a variety of reasons, though (uncertainty about CouchDB's future, lack of convenient support for non-MR one-off queries, current SQL-heavy implementation, etc.), we'd prefer not to use CouchDB for this project, so I'm looking for other implementations of this kind of tree-ish incremental map-reduce strategy (possibly, but not necessarily, atop Hadoop or similar).
To pre-empt some possible responses:
- I'm aware of MongoDB's supposed support for incremental MapReduce; it's not the real thing, in my opinion, because it really only works well for additions to the dataset, not updates or deletes.
- I'm also aware of the Incoop paper. This describes exactly what I want, but I don't think they've made their implementation public.