
I work on a project that sits on a large-ish pile of raw data, aggregates from which are used to power a public-facing informational site (some simple aggregates like various totals and top tens of those totals, and some somewhat more complicated ones). At present we update it once every few months: we add new data, possibly update or delete existing records, and re-run all the aggregation off-line, after which the new aggregates are deployed to production.

We're interested in ramping up the frequency of updates, such that re-aggregating everything from scratch isn't practical, so we'd like to do rolling aggregation that updates the existing aggregates to reflect new, changed, or deleted records.

CouchDB's MapReduce implementation offers roughly the facility that I'm looking for: it stores the intermediate state of MapReduce tasks in a big B-tree, where the output of maps is at the leaves, and reduce operations gradually join branches together. New, updated, or deleted records cause subtrees to be marked as dirty and recomputed, but only the relevant portions of the reduce tree need to be touched, and intermediate results from non-dirty subtrees can be re-used as is.
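To make the tree idea concrete, here is a minimal, self-contained sketch (plain Python, not CouchDB's actual implementation; the record shape and field names are made up) of the kind of structure I mean: per-record map output sits at the leaves, each internal node caches the reduce of its children, and changing or deleting one record only recomputes the nodes on the path from that leaf to the root rather than the whole dataset.

```python
# Illustrative sketch of tree-shaped incremental reduce (segment-tree style).
class IncrementalReduceTree:
    def __init__(self, records, map_fn, reduce_fn, identity):
        self.map_fn = map_fn          # record -> value (the "map" step)
        self.reduce_fn = reduce_fn    # (value, value) -> value, associative
        self.identity = identity      # neutral element, e.g. 0 for sums
        self.size = len(records)
        # Array-backed binary tree: leaves live at indices [size, 2*size).
        self.tree = [identity] * (2 * self.size)
        for i, rec in enumerate(records):
            self.tree[self.size + i] = map_fn(rec)
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = reduce_fn(self.tree[2 * i], self.tree[2 * i + 1])

    def _bubble_up(self, i):
        # Re-reduce only the ancestors of a changed leaf: O(log n) work.
        i //= 2
        while i >= 1:
            self.tree[i] = self.reduce_fn(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def update(self, index, record):
        """Re-map one changed record and re-reduce only its ancestors."""
        i = self.size + index
        self.tree[i] = self.map_fn(record)
        self._bubble_up(i)

    def delete(self, index):
        """Deletion is just an update to the identity element."""
        i = self.size + index
        self.tree[i] = self.identity
        self._bubble_up(i)

    def aggregate(self):
        return self.tree[1]


if __name__ == "__main__":
    # Toy aggregate: total sales across records.
    records = [{"sales": 10}, {"sales": 20}, {"sales": 5}, {"sales": 7}]
    tree = IncrementalReduceTree(records, lambda r: r["sales"],
                                 lambda a, b: a + b, identity=0)
    print(tree.aggregate())        # 42
    tree.update(1, {"sales": 25})  # touches O(log n) nodes, not the whole set
    print(tree.aggregate())        # 47
    tree.delete(3)
    print(tree.aggregate())        # 40
```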

For a variety of reasons, though (uncertainty about CouchDB's future, lack of convenient support for non-MR one-off queries, current SQL-heavy implementation, etc.), we'd prefer not to use CouchDB for this project, so I'm looking for other implementations of this kind of tree-ish incremental map-reduce strategy (possibly, but not necessarily, atop Hadoop or similar).

To pre-empt some possible responses:

  • I'm aware of MongoDB's supposed support for incremental MapReduce; it's not the real thing, in my opinion, because it really only works well for additions to the dataset, not updates or deletes (see the sketch after this list).
  • I'm also aware of the Incoop paper. This describes exactly what I want, but I don't think they've made their implementation public.
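To illustrate the MongoDB point above, here is a toy sketch (plain Python, no MongoDB involved; keys and values are made up) of why merging the output of a job over only new or changed records into existing aggregates is fine for inserts but silently wrong for updates and deletes: the old contribution of a changed record is never subtracted out.

```python
# Merge-only "incremental" aggregation: re-run over the delta, reduce into
# the stored totals. This is roughly the style of incremental MapReduce
# that only handles additions correctly.
def merge_only_rerun(stored_totals, new_or_changed_records):
    for rec in new_or_changed_records:
        key = rec["key"]
        stored_totals[key] = stored_totals.get(key, 0) + rec["value"]
    return stored_totals

# Initial full run over three records.
totals = {}
merge_only_rerun(totals, [{"key": "a", "value": 10},
                          {"key": "a", "value": 5},
                          {"key": "b", "value": 7}])
print(totals)  # {'a': 15, 'b': 7} -- correct

# A pure insert works fine:
merge_only_rerun(totals, [{"key": "b", "value": 3}])
print(totals)  # {'a': 15, 'b': 10} -- still correct

# But "updating" the first record from 10 to 2 just merges the new value in;
# the old 10 is never removed, so the aggregate drifts:
merge_only_rerun(totals, [{"key": "a", "value": 2}])
print(totals)  # {'a': 17, 'b': 10} -- should be {'a': 7, 'b': 10}
```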
Andrew Pendleton
  • How frequently are you planning to update your data? How widespread are your updates? Does your data lend itself to be partitioned in some natural way into chunks of comparable size, say by calendar date? – Olaf Jun 19 '13 at 19:23
  • Why don't you have a look at Apache HBase? – subZero Jun 29 '14 at 18:47
  • Why are you only looking for a tree-ish strategy? – vishnu viswanath Jan 21 '15 at 04:01
  • @sonic I'm not necessarily. What I want, though, is for changes in individual records to only require the recomputation of aggregates that were affected by those records (or possibly those and a few others collaterally, but not the entire dataset). Trees seem like a logical way to accomplish that. – Andrew Pendleton Apr 01 '15 at 21:41
  • Can the current map-reduce solution accept a) the set of changed files and b) old aggregate values from the previous run, along with metadata about the previous run? Based on these two things, can we calculate the new aggregate just by going through the set of changed files? – Joydip Datta Aug 25 '15 at 09:27
  • As you have Hadoop, have you considered simply using Hive? I think it should be able to do the things that you mention (but I'm not sure whether it will properly do the things that you did not mention specifically). – Dennis Jaheruddin Aug 01 '17 at 08:19

1 Answer


The first thing that comes to mind for me is still Hive, as it now has features such as materialized views, which could hold your aggregates and be selectively invalidated when the underlying data changes.
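As a rough sketch of what that could look like (assuming Hive 3+ with transactional source tables; the table, column, and database names are made up, and pyhive is just one convenient way to submit the statements):

```python
# Sketch of the Hive materialized-view approach, under the assumptions above.
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, database="default")
cur = conn.cursor()

# One-off: define a materialized view that holds the aggregate.
cur.execute("""
    CREATE MATERIALIZED VIEW region_totals AS
    SELECT region, SUM(amount) AS total, COUNT(*) AS n
    FROM raw_events
    GROUP BY region
""")

# After new or changed rows land in raw_events, ask Hive to refresh the view.
# With transactional source tables Hive can often rebuild incrementally
# (scanning only the delta) instead of recomputing everything from scratch.
cur.execute("ALTER MATERIALIZED VIEW region_totals REBUILD")

# Queries against the base table can then be rewritten to use the view:
cur.execute("SET hive.materializedview.rewriting=true")
cur.execute("SELECT region, SUM(amount) FROM raw_events GROUP BY region")
print(cur.fetchall())
```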

Though it is not too new, Uber published a fairly timeless article on how they approached the challenge of incremental updates, and even the solutions it only references in passing may be very interesting for this use case. The key takeaways of that article:

  1. Getting really specific about your actual latency needs can save you tons of money.
  2. Hadoop can solve a lot more problems by employing primitives to support incremental processing.
  3. Unified architectures (code + infrastructure) are the way of the future.

Full disclosure: I am an employee of Cloudera, a provider of Big Data platforms built around Hadoop that include various tools referenced in the article, as well as Hive, which I referred to directly.

Dennis Jaheruddin