Why is CouchDB's reduce_limit enabled by default? (Is it better to approximate SQL JOINS in MapReduce views or List views?)

Question

I'm using CouchDB, and I want to make better use of MapReduce when querying data.

My exact use case is the following:

I have many surveys. Each survey has a meterNumber, meterReading, and meterReadingDate, for example:

{
  meterNumber: 1,
  meterReading: 2050,
  meterReadingDate: 1480000000000
}

I then use a Map function to produce readings by meterNumber. Many keys are repeated (the same meter read on different dates). For example:

[
  [meterNumber, {reading: xxx, readingDate: xxx}],
  [meterNumber, {reading: xxx, readingDate: xxx}],
  [meterNumber, {reading: xxx, readingDate: xxx}],
  etc
]

I then group these before sending them to the reduce function, and the reduce function should then actually EXPAND the set of values. That is, I want this:

[
  [meterNumber, [{reading:xxx, readingDate: xxx}, {reading:xxx, readingDate: xxx}, {reading:xxx, readingDate: xxx}]],
  [meterNumber, [{reading:xxx, readingDate: xxx}, {reading:xxx, readingDate: xxx}, {reading:xxx, readingDate: xxx}]],
  [meterNumber, [{reading:xxx, readingDate: xxx}, {reading:xxx, readingDate: xxx}, {reading:xxx, readingDate: xxx}]],
  etc
]

To run this MapReduce view on CouchDB I had to allow for this type of result set (Couchdb - Is it possible to deactivate the reduce_overflow_error error).

This suggests to me that I may run into performance problems with large result sets. Is this the case? Why would you have to specifically enable this setting on CouchDB?

*** EDIT

The accepted answer below pointed out to me that what I was doing in MapReduce was also possible (and better) using lists. Here is another good Stack Overflow answer on the same topic: Best way to do one-to-many "JOIN" in CouchDB

*** EDIT

Here is a reference from the CouchDB documentation: http://guide.couchdb.org/draft/transforming.html

score 3 · Accepted Answer · edited Jun 25 '16 at 20:34

A reduce function is intended to reduce values associated with given keys.

CouchDB's reduce_limit exists to detect badly designed reduce functions, and concatenating values in a reduce function is exactly that. But don't panic: any newcomer to CouchDB would make the same mistake.

The problem with concatenating values in a reduce function is that:

it is totally unnecessary (if you need the whole list, just use a map function on its own),
it is very inefficient: your index will get bigger and bigger on disk, and disk access will take longer and longer.
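To make the size argument concrete, here is a minimal sketch that can be run outside CouchDB. The sample values and both function names are made up for illustration. It only compares how large each reduce output is relative to its input, and does not reproduce CouchDB's actual reduce_limit check.

```javascript
// Hypothetical sample values, shaped like the readings emitted by the map step.
const values = [];
for (let i = 0; i < 100; i++) {
  values.push({ reading: 2000 + i, readingDate: 1480000000000 + i * 86400000 });
}

// Anti-pattern: a "reduce" that concatenates, so its output grows with its input.
function concatReduce(keys, values, rereduce) {
  // On rereduce, values is an array of arrays returned by earlier reduce calls.
  return rereduce ? [].concat.apply([], values) : values;
}

// A real reduction: the output stays the same size however many rows come in.
function countReduce(keys, values, rereduce) {
  return rereduce ? values.reduce((a, b) => a + b, 0) : values.length;
}

const inputSize = JSON.stringify(values).length;
const concatSize = JSON.stringify(concatReduce(null, values, false)).length;
const countSize = JSON.stringify(countReduce(null, values, false)).length;

console.log({ inputSize, concatSize, countSize });
```

The concatenating version returns as many bytes as it received, and every level of the view index has to store that output again, whereas the counting version returns a few bytes regardless of input size.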

So... Just write a minimal map function such as:

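function (doc) {
  // One row per survey, keyed by meter number; the view returns rows sorted by key.
  // No value is emitted because include_docs=true returns the whole document.
  emit(doc.meterNumber, null);
}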

Don't write any reduce function. And call the view with include_docs=true.
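As a rough sketch of what that call can look like, the helper below builds the view URL. The host, the database name "surveys", the design document name "meters", the view name "by_meter", and the helper viewUrl itself are all assumptions made up for this example; only the /_design/.../_view/... path layout and the include_docs and key parameters come from CouchDB's HTTP API.

```javascript
// Hypothetical helper: builds the URL for querying a view with include_docs=true.
function viewUrl(base, db, ddoc, view, meterNumber) {
  const params = new URLSearchParams({
    include_docs: "true",
    // View keys are passed JSON-encoded in the query string.
    key: JSON.stringify(meterNumber),
  });
  return `${base}/${db}/_design/${ddoc}/_view/${view}?${params}`;
}

// Assumed names: database "surveys", design doc "meters", view "by_meter".
const url = viewUrl("http://localhost:5984", "surveys", "meters", "by_meter", 1);
console.log(url);
// http://localhost:5984/surveys/_design/meters/_view/by_meter?include_docs=true&key=1
```

Leaving out the key parameter would return every meter's rows, sorted by meter number, so readings for the same meter sit next to each other.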

But maybe you were not happy with the data format? No problem: list functions exist for this. Just remember that map and reduce functions should be used for pure data processing, not for formatting.
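For illustration, a list function producing the [meterNumber, [readings...]] shape from the question might look like the sketch below. It assumes the map function emits {reading, readingDate} as the row value and that keys are plain numbers or strings. getRow() and send() are supplied by CouchDB's query server; the name groupByMeter is only there so the sketch can be tried outside CouchDB, since in a design document the function would be stored anonymously.

```javascript
// Sketch of a list function. getRow() and send() are globals provided by
// CouchDB's query server when this runs from a design document.
function groupByMeter(head, req) {
  let currentKey;
  let readings = [];
  let first = true;
  let row;

  send("[");
  // View rows arrive sorted by key, so rows for the same meter are adjacent.
  // Assumes scalar keys, which can be compared with !==.
  while ((row = getRow())) {
    if (readings.length > 0 && row.key !== currentKey) {
      // Key changed: flush the group collected so far.
      send((first ? "" : ",") + JSON.stringify([currentKey, readings]));
      first = false;
      readings = [];
    }
    currentKey = row.key;
    readings.push(row.value);
  }
  // Flush the last group, if any rows were seen.
  if (readings.length > 0) {
    send((first ? "" : ",") + JSON.stringify([currentKey, readings]));
  }
  send("]");
}
```

Because the grouping happens while the rows stream through, nothing extra is written to the view index; only one meter's readings are held in memory at a time.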

edited Jun 25 '16 at 20:34 by Zach Smith

answered Jun 25 '16 at 14:03 by Aurélien Bénel

Thanks for that comment. The reason I'm using MapReduce here is to approximate a SQL JOIN. I.e. I'm scanning through 10k docs and emitting a map of [meter number, some info]. Then I'm joining the rows that have the same meter number (aka grouping the map results). The reduce function then has access to the key and all the docs associated with that key in a single array, so that I can do further calculations. I don't see how I can achieve this through a map function, since I can only process a single doc at a time? – Zach Smith Jun 25 '16 at 18:15
Ahhh... I see that you can aggregate rows using a list. And the benefit of this is that the results of calling the list function aren't indexed but are calculated on the fly, which saves hard drive space? – Zach Smith Jun 26 '16 at 07:55
@ZachSmith Yes... In a way, the main idea of MapReduce is that the core of most algorithms can be defined as a "sort" applied to all the data. For example, a simple database join can be done with what is called a "collation", which is just a `map` emitting the attributes to be linked: once the results are sorted, the objects to be linked are next to each other. For more details, see: http://couchdb.readthedocs.io/en/latest/couchapp/views/joins.html?highlight=collation – Aurélien Bénel Jun 26 '16 at 10:30
