
Can anyone recommend app.config settings for the map and reduce JavaScript VM pools?

My current setup consists of two (2) Amazon EC2 m1.medium instances in the cluster. Each server has a single CPU and ~4GB of RAM. My ring size is set to 64 partitions, with 8 JS VMs for map phases, 16 JS VMs for reduce phases, and 2 for hooks. I am planning to add another instance to the cluster, to make it three, but I'm trying to stretch as much as possible until then.

I recently encountered high wait times for queries over a set of a few thousand records (the query fetches the 25 most recent news feed articles from a bucket), resulting in timeouts. As a workaround, I passed `reduce_phase_only_1` as a static argument. My query was structured as follows:

1) 2i index search
2) map phase to filter out deleted articles
3) reduce phase to sort on creation time (this is where I added the `reduce_phase_only_1` arg)
4) reduce phase to slice the top of the results
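For reference, here is a simplified sketch of the job as I submit it to Riak's `/mapred` HTTP endpoint (POSTed with `Content-Type: application/json`; the bucket, index, and field names are placeholders, not my real ones):

```json
{
  "inputs": {"bucket": "articles", "index": "feed_bin", "key": "user123"},
  "query": [
    {"map": {"language": "javascript",
             "source": "function(v){ var a = JSON.parse(v.values[0].data); return a.deleted ? [] : [a]; }"}},
    {"reduce": {"language": "javascript",
                "source": "function(values){ return values.sort(function(a, b){ return b.created_at - a.created_at; }); }",
                "arg": {"reduce_phase_only_1": true}}},
    {"reduce": {"language": "javascript",
                "source": "function(values){ return values.slice(0, 25); }",
                "keep": true}}
  ]
}
```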

Anyone know how to alleviate the bottleneck?

Cheers,

-Victor


1 Answer


Your map phase functions execute in parallel, close to the data, while a reduce phase generally runs iteratively on a single node using a single VM. You should therefore increase the VM pool size for map phases and decrease it for reduce phases. This has been described in greater detail here.
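As a rough illustration, the relevant settings live in the `riak_kv` section of app.config. The counts below are only a sketch for a single-CPU, ~4GB node, not tested recommendations:

```erlang
%% app.config, riak_kv section (illustrative values only)
{riak_kv, [
    %% map phases fan out across vnodes, so give them the larger pool
    {map_js_vm_count, 24},
    %% reduce runs on the coordinating node, so a small pool suffices
    {reduce_js_vm_count, 6},
    {hook_js_vm_count, 2},
    %% memory per JS VM in MB; total JS memory is roughly the sum of
    %% all pool sizes multiplied by this value
    {js_max_vm_mem, 8}
]}
```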

I would also recommend not using the `reduce_phase_only_1` flag, as avoiding it leaves you the option of enabling pre-reduce if volumes grow. Pre-reduce will result in a number of reduce phase functions running in parallel, which will require a larger pool size. You could also merge your two reduce phase functions into one that, in each iteration, sorts and then cuts excessive results.
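A minimal sketch of such a merged reduce function, assuming articles carry a numeric `created_at` field and the page size arrives as the phase's static arg (both assumptions, adjust to your data):

```javascript
// Sorts newest-first and trims in every iteration. Because running the
// function on its own output changes nothing, it stays correct under
// re-reduce and is safe to use with pre-reduce.
function sortAndSliceNewest(values, arg) {
  var limit = (arg && arg.limit) || 25;
  values.sort(function (a, b) {
    return b.created_at - a.created_at; // newest first
  });
  return values.slice(0, limit);
}
```

Trimming early also means less data is shipped between reduce iterations.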

MapReduce is a flexible way to query your data, but it is also quite expensive, especially compared to direct key access. It is therefore best suited to batch-type jobs where you can control the level of concurrency and the amount of load MapReduce puts on the system. It is generally not recommended for serving user-driven queries, as a spike in traffic can overload the cluster.

Instead of generating the appropriate data for every request, it is very common to de-normalise and pre-compute data when using Riak. In your case you might keep lists of news in separate summary objects and update these as articles are inserted, updated, or deleted. This adds a bit of work on writes, but makes reads much more efficient and scalable, as a feed can be served through a single GET request rather than a MapReduce job. For a read-heavy application this is often a very good design.
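A rough sketch of what maintaining such a summary object might look like over Riak's HTTP API (the bucket and key names are made up, and a real implementation would need to round-trip the vclock to avoid siblings):

```javascript
// Read-modify-write of a per-feed summary object (Node 18+, global fetch).
// Error handling and vclock handling omitted for brevity.
var RIAK = "http://localhost:8098";

async function addToSummary(feedKey, articleKey, createdAt) {
  var url = RIAK + "/buckets/feed_summaries/keys/" + feedKey;

  var res = await fetch(url);
  var summary = res.ok ? await res.json() : { articles: [] };

  summary.articles.push({ key: articleKey, created_at: createdAt });
  summary.articles.sort(function (a, b) { return b.created_at - a.created_at; });
  summary.articles = summary.articles.slice(0, 25); // keep only the newest 25

  await fetch(url, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(summary)
  });
}
```

Reads then become a single GET of `feed_summaries/<feedKey>`.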

If inserts and updates are too frequent to keep these summary objects updated efficiently on every write, a batch job could instead rebuild them at specific intervals, provided it is acceptable for the view not to be 100% up to date.
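A minimal sketch of that pattern, where `rebuildSummary` is a hypothetical function that runs the MapReduce job and stores the result as the summary object:

```javascript
// Rebuild the summary on a timer instead of on every write; the interval
// is the upper bound on how stale the view can get.
setInterval(function () {
  rebuildSummary("user123").catch(console.error); // hypothetical helper
}, 5 * 60 * 1000); // every 5 minutes, for example
```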

Christian Dahlqvist
  • Thanks for the insight, Christian. I had a question about `reduce_phase_only_1`: you mentioned that avoiding the flag leaves the option of pre-reducing as volumes grow. Could you elaborate on that? I don't feel too good about using the flag either, but until I can afford to add more instances to the cluster, I'm kind of stuck :( As for combining the functions... genius! If you hadn't mentioned it, I probably would not have noticed that area for improvement. – Vic Mar 13 '14 at 08:43
  • Using the `reduce_phase_only_1` flag requires all data to be collected on the coordinating node before the reduce phase is run. This is not a problem for small data sets, but becomes problematic once volumes grow. If you write your reduce function according to the guidelines, you can increase efficiency by enabling `do_prereduce` once volumes grow. – Christian Dahlqvist Mar 14 '14 at 09:31
  • In a pre-reduce phase, what would you suggest to shrink all of the data down to a subset? Would something like filtering out news feed articles based on their timestamp work? – Vic Mar 14 '14 at 11:41
  • A reduce phase run in pre-reduce mode allows the first reduce iteration to be performed on the node where the data resides, with only the results from that iteration sent across the cluster for the next iteration on the coordinating node. If each iteration sorts and then drops data that is certain not to be needed, less data is sent around the cluster, especially if a reasonably large batch size is used. – Christian Dahlqvist Mar 14 '14 at 12:35