
I'm running Erlang Map/Reduce jobs on Riak.

When I used JavaScript M/R jobs in the past, I had to tune the JS VM settings properly. At the time I found this conversation extremely useful: http://riak-users.197444.n3.nabble.com/Follow-up-Riak-Map-Reduce-error-preflist-exhausted-td4024330.html

Now, since I'm not an Erlang developer, I wonder what the main implications of running concurrent M/R jobs on Riak are, and whether there are any VM settings to tune (as I used to do with JS M/R).

Thanks

Kara
Mark

2 Answers


So far we have found these Riak MapReduce gotchas:

  • worker_limit_reached. This happens when a lot of data arrives at the MapReduce job and the job's queue is full.
  • Reads with r=1. All data read inside MapReduce uses r=1.
  • No read repair. MapReduce reads do not trigger read repair.
  • You may get already-deleted data. Inside your MapReduce function, check the object's special metadata header that indicates the object is a tombstone (already deleted).
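The last point above can be handled inside the map phase itself. The following is a minimal sketch of a map function that skips tombstones; it assumes the standard Erlang map-phase signature `(RiakObject, KeyData, Arg)` and the `<<"X-Riak-Deleted">>` metadata key that Riak sets on deleted objects (check your Riak version's source if in doubt):

```erlang
%% Map-phase sketch: emit an object's value, but skip tombstones.
%% The module must be compiled and deployed on every Riak node.
-module(mymod).
-export([map_skip_deleted/3]).

map_skip_deleted(RiakObject, _KeyData, _Arg) ->
    Meta = riak_object:get_metadata(RiakObject),
    case dict:find(<<"X-Riak-Deleted">>, Meta) of
        {ok, _} -> [];  % tombstone: emit nothing
        error   -> [riak_object:get_value(RiakObject)]
    end.
```

This cannot run outside a Riak node, since `riak_object` lives in `riak_kv`; it is only meant to show where the deleted-object check belongs.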

P.S. This applies to Riak 1.2.1. The Basho folks resolve many issues quickly, so this may change in the near future.

danechkin
  • Thank you. I am more concerned about the performance issues in running, for example, 20 concurrent heavy M/R jobs in a cluster. I think that as long as I have free CPU/Memory available, I won't be in trouble, right? – Mark Dec 14 '12 at 07:46
  • If you already write Erlang code, you could use Riak Pipe https://github.com/basho/riak_pipe (this is the engine behind Riak's MapReduce). We use Riak Pipe directly, and this allows us to run heavy MapReduce jobs without such problems. – danechkin Dec 14 '12 at 07:50

Basically what happens here is that all phases of the map/reduce query are performed by the Erlang VM, not by Erlang+JS. Since the jobs are isolated in separate Erlang VM processes, they do not interfere with each other. Host-wise you have the same computational power, so that is also fine. Regarding Erlang VM parameters, many of them are already tweaked to improve Riak operations, so your query is good to go.

user425720
  • That means I can run 20 concurrent heavy M/R jobs in a cluster and don't worry about it as long as I have enough CPU/Memory available on my machines? – Mark Dec 14 '12 at 07:16