
I have a bucket with approximately 900,000 records. The majority of those records have a status of PERSISTED in a secondary index. I want to retrieve all base_urls and a count of how many documents belong to each base_url for all docs that are marked PERSISTED.

Here is the query:

curl -X POST -H "content-type: application/json" \
    http://localhost:8098/mapred?chunked=true --data @-<<\EOF
{
    "timeout":600000,
    "inputs":{
       "bucket":"test-bucket",
       "index":"status_bin",
       "key":"PERSISTED"
    },
    "query":[{
        "map":{
            "language":"javascript",
            "source":"
                function(value, keyData, arg) {
                    var data = Riak.mapValuesJson(value)[0];
                    var obj = {};
                    obj[data.base_url] = 1;
                    return [obj];
                }
            "
        }
    },
    {
        "reduce":{
            "language":"javascript",
            "source":"
                function(values, arg){ 
                    return [values.reduce(
                        function(acc, item){ 
                            for(var base_url in item){
                                if(acc[base_url]) {
                                    acc[base_url] = acc[base_url] + item[base_url];
                                } else {
                                    acc[base_url] = item[base_url];
                                }
                            }
                            return acc;
                        })
                    ];
                }
            "
        }
    }]
}
EOF

This is timing out after 10 minutes.

I am on a 16-core 3 GHz AWS node with 20 GB of memory.

Is there something that I am possibly doing wrong, either with my configuration or with the above query?

Should it really take this long?

To give perspective, the equivalent query in MySQL would look something like this:

SELECT COUNT(*), catalog FROM urls GROUP BY catalog;

I have not tried it, but I suspect that in MySQL the above query over 900,000 records would return in a few seconds. I am not trying to compare Riak to MySQL, since I realize they are very different; I am just wondering how I can, at the very least, execute the above query in under 10 minutes.

Thanks!

chaimp

1 Answer


JavaScript MapReduce jobs in Riak use a pool of SpiderMonkey JavaScript VMs, and it is important to tune the size of these pools to your usage pattern in order to avoid, or at least reduce, contention. The pool sizes are specified through the 'map_js_vm_count' and 'reduce_js_vm_count' parameters in the app.config file.

As you are running on a single node and have only a single map phase, I would recommend setting the 'map_js_vm_count' parameter to the size of your ring, which by default is 64. A more in-depth description can be found here.
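For reference, these parameters live in the riak_kv section of app.config and take effect after a node restart. A minimal sketch (the reduce_js_vm_count value here is just a starting point to tune, see below):

{riak_kv, [
    %% one JavaScript VM per partition for map phases (default ring size is 64)
    {map_js_vm_count, 64},
    %% raise this as well if you enable pre-reduce (see below)
    {reduce_js_vm_count, 32}
    %% ... other riak_kv settings ...
]}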

While map phase processing scales easily and is done in parallel, a central reduce phase can easily become the bottleneck, as it runs recursively on a single node. This can be addressed by passing a parameter to the map phase to enable pre-reduce and by increasing the reduce phase batch size, as described here. Enabling pre-reduce allows the first iteration of the reduce phase to run in parallel, which will most likely increase the efficiency of your job. You will, however, need to increase the number of VMs available to reduce phase functions by raising the 'reduce_js_vm_count' parameter quite a bit.
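As a sketch of how this looks in the job itself: pre-reduce is requested through the map phase's static argument, and the batch size through the reduce phase's argument. The option names below ('do_prereduce', 'reduce_phase_batch_size') are from the Riak 1.x MapReduce documentation as I remember them, so verify them against your version; your functions ignore the extra 'arg', so they do not need to change.

{
    "timeout":600000,
    "inputs":{
       "bucket":"test-bucket",
       "index":"status_bin",
       "key":"PERSISTED"
    },
    "query":[{
        "map":{
            "language":"javascript",
            "source":"... same map function as in the question ...",
            "arg":{"do_prereduce":true}
        }
    },
    {
        "reduce":{
            "language":"javascript",
            "source":"... same reduce function as in the question ...",
            "arg":{"reduce_phase_batch_size":1000}
        }
    }]
}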

If you run large MapReduce jobs concurrently, the number of JavaScript VMs required to support them can become quite large. Converting map and reduce phase functions into Erlang is generally encouraged, as it eliminates JS VM contention and also performs better due to lower VM-related overhead. This is always recommended for MapReduce jobs that you intend to run on a regular basis.
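To illustrate, a rough, untested sketch of an Erlang version of your two phases might look like the module below. The module and function names are placeholders of mine, and it assumes each value is a JSON document with a "base_url" field and no siblings (error handling and not_found inputs are omitted):

-module(base_url_counts).
-export([map_base_url/3, reduce_merge_counts/2]).

%% Map phase: emit a one-entry proplist [{BaseUrl, 1}] per object.
map_base_url(RiakObject, _KeyData, _Arg) ->
    {struct, Fields} = mochijson2:decode(riak_object:get_value(RiakObject)),
    BaseUrl = proplists:get_value(<<"base_url">>, Fields),
    [[{BaseUrl, 1}]].

%% Reduce phase: merge the proplists, summing the counts per base_url.
%% Safe for re-reduce, since its own output is also a proplist of counts.
reduce_merge_counts(Values, _Arg) ->
    Merged = lists:foldl(
               fun(Counts, Acc) ->
                       orddict:merge(fun(_Url, A, B) -> A + B end,
                                     orddict:from_list(Counts), Acc)
               end, orddict:new(), Values),
    [Merged].

The compiled module has to be on the code path of every node (for example via the 'add_paths' setting in the riak_kv section of app.config), and the phases are then referenced by module and function name instead of a JavaScript source string.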

Christian Dahlqvist
  • Do you think that if I tune it properly, I would be able to get all unique values for the base_url field in under a few minutes (a big stretch from taking over an hour)? I ask this, because I tried a simple find command across the data/leveldb/ directory, examining the strings inside all of the .sst files and piping to grep for a certain field and it captured over 1 million records in under 2 minutes. While I understand that there is no comparison since the disk operation did not involve random reads, I would expect Riak to at least be close to that. Do you think I have a shot? – chaimp Apr 11 '13 at 03:31
  • Riak will need to read a copy of the entire object from disk before passing it into the map phase as the keys are identified, and this will result in a lot of non-sequential disk reads. I therefore doubt it will be as fast as scanning the data files sequentially and searching for patterns, but it is hard for me to tell how long it will take as it depends a lot on the I/O performance of your system. – Christian Dahlqvist Apr 11 '13 at 04:01
  • I tried the suggestions and it ran for about 8 minutes and then gave a "fitting" error. I am getting the idea that Riak is not intended to be used in the way that I am trying to use it. Do you think that what I am trying to accomplish can be better done by pre-storing the data (like you suggested in another answer)? Even if I get past the possible JavaScript VM bottleneck, it seems like just the time to read each object from disk is a bottleneck and there is no way around that, right? – chaimp Apr 11 '13 at 04:09
  • Riak MapReduce is designed to scale across a cluster, and has not been optimized to run on a single machine. It generally works best when run on datasets which are not too large, as it does a lot of random reads. You may get better results by aggregating data upfront, which is a common approach. Another approach may be to add a secondary index containing a timestamp and create separate aggregation records periodically based on this, e.g. once per hour (depending on data volumes). – Christian Dahlqvist Apr 11 '13 at 04:19
  • How you model your data depends a lot on your access patterns and how you need to be able to query your data. If you need to serve a lot of requests based on aggregated data like in your example, it makes sense to 'precompute' this (at least to some extent) and store it in separate records. Try to make sure that the vast majority of your queries can be efficiently served through direct key lookups in order to ensure optimal performance and scalability. – Christian Dahlqvist Apr 11 '13 at 04:27
  • Aha, thank you for explaining that. Would there be any significant performance gains if I were to run multiple instances on a single machine? Regarding the idea to aggregate data up front, how would you recommend storing it? I was thinking about a separate bucket where this data that I want to access are the key names, but that seems like a hack to me. Is there a more common way to do that? And, regarding the timestamp approach, are you suggesting that I write all of the keys that I want to store to a single object per timestamp? If so, would that be a problem with simultaneous writes? – chaimp Apr 11 '13 at 04:31
  • In my current situation, I would not say that I "serve a lot of requests based on aggregated data", but rather periodically need to access this as part of a larger build cycle. Still, a relatively simple operation of grabbing a single field across the whole data set should not take many hours or necessitate additional hardware. I think the idea of building these aggregates is the way to go. That said, I am interested in this "common approach" that you describe to store the data separately and how to deal with simultaneous writes to the same record. – chaimp Apr 11 '13 at 04:36