I have a bucket with approximately 900,000 records. The majority of those records have a status of PERSISTED in a secondary index. I want to retrieve all base_urls and a count of how many documents belong to each base_url for all docs that are marked PERSISTED.
Here is the query:
curl -X POST -H "content-type: application/json" \
http://localhost:8098/mapred?chunked=true --data @-<<\EOF
{
"timeout":600000,
"inputs":{
"bucket":"test-bucket",
"index":"status_bin",
"key":"PERSISTED"
},
"query":[{
"map":{
"language":"javascript",
"source":"
function(value, keyData, arg) {
var data = Riak.mapValuesJson(value)[0];
var obj = {};
obj[data.base_url] = 1;
return [obj];
}
"
}
},
{
"reduce":{
"language":"javascript",
"source":"
function(values, arg){
return [values.reduce(
function(acc, item){
for(var base_url in item){
if(acc[base_url]) {
acc[base_url] = acc[base_url] + 1
} else {
acc[base_url] = item[base_url];
}
}
return acc;
})
];
}
"
}
}]
EOF
This is timing out after 10 minutes.
I am on a 16 core 3Ghz AWS node with 20Gb of memory.
Is there something that I am possibly doing wrong, either with my configuration or with the above query?
Should it possibly take so long?
To give perspective, the equivalent query in MySQL would look something like this
SELECT COUNT(*), catalog FROM urls GROUP BY catalog;
I have not tried it, but I suspect that in MySQL a result set from the above query over 900,000 records would take several seconds. I am not meaning to compare Riak to MySQL since I realize that they are very different, but I am wondering how I can at the very least, execute the above query in under 10 minutes.
Thanks!