3

I have a collection called 'my_emails' where are stored email addresses :

[
    { email:"russel@gmail.com"},
    { email:"mickey@yahoo.com"},
    { email:"john@yahoo.com"},
]

and I try to get the top 10 hostnames used...

[
{host: "gmail.com", count: 1000},
{host: "yahoo.com", count: 989}, ...
]

if I had MySQL, I’ll do this query :

SELECT substr(email,locate('@',email)+1,255) AS host,count(1) AS count
FROM my_emails 
WHERE email like '%@%' 
GROUP BY substr(email,locate('@',email)+1,255)
ORDER BY count(1) DESC 
LIMIT 10

how can I do with mongodb ? I try without result something like this :

db.my_emails.aggregate([ { $group : {_id : "$host", count : { $sum : 1 }}}]);

I don't know how to make the $host value without adding a new property to my records

styvane
  • 59,869
  • 19
  • 150
  • 156
sly63
  • 305
  • 2
  • 6
  • You will likely get better answer if you provide sample document with expected result. Also you should avoid title like "*MongoDB vs SQL query*" – styvane Oct 07 '15 at 21:14

2 Answers2

2

MongoDB doesn't provide any operator like locate but you can use .mapReduce to do this:

db.collection.mapReduce(
    function() {
        emit(this.email.substr(this.email.indexOf('@') + 1), 1);  
    }, 
    function(host, count) { 
        return Array.sum(count) ; }, 
    { out: "hosts" } 
)

Then db.hosts.find().sort({ 'value': -1 }).limit(10) returns top 10 hostname:

{ "_id" : "yahoo.com", "value" : 2 }
{ "_id" : "gmail.com", "value" : 1 }
styvane
  • 59,869
  • 19
  • 150
  • 156
0

An alternative workaround would be to modify your data structure by introducing another field in your schema which holds only the domain value of the email address. This can be done with a bulk update using the Bulk API operations that give a better write response i.e. useful information about what actually happened during the update:

var bulk = db.my_emails.initializeUnorderedBulkOp(),    
    count = 0;

db.my_emails.find().forEach(function(doc) {
    var domain = doc.email.replace(/.*@/, ""),
        update = { domain: domain };
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": update
    })
    count++;
    if (count % 1000 == 0) {            
        bulk.execute();
        bulk = db.my_emails.initializeUnorderedBulkOp();
    }
})

if (count % 1000 != 0) { bulk.execute(); } 

Bulk update response from sample:

BulkWriteResult({
    "writeErrors" : [ ],
    "writeConcernErrors" : [ ],
    "nInserted" : 0,
    "nUpserted" : 0,
    "nMatched" : 3,
    "nModified" : 3,
    "nRemoved" : 0,
    "upserted" : [ ]
})

After this update, a query on the collection db.my_emails.find().pretty() will yield:

{
    "_id" : ObjectId("561618af645a64b1a70af2c5"),
    "email" : "russel@gmail.com",
    "domain" : "gmail.com"
}
{
    "_id" : ObjectId("561618af645a64b1a70af2c6"),
    "email" : "mickey@yahoo.com",
    "domain" : "yahoo.com"
}
{
    "_id" : ObjectId("561618af645a64b1a70af2c7"),
    "email" : "john@yahoo.com",
    "domain" : "yahoo.com"
}

Now, having the domain field will make it easier for the aggregation framework to give you the host count through the $sum operator in the $group pipeline. The following pipeline operation will return the desired outcome:

db.my_emails.aggregate([
    {
        "$group": {
            "_id": "$domain",
            "count": { "$sum": 1 }
        }
    }
])

Output:

{ "_id" : "yahoo.com", "count" : 2 }
{ "_id" : "gmail.com", "count" : 1 }
chridam
  • 100,957
  • 23
  • 236
  • 235