I have records in a database that contain URLs, for example https://www.youtube.com/watch?v=blablabla.

I want to count the URLs for each site. For example:

[{
    site: 'youtube.com',
    count: 25
},
{
    site: 'facebook.com',
    count: 135
}]

I used this aggregation pipeline:

db.getCollection('records').aggregate([
    {'$match': {'url': /.*youtube\.com.*/}}, // youtube for example
    {'$group': {'_id': {'site': '$url', 'count': {'$sum': 1}}}},
    {'$project': {'_id': false, 'site': '$_id.site', 'count': '$_id.count'}}
]);

which outputs:

[{
    "site" : "youtube.com/blablabla1",
    "count" : 1.0
},
{
    "site" : "youtube.com",
    "count" : 1.0
},
{
    "site" : "www.youtube.com/blablabla2",
    "count" : 1.0
},
{
    "site" : "www.youtube.com/blablabla1",
    "count" : 1.0
}]

It won't even count identical strings correctly.

What is wrong with my approach?

Neil Lunn

1 Answer

This counts the URLs for every website.

The website name is extracted with this regex:

const testData = ['https://www.youtube.com/watch?v=UbQgXeY_zi4&list=RDUbQgXeY_zi4&index=1', 'https://www.facebook.com/maciej.kozieja.9', 'http://example.com', 'http://www.example.com']

const sites = testData.map(site => (site + '/').match(/(?:https?:\/\/)?(?:www\.)?([\w.]+)(?=\/)/)[1])

console.log(sites)
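
As a quick check before touching the database, the same extraction can be combined with an in-memory tally. This is only a sketch: `siteOf` and `countSites` are hypothetical helper names, and strings the regex cannot match are skipped instead of throwing.

```javascript
// Regex from the answer: optional scheme, optional "www.", then the host name.
const SITE_REGEX = /(?:https?:\/\/)?(?:www\.)?([\w.]+)(?=\/)/

// Hypothetical helper: returns the host name, or null when nothing matches.
function siteOf(url) {
    const match = (url + '/').match(SITE_REGEX)
    return match ? match[1] : null
}

// Hypothetical helper: tallies URLs per site into [{ site, count }].
function countSites(urls) {
    const counts = new Map()
    for (const url of urls) {
        const site = siteOf(url)
        if (site === null) continue // skip strings the regex cannot parse
        counts.set(site, (counts.get(site) || 0) + 1)
    }
    return Array.from(counts, ([site, count]) => ({ site, count }))
}

console.log(countSites([
    'https://www.youtube.com/watch?v=UbQgXeY_zi4',
    'http://example.com',
    'http://www.example.com'
]))
```

Both `example.com` variants end up under the same key because the optional `www.` prefix sits outside the capture group.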

Then we use the mapReduce function on our collection:

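db.collection('links').mapReduce(
    function () {
        // Extract the host name (without scheme or "www.") from the stored URL.
        // The question keeps the URL in the `url` field.
        var match = (this.url + '/').match(/(?:https?:\/\/)?(?:www\.)?([\w.]+)(?=\/)/)
        if (match) emit(match[1], 1) // skip documents whose URL cannot be parsed
    },
    function (key, values) {
        // Add up the partial counts. Reduce can run more than once per key,
        // so values.length would undercount.
        return Array.sum(values)
    }, { out: 'websiteLinksCount' }
)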
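
One detail worth checking in the reduce function: MongoDB can call it several times for the same key, feeding earlier results back in as values, so it needs to add the values up instead of counting them. The plain JavaScript sketch below imitates that behaviour without a database. `reduceInChunks` and the chunk size of 3 are assumptions made for illustration and are not part of any MongoDB API.

```javascript
// Two candidate reduce functions for a per-site counter.
const reduceByLength = (key, values) => values.length
const reduceBySum = (key, values) => values.reduce((a, b) => a + b, 0)

// Hypothetical helper: reduces the values in chunks, then reduces the partial
// results again, imitating a reduce function being re-invoked on its own output.
// chunkSize must be at least 2, otherwise the recursion never shrinks the input.
function reduceInChunks(reduce, key, values, chunkSize) {
    if (values.length <= chunkSize) return reduce(key, values)
    const partials = []
    for (let i = 0; i < values.length; i += chunkSize) {
        partials.push(reduce(key, values.slice(i, i + chunkSize)))
    }
    return reduceInChunks(reduce, key, partials, chunkSize)
}

// Ten emitted 1s for the same site, reduced three at a time.
const emitted = new Array(10).fill(1)
console.log(reduceInChunks(reduceBySum, 'youtube.com', emitted, 3))    // 10
console.log(reduceInChunks(reduceByLength, 'youtube.com', emitted, 3)) // 2
```

Summing gives the same total however the values are batched, while counting the array length only reports how many partial results reached the last call.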

Then we can do something with the result:

.then(x => {
    x.find({}).toArray((error, x) => {
        console.log(x) // here you have array of [{_id: siteName, value: count}]
    })
})
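
As an alternative that stays within the aggregation framework the question started from, the sketch below assumes MongoDB 4.2 or later (for `$regexFind`) and the question's `url` field holding a string. It groups on the extracted host name and keeps `$sum` as an accumulator beside `_id` instead of inside it. The intermediate field name `host` is illustrative, and the regex is a simplified, anchored variant of the one above.

```javascript
// Aggregation pipeline sketch; pass it to db.getCollection('records').aggregate(pipeline).
const pipeline = [
    // Capture the host name (without scheme or "www.") from each URL.
    { $project: {
        host: { $regexFind: {
            input: '$url',
            regex: /^(?:https?:\/\/)?(?:www\.)?([\w.]+)/
        } }
    } },
    // Group on the first capture; $sum is an accumulator next to _id, not part of it.
    { $group: {
        _id: { $arrayElemAt: ['$host.captures', 0] },
        count: { $sum: 1 }
    } },
    // Reshape each group to { site, count }.
    { $project: { _id: false, site: '$_id', count: true } }
]
```

Documents whose `url` is missing or does not match should collect under a `null` site, so a `$match` stage in front may be worth adding if that matters.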
Maciej Kozieja