3

I have > 8000 records in my DB and this is one of them :

{
    "_id" : ObjectId("57599c498c39598eafb781b9"),
    "_class" : "vn.cdt.entity.db.AccessLog",
    "url" : "/shop/huenguyenshop/browse",
    "ip" : "10.0.0.238",
    "sessionId" : "86E5CF8E6D465A6EDFE7C9BF7890AA4B",
    "oldSessionId" : "86E5CF8E6D465A6EDFE7C9BF7890AA4B",
    "cookie" : "{\"sessionId\":\"86E5CF8E6D465A6EDFE7C9BF7890AA4B\",\"objects\":[{\"id\":\"903815555908\",\"type\":\"VIEW_SHOP\",\"count\":1}]}",
    "isCookie" : true,
    "createTime" : NumberLong(1464935913641),
    "objectId" : "903815555908",
    "type" : "VIEW_SHOP"
}

What i want to do :

I want to find all record have same oldSessionId with (type: VIEW_ITEM or type: BUY_ITEM) and createTime is lastest.

What i had tried :

pipeline = ([
                {"$group" : { "_id": "$oldSessionId", "count": { "$sum": 1 } }},
                {"$match": {"count" : {"$gt": 1} } },
                {"$project": {"oldSessionId" : "$_id", "_id" : 0} }
            ])

But that pipeline give me only sessionId

    find({'createTime': {'$lt':1464419127000, '$gt':1464332727000}, 
'$or':[{'type':'BUY_ITEM'},{'type':'VIEW_ITEM'}]})

That find give me all records with type: VIEW_ITEM or type: BUY_ITEM in specific time.

I don't know how to add filter with type and createTime to get what i want.

Update Thanks @chridam for helping me :

If i want add specific date to aggregation, i can make add query like this :

 pipeline = \
    (
        [
            { "$match": {
                         "createTime": {"$lt":1464419127000, "$gt":1464332727000 },
                         "type": { "$in": ["VIEW_ITEM", "BUY_ITEM"] }
                        }
            },
            { "$sort": { "createTime": -1, "oldSessionId": 1 } },
            {
                "$group":
                    { "_id": "$oldSessionId",
                      "_class": { "$first": "$_class" },
                      "url": { "$first": "$url" },
                      "ip": { "$first": "$ip" },
                      "sessionId": { "$first": "$sessionId" },
                      "oldSessionId": { "$first": "$oldSessionId" },
                      "cookie": { "$first": "$cookie" },
                      "isCookie": { "$first": "$isCookie" },
                      "createTime": { "$first": "$createTime" },
                      "objectId": { "$first": "$objectId" },
                      "type": { "$first": "$type" },
                    }
            }

        ]

    )
Phan Duc
  • 141
  • 1
  • 8

1 Answers1

1

To get all the documents that have the same oldSessionId with (type: VIEW_ITEM or type: BUY_ITEM) and createTime is lastest, you need to conduct an aggregation pipeline show that has the following actors (stages):

  1. $match stage:

    • This will filter all documents that have a type of either VIEW_ITEM or BUY_ITEM. You can use the $in operator with the query as it allows you to select the documents where the value of the type field equals any value a the specified array, which happens to be a list with the two possible type values i.e. ["VIEW_ITEM", "BUY_ITEM"].
  2. $sort stage

    • this will afford the documents from the previous pipeline (above) to be ordered. This is necessary because you want to aggregate these filtered documents on the latest createTime field.
  3. $group stage

    • In this ultimate step, you group all the ordered documents by the oldSessionId key, add the fields you want using the $first operator.

Piecing all the above pipes together to form the following aggregation pipeline:

pipeline = [
    { "$match": {  "type": { "$in": ["VIEW_ITEM", "BUY_ITEM"] } } },
    { "$sort": { "createTime": -1, "oldSessionId": 1 } },
    {
        "$group": {
            "_id": "$oldSessionId",
            "_class": { "$first": "$_class" },
            "url": { "$first": "$url" },
            "ip": { "$first": "$ip" },
            "sessionId": { "$first": "$sessionId" },
            "cookie": { "$first": "$cookie" },
            "isCookie": { "$first": "$isCookie" },
            "createTime": { "$first": "$createTime" },
            "objectId": { "$first": "$objectId" },
            "type": { "$first": "$type" },
        }
    }
]
chridam
  • 100,957
  • 23
  • 236
  • 235
  • again you help me solved my problem. Thank you so much @chridam. ` "cookie": : { "$first": "$cookie" } ` has two ":" after "cookie". ` "cookie": { "$first": "$cookie" }` . If i put `.count() ` after `aggregate(pipeline)` its show error `TypeError: db.getCollection(...).aggregate(...).count is not a function : @(shell):1:1` How i can check number of record after aggregation ? – Phan Duc Jun 22 '16 at 10:24
  • Thanks for noticing the typo and you are welcome, always happy to help :) As for the follow-up query, the `aggregate()` method does not have a `count()` method, it's only found on the find() method, so you cannot apply the function on `aggregate()`. Seems like a new problem what you want to achieve, can you please create a new question for it? – chridam Jun 22 '16 at 10:27
  • i used pymongo and make a loop to count number of documents. And its show ok ! – Phan Duc Jun 22 '16 at 11:38
  • when i put ` { "$sort": { "createTime": -1, "oldSessionId": 1 } },` the program show error : **MongoDB : Sort exceeded memory limit of 104857600 bytes** . If i remove ` { "$sort": { "createTime": -1, "oldSessionId": 1 } },` and use `$last` replace `$first`, Is result still the same ? – Phan Duc Jun 23 '16 at 04:13
  • The result wont be the same because the documents are not ordered. Please read the [**manual**](https://docs.mongodb.com/manual/reference/operator/aggregation/first/#grp._S_first) to get a better understanding. – chridam Jun 23 '16 at 06:53
  • how to use your `pipeline` with specific time like this : `{'createTime': {'$lt':1464419127000, '$gt':1464332727000}` . Cause your `pipeline` load all the records but if i want load record in specific time ? – Phan Duc Jun 24 '16 at 05:03
  • @PhanDuc Please create a new question for this if you want a detailed response; questions in comments that are different from the original post are not recommended here on SO as editing existing answers to address additional questions will deface the post. However, to answer your question in brief, you need to include that query within the `$match` pipeline i.e. it should be `{ "$match": { "createTime": { "$lt": 1464419127000, "$gt":1464332727000 }, "type": { "$in": ["VIEW_ITEM", "BUY_ITEM"] } } }, ...` – chridam Jun 24 '16 at 06:51