2

I am a novice when it comes to mongo as I have traditionally only worked with Oracle database. I have a mongo database that's storing bitbucket data in columns like so:

_id | _class | collectorItemId| firstEverCommit | scmUrl | scmBranch | scmAuthor | scmCommitTimestamp

There are a few more columns in there that I've omitted for the sake of time. For the scmBranch column, the column is populated with one of two strings: "master" or "develop". Here is a sample of what the data looks like: enter image description here

Here is the document view of one of the rows:

{
"_id" : ObjectId("5e39d6a0330c130006a042c6"),
"collectorItemId" : ObjectId("5e33a6b9887ef5000620a0c0"),
"firstEverCommit" : false,
"scmUrl" : "sampleRepo1",
"scmBranch" : "master",
"scmRevisionNumber" : "a2ad6842468eb55bffcbe7d700b6addd3eb11629",
"scmAuthor" : "son123",
"scmCommitTimestamp" : NumberLong(1580841662000)
}

I am now trying to formulate mongo queries that will get me the following data:

 1. For each scmUrl, If max(scmCommitTimestamp) where scmBranch =
    "develop" > max(scmCommitTimestamp) where scmBranch = "master" THEN
    count the number of rows (i.e commits) where scmBranch = "develop"
    AND scmCommitTimestamp > max(scmCommitTimestamp) where scmBranch =
    "master"

 2. For the results found in #1, find the oldest commit and newest
    commit

So far, the best mongo query I've been able to come up with is the following:

db.bitbucket.aggregate([{
    "$group": {
        "_id": {
            "scmUrl": "$scmUrl",
            "scmBranch": "$scmBranch"
        },
        "MostRecentCommit": {
            "$max": {"$toDate":"$scmCommitTimestamp"}
        }
    }
},{
    "$project": {
        "RepoName": {"$substr": ["$_id.scmUrl",39,-1]},
        "Branch": "$_id.scmBranch",
        "MostRecentCommit": "$MostRecentCommit"
    }
},{
   "$sort":{
       "RepoName":1,
       "Branch":1
       }

}
])

But this only gets me back the most recent commit for the develop branch and the master branch of each scmUrl (i.e repo), like so: enter image description here

Ideally, I'd like to get back a table of results with the following columns:

scmUrl/RepoName | Number of commits on develop branch that are not on master branch| oldest commit in develop branch that's not in master branch | newest commit in develop branch that's not in master branch

How can I modify my mongo query to extract the data that I want?

s7vr
  • 73,656
  • 11
  • 106
  • 127
user3266259
  • 369
  • 3
  • 8
  • 22

1 Answers1

1

You could try something like this.

Below query will grab the latest commit date from master for each repo. After you have the latest commit date you will join back to the same collection to pull all commits where branch is develop and has newer commits than master branch for each repo.

db.bitbucket.aggregate([
  {"$match":{"scmBranch":"master"}},
  {"$group":{"_id":"$scmUrl","recentcommitdate":{"$max":"$scmCommitTimestamp"}}},
  {"$lookup":{
   "from":"bitbucket",
    "let":{"scmUrl":"$_id","recentcommitdate":"$recentcommitdate"},
    "pipeline":[
      {"$match":{"$expr":
        {"$and":[
          {"$eq":["$scmBranch","develop"]},
          {"$eq":["$scmUrl","$$scmUrl"]},
          {"$gte":["$scmCommitTimestamp", "$$recentcommitdate"]}
        ]}
      }},
      {"$sort":{"scmCommitTimestamp":-1}}
    ],
  "as":"commits"
  }},
  {"$match":{"commits.0":{"$exists":true}}},
  {"$project":{
     "commits":{"$size":"$commits"},
     "lastcommit":{"$arrayElemAt":["$commits",0]},
     "firstcommit":{"$arrayElemAt":["$commits",-1]}
  }}
])

Sample added here https://mongoplayground.net/p/wLnFY0H_nJz

Update for revision ids

db.bitbucket.aggregate([
  {"$match":{"scmBranch":"master"}},
  {"$group":{"_id":"$scmUrl","revisionIds":{"$push":"$scmRevisionNumber"}}},
  {"$lookup":{
   "from":"bitbucket",
    "let":{"scmUrl":"$_id","revisionIds":"$revisionIds"},
    "pipeline":[
      {"$match":{"$expr":
        {"$and":[
          {"$eq":["$scmBranch","develop"]},
          {"$eq":["$scmUrl","$$scmUrl"]},
          {"$not":[{"$in":["$scmRevisionNumber","$$revisionIds"]}]}
        ]}
      }},
      {"$sort":{"scmCommitTimestamp":-1}}
    ],
  "as":"commits"
  }},
  {"$match":{"commits.0":{"$exists":true}}},
  {"$project":{
     "commits":{"$size":"$commits"},
     "lastcommit":{"$arrayElemAt":["$commits",0]},
     "firstcommit":{"$arrayElemAt":["$commits",-1]}
  }}
])
s7vr
  • 73,656
  • 11
  • 106
  • 127
  • I ran this and it did not fetch any records even though I know there are develop branches that are ahead of master in my collection – user3266259 Feb 17 '20 at 15:16
  • 1
    Sorry, fixed the invalid reference and added a mongoplayground working link. Please try now. – s7vr Feb 17 '20 at 17:00
  • 1
    Thank you so much! I've gone ahead and accepted this as the answer to my original question as it's helped a TON! However, how can I modify this query so that instead of going by timestamp, I make a set of all scmRevisionNumbers for each each repo on master. Then I do a lookup to pull all commits on develop where the scmRevisionNumber is not in the set I made. – user3266259 Feb 18 '20 at 03:40