2

I am submitting a pretty simple query to MongoDB (version 2.6) using the pymongo library for python:

query = {"type": "prime"}
logging.info("Querying the DB")
docs = usaspending.get_records_from_db(query)
logging.info("Done querying. Sorting the results")
docs.sort("timestamp", pymongo.ASCENDING)
logging.info("Done sorting the results, getting count")
count = docs.count(True)
logging.info("Done counting: %s records", count)
pprint(docs[0])
raise Exception("End the script right here")

The get_records_from_db() function is quite simple:

def get_records_from_db(query=None):
    return db.raws.find(query, batch_size=50)

Note that I will actually need to work with all the documents, not just docs[0]. I am just trying to get docs[0] as an example.

When I run this query the output I get is:

2015-01-28 10:11:05,945 Querying the DB
2015-01-28 10:11:05,946 Done querying. Sorting the results
2015-01-28 10:11:05,946 Done sorting the results, getting count
2015-01-28 10:11:06,617 Done counting: 559952 records

However I never get back docs[0]. I have an indexes on {"timestamp": 1} and {"type": 1}, and queries seem to work reasonably well (as the count is returned quite fast), but I am not sure why I never get back the actual document (the docs are quite small [under 50K]).

Arnob
  • 467
  • 1
  • 4
  • 13

2 Answers2

1

PyMongo does no actual work on the server when you execute these lines:

query = {"type": "prime"}
docs = usaspending.get_records_from_db(query)
docs.sort("timestamp", pymongo.ASCENDING)

At this point "docs" is just a PyMongo Cursor, but it has not executed the query on the server. If you run "count" on the Cursor, then PyMongo does a "count" command on the server and returns the result, but the Cursor itself still hasn't been executed.

However, when you run this:

docs[0]

Then in order to get the first result, PyMongo runs the query on the server. The query is filtered on "type" and sorted by "timestamp", so try this on the mongo shell to see what's wrong with the query:

> db.collection.find({type: "prime"}).sort({timestamp: 1}).limit(1).explain()

If you see a very large "nscanned" or "nscannedObjects", that's the problem. You probably need a compound index on type and timestamp (order matters):

> db.collection.createIndex({type: 1, timestamp: 1})

See my article on compound indexes.

A. Jesse Jiryu Davis
  • 23,641
  • 4
  • 57
  • 70
  • That query itself takes too long to complete :-) – Arnob Jan 29 '15 at 02:17
  • Which query? The "explain" takes too long? – A. Jesse Jiryu Davis Jan 29 '15 at 02:42
  • In that case try creating the index. – A. Jesse Jiryu Davis Jan 29 '15 at 13:36
  • Yes. Creating the index fixed the issue. I wonder why the performance was so slow? Perhaps a collection scan was being performed. I will mark this as the answer. – Arnob Jan 29 '15 at 14:56
  • Thanks! It was slow because your existing indexes, {"timestamp": 1} and {"type": 1}, weren't very efficient for your query. MongoDB chose *one* of them to use, but then it either had to use the "type" index and do an in-memory sort on "timestamp", or it had to use the "timestamp" index for sorting and skip many records with the wrong type. A compound index gets you the answer very efficiently. – A. Jesse Jiryu Davis Jan 29 '15 at 16:30
0

The reason you never get back the actual documents is because Mongo batches together these commands into one query, so you would look at it this way:

find the records then sort the records and then count the records.

You need to build two totally separate queries:

  1. find the records, sort the records, then give me the records
  2. count the records

If you chain them, Mongo will chain them and think that they are one command.

EmptyArsenal
  • 7,314
  • 4
  • 33
  • 56