This is going to be a somewhat general question, but I have a reason for that: I am asking because I am not sure what kind of approach I should take to make things faster.
I have a MongoDB server running on a big AWS instance (r3.4xlarge: 16 vCPUs and 122 GB of RAM). The database has one huge collection with 293,017,368 documents in it. Some of them have a field called paid, which holds a string value, and some do not. Likewise, some of them have an array called payment_history and some do not. I need to perform some tasks on that database, but all the documents that are missing paid, payment_history, or both are not relevant to me. So I thought of cleaning (shrinking) the DB before I proceed with the actual operations. Since the first step is to delete records matching something like ({paid: {$exists: false}}), I thought I should create an index over paid. I can see that at the present rate the index will take 80 days to finish.
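To make the cleanup concrete, this is roughly the filter I have in mind, written here as a plain object (in mongosh I would pass it straight to deleteMany):

```javascript
// Filter for documents that are irrelevant to me: anything missing
// "paid", missing "payment_history", or missing both.
const irrelevant = {
  $or: [
    { paid: { $exists: false } },
    { payment_history: { $exists: false } },
  ],
};

// In mongosh (mycol being my collection):
// db.mycol.deleteMany(irrelevant);
```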
I am not sure what my approach should be in this situation. Shall I write a map-reduce job to visit every document, perform whatever I need in one shot, and write the resulting documents to a different collection? Or shall I somehow (not sure how, though) split the huge DB into small sections, apply the transforms to each of them, and then merge the resulting records from each section into a final cleaned
record set? Or shall I somehow (not sure how) load the data into Elastic MapReduce or Redshift to operate on it there? In a nutshell, what do you think is the best route to take in such a situation?
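One variant of the "write the results into a different collection" idea that I have been considering is an aggregation with $match and $out instead of map-reduce; the pipeline below is only a sketch, and mycol_cleaned is a placeholder name:

```javascript
// Pipeline that keeps only the documents I care about: those that
// have BOTH a "paid" field and a "payment_history" array.
const keepRelevant = [
  {
    $match: {
      paid: { $exists: true },
      payment_history: { $exists: true },
    },
  },
  // Write the surviving documents into a fresh collection.
  { $out: "mycol_cleaned" },
];

// In mongosh:
// db.mycol.aggregate(keepRelevant, { allowDiskUse: true });
```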
I am sorry in advance if this question sounds a little vague. I tried to explain the real situation as well as I could.
Thanks a lot in advance for help :)
EDIT
Following the comment about sparse indexing, I am now creating a partial index instead, with this command: db.mycol.createIndex({paid: 1}, {partialFilterExpression: {paid: {$exists: true}}})
It is indexing roughly 53 documents per second, so I am not sure how long the entire collection will take. I am leaving it running overnight and will come back tomorrow to update this question. I intend to keep this question updated through the entire journey, for the sake of people who hit the same problem in the same situation in the future.
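A rough back-of-the-envelope check of that rate (the build still has to scan every document, even though only some end up in the partial index), using the two numbers from this question:

```javascript
// Estimate how long the partial index build would take at the
// observed rate. Both figures come from this question; the real
// rate may change as the build progresses.
const totalDocs = 293017368; // documents in the collection
const docsPerSecond = 53;    // observed indexing rate

const seconds = totalDocs / docsPerSecond;
const days = seconds / 86400; // 86,400 seconds per day

console.log(days.toFixed(1)); // prints 64.0
```

So at the current rate the partial index would still take around 64 days, which is barely better than the 80-day estimate for the full index.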