This is going to be a somewhat general question, but I have a reason for that: I am asking because I am not sure what kind of approach I should take to make things faster.
I have a MongoDB server running on a big AWS instance (r3.4xlarge: 16 vCPUs and 122 GB of RAM). The database has one huge collection with 293,017,368 documents in it. Some of them have a field called paid, which holds a string value, and some do not. Likewise, some of them have an array called payment_history and some do not. I need to perform some tasks on that database, but all the documents that are missing paid, payment_history, or both are not relevant to me. So I thought of cleaning (shrinking) the DB before I proceed with the actual operations. Since the first step is to delete records matching something like ({paid: {$exists: false}}), I thought I should create an index over paid. I can see that at the present rate the index will take 80 days to finish.
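To make the cleanup concrete, this is roughly the filter I have in mind, written here as a plain object (in mongosh I would pass it straight to deleteMany):

```javascript
// Filter for documents that are irrelevant to me: anything missing
// "paid", missing "payment_history", or missing both.
const irrelevant = {
  $or: [
    { paid: { $exists: false } },
    { payment_history: { $exists: false } },
  ],
};

// In mongosh (mycol being my collection):
// db.mycol.deleteMany(irrelevant);
```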
I am not sure what my approach should be in this situation. Shall I write a map-reduce job to visit every document, perform whatever I need in one shot, and write the resulting documents to a different collection? Or shall I somehow (not sure how, though) split the huge DB into small sections, apply the transforms to each of them, and then merge the resulting records from each section into a final cleaned
record set? Or shall I somehow (not sure how) load the data into Elastic MapReduce or Redshift to operate on it there? In a nutshell, what do you think is the best route to take in such a situation?
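One variant of the "write the results into a different collection" idea that I have been considering is an aggregation with $match and $out instead of map-reduce; the pipeline below is only a sketch, and mycol_cleaned is a placeholder name:

```javascript
// Pipeline that keeps only the documents I care about: those that
// have BOTH a "paid" field and a "payment_history" array.
const keepRelevant = [
  {
    $match: {
      paid: { $exists: true },
      payment_history: { $exists: true },
    },
  },
  // Write the surviving documents into a fresh collection.
  { $out: "mycol_cleaned" },
];

// In mongosh:
// db.mycol.aggregate(keepRelevant, { allowDiskUse: true });
```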
I am sorry in advance if this question sounds a little vague. I tried to explain the real situation as well as I could.
Thanks a lot in advance for help :)
EDIT
Following the comment about sparse indexing, I am now creating a partial index instead, with this command: db.mycol.createIndex({paid: 1}, {partialFilterExpression: {paid: {$exists: true}}})
It is indexing roughly 53 documents per second, so I am not sure how long the entire collection will take. I am leaving it running overnight and will come back tomorrow to update this question. I intend to keep this question updated through the entire journey, for the sake of people who hit the same problem in the same situation in the future.
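A rough back-of-the-envelope check of that rate (the build still has to scan every document, even though only some end up in the partial index), using the two numbers from this question:

```javascript
// Estimate how long the partial index build would take at the
// observed rate. Both figures come from this question; the real
// rate may change as the build progresses.
const totalDocs = 293017368; // documents in the collection
const docsPerSecond = 53;    // observed indexing rate

const seconds = totalDocs / docsPerSecond;
const days = seconds / 86400; // 86,400 seconds per day

console.log(days.toFixed(1)); // prints 64.0
```

So at the current rate the partial index would still take around 64 days, which is barely better than the 80-day estimate for the full index.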