This is my first time making an account on Stack Overflow, so I apologise if what I am asking is really straightforward.
What I want to do: I have a database of 14 million documents of Twitter data that I wish to analyse. I am trying to query only the documents in a specific language and export the result to a smaller collection so that I can actually perform my analysis on it.
My issue: I can't seem to run a full query without MongoDB Compass timing out or running indefinitely. I don't know how to make my database smaller, and I can't run my analysis on the full one without my RAM being exhausted and my computer crashing.
What I have tried:
- I have tried using PyMongo, since Python is the only language I know, but I couldn't find enough documentation, so I got desperate and switched to the Compass GUI.
- I have tried performing my query (a simple one like `{ "language": "en", "user.location": "USA" }`) on a smaller database and exporting the result to reduce the size, and it works! When I try the same thing on my real 32 GB database, it either gives me a timeout error, or, when I increase the max time (ms), it runs forever and I can't export anything.
- I have tried aggregating in MongoDB Compass using `$match` and `$project` on my database, but that also times out, and I can't figure out how to export the result of an aggregation.
Please help me, I am genuinely floored: all my analysis skills are useless because I can't even get at the data because of its sheer size :(
If you have any other tips, e.g. don't use MongoDB, use R or Hadoop for Windows or something, please let me know; at this point I'm willing to teach myself anything if it lets me get a grip on this dataset!
Thank you!