I have the following scenario:
About 70 million pieces of equipment each send a signal to the server every 3 to 5 minutes, containing their id, status (online or offline), IP, location (latitude and longitude), parent node and some other information.
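To put a number on the write load, here is a back-of-the-envelope calculation using only the figures above. It assumes the signals are spread evenly over time, which may not hold in practice (bursts would push the peak higher).

```python
# Rough average signal rate, assuming signals are evenly spread over time.
EQUIPMENT_COUNT = 70_000_000

def signals_per_second(equipment_count: int, interval_minutes: float) -> float:
    """Average signals per second if every device reports once per interval."""
    return equipment_count / (interval_minutes * 60)

fastest = signals_per_second(EQUIPMENT_COUNT, 3)  # every device reports every 3 min
slowest = signals_per_second(EQUIPMENT_COUNT, 5)  # every device reports every 5 min

print(f"{slowest:,.0f} to {fastest:,.0f} signals per second")
# prints "233,333 to 388,889 signals per second"
```

So whatever I pick has to absorb roughly a quarter to a third of a million signals per second on average.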
The other information might not be in a standard format (so I can't rely on a schema), but I still need to query it.
A piece of equipment might disappear for some time (or forever) and send no signals in the meantime, so I need a way to "forget" equipment that has not sent a signal in the last X days. New equipment might also come online at any time.
I need to query all this data, for example to find out how many pieces of equipment are offline in a specific region or within an IP range. There won't be many queries running at the same time.
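To make the kind of query concrete, here is a small sketch in plain Python over a handful of made-up records. The field names (status, ip, lat, lon), the sample values, and the idea of a "region" being a latitude/longitude bounding box are all assumptions for illustration; the real system would need indexes instead of a full scan.

```python
import ipaddress

# Made-up sample records; field names and values are illustrative assumptions.
records = [
    {"id": "a", "status": "offline", "ip": "10.0.0.5", "lat": -23.5, "lon": -46.6},
    {"id": "b", "status": "online", "ip": "10.0.0.9", "lat": -23.6, "lon": -46.7},
    {"id": "c", "status": "offline", "ip": "10.0.1.7", "lat": -22.9, "lon": -43.2},
    {"id": "d", "status": "offline", "ip": "192.168.1.1", "lat": 40.7, "lon": -74.0},
]

def count_offline_in_ip_range(records, cidr):
    """Count offline records whose IP falls inside the given CIDR block."""
    network = ipaddress.ip_network(cidr)
    return sum(
        1
        for r in records
        if r["status"] == "offline" and ipaddress.ip_address(r["ip"]) in network
    )

def count_offline_in_region(records, min_lat, max_lat, min_lon, max_lon):
    """Count offline records inside a latitude/longitude bounding box."""
    return sum(
        1
        for r in records
        if r["status"] == "offline"
        and min_lat <= r["lat"] <= max_lat
        and min_lon <= r["lon"] <= max_lon
    )

print(count_offline_in_ip_range(records, "10.0.0.0/16"))            # prints 2
print(count_offline_in_region(records, -24.0, -23.0, -47.0, -46.0))  # prints 1
```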
Some of the queries need to run fast (less than 3 min per query) while the database is being updated, so I need indexes on the main attributes (id, status, IP, location and parent node). The query results do not need to be 100% accurate; eventual consistency is fine as long as updates do not take too long (more than 20 min on average) to appear in the query results.
I don't need persistence at all; if the power goes out, it's okay to lose everything.
Given all this, I thought of using a NoSQL approach, maybe MongoDB or CouchDB, since I have experience with MapReduce and JavaScript, but I don't know which one is better for my problem (I'm gravitating towards CouchDB) or whether they are fit to handle this massive workload at all. I don't even know if I actually need a "traditional" database, since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.
The main problems I see are the following:
I need to insert/update lots of tuples really fast, and I don't know beforehand whether the equipment that sent a signal is already in the database. Almost all signals will report the same state as last time, so maybe I should query by id, check whether the tuple changed, and update only if it did?
Forgetting offline equipment. A batch job that runs during the night and removes expired tuples would solve this problem.
There won't be many queries running at the same time, but they need to run fast. So I guess I need a cluster that performs a single query on multiple nodes (does CouchDB MapReduce split the workload across multiple nodes of the cluster?). I'm not entirely sure I need a cluster, though; could a single, more expensive machine handle all the load?
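For the first two problems, this is a minimal in-memory sketch of the logic I have in mind, not tied to any particular database. The class and method names are made up for illustration, and the 3-day limit stands in for the unspecified X. While sketching it I noticed that "do nothing" on an unchanged signal probably isn't quite possible: the last-seen timestamp still has to be refreshed, otherwise the cleanup job cannot tell live equipment from dead equipment.

```python
import time

class EquipmentStore:
    """Hypothetical in-memory store keyed by equipment id."""

    def __init__(self, max_silence_seconds):
        self.max_silence_seconds = max_silence_seconds
        self.records = {}  # id -> {"attributes": dict, "last_seen": float}

    def handle_signal(self, equipment_id, attributes, now=None):
        """Insert or update a record; return True if it was new or changed."""
        now = time.time() if now is None else now
        previous = self.records.get(equipment_id)
        changed = previous is None or previous["attributes"] != attributes
        if changed:
            # New equipment or changed state: this is where indexes would update.
            self.records[equipment_id] = {"attributes": attributes, "last_seen": now}
        else:
            # Same state as before: only refresh the timestamp.
            previous["last_seen"] = now
        return changed

    def forget_expired(self, now=None):
        """Batch cleanup: drop records that have been silent too long."""
        now = time.time() if now is None else now
        cutoff = now - self.max_silence_seconds
        expired = [eid for eid, rec in self.records.items() if rec["last_seen"] < cutoff]
        for eid in expired:
            del self.records[eid]
        return len(expired)

store = EquipmentStore(max_silence_seconds=3 * 24 * 3600)  # X = 3 days (arbitrary)
first = store.handle_signal("eq-1", {"status": "online"}, now=0)       # True: new
repeat = store.handle_signal("eq-1", {"status": "online"}, now=240)    # False: unchanged
flipped = store.handle_signal("eq-1", {"status": "offline"}, now=480)  # True: changed
store.handle_signal("eq-2", {"status": "online"}, now=200_000)
removed = store.forget_expired(now=260_000)  # eq-1 was last seen at 480, so it expires
print(first, repeat, flipped, removed)       # prints "True False True 1"
```

If the timestamp refresh is unavoidable, every signal turns into at least one small write, so the check-before-update idea would only save the index maintenance, not the write itself.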
I have never used a NoSQL system before, but I have theoretical knowledge of the subject.