I've got a project where physical sensors send data to the server. Data is sent irregularly - after something activates a sensor, but not less often than every 20 minutes. On the server the data is stored in a PostgreSQL database.
Data structure looks like:
Sensor Table
sensor name - string
sensor serial no - string
sensor type - foreign key to types table
Sensor Data Table
sensor - foreign key
timestamp
value 1 - boolean
value 2 - boolean
value 3 - integer
value 4 - float
...
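The server side is Django (mentioned below), so a minimal sketch of how these tables might map to models could look like this; the concrete field names (value_1, value_2, ...) and the SensorType model are my assumptions, not the real schema:

    from django.db import models

    class SensorType(models.Model):
        # the "types" table referenced by the foreign key above
        name = models.CharField(max_length=100)

    class Sensor(models.Model):
        name = models.CharField(max_length=100)
        serial_no = models.CharField(max_length=100)
        type = models.ForeignKey(SensorType, on_delete=models.CASCADE)

    class SensorData(models.Model):
        sensor = models.ForeignKey(Sensor, on_delete=models.CASCADE)
        timestamp = models.DateTimeField(db_index=True)
        value_1 = models.BooleanField()
        value_2 = models.BooleanField()
        value_3 = models.IntegerField()
        value_4 = models.FloatField()
        # ... further value columns as in the schema above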
It's expected to be no more than 100 requests/second in total. Data records in the database should be persisted for 90 days, and even longer in some cases (not only 2 weeks as I thought earlier). So the total amount of records would be no more than 120 960 000 per 14 days (100 req/s × 86 400 s/day × 14 days). This is a "safe" estimate; in reality it might be 10 times less (10 req/s, 12 096 000 records per 14 days).
I need to do some analysis on the data, like:
- Do something when a new record comes in and its "value 2" is true
- Do something when sensor X's "value 2" has been true for longer than some declared time (50 minutes, 1 hour, or other durations)
- Do something when sensor X's total true time for "value 2" within 24 hours exceeds some declared time (see the sketch after this list)
- Do something when sensor X's "value 3" is true for longer than some declared time and no other sensor of type XYZ was active in this period ...
The "declared time" above is greater than or equal to 1 second.
The whole server part is developed in Django (and django-rest-framework to gather data).
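For context, the ingestion endpoint could then have roughly this shape; the serializer and view names are hypothetical, and run_tests_for is a placeholder hook (see the Update below for what it might do):

    from rest_framework import serializers, viewsets
    from myapp.models import SensorData  # the model sketched above; path assumed

    def run_tests_for(sensor):
        """Hypothetical hook: re-run all analysis rules for this sensor."""
        pass

    class SensorDataSerializer(serializers.ModelSerializer):
        class Meta:
            model = SensorData
            fields = "__all__"

    class SensorDataViewSet(viewsets.ModelViewSet):
        queryset = SensorData.objects.all()
        serializer_class = SensorDataSerializer

        def perform_create(self, serializer):
            record = serializer.save()
            run_tests_for(record.sensor)  # trigger analysis on arrival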
The question is how to do such data analysis efficiently, assuming that there should be real-time or close to real-time (within 1 second) monitoring of the data and of the time periods, in order to trigger the desired actions.
My thoughts:
- Run a process that would query the database every second for records that meet the criteria and call the specific actions (it would probably take more than 1 second).
- Run separate processes (eventlet?), one per analysis type, each querying the database every second and firing the specific actions.
- Run one process per sensor that continuously reports to its subscribers: "I've been true on 'value 2' for longer than x seconds", etc. The process is reset when new data for that sensor arrives. Some publish-subscribe solution like ZeroMQ might be used here (see the sketch after this list).
- Use some other/faster solution:
  - Mongodb - the problem might be that MongoDB's data files are not compacted after old data is removed (at the end of the retention period).
  - Hadoop - isn't it too big and too complex for this class of problems?
  - Pandas and some HDF5 storage - the question is whether it's capable of the analysis I've described above, and there may also be problems with concurrent writes into the files. But... it might work in combination with Mongo too.
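As a sketch of the third idea above, but with a simple timer per sensor instead of a full process-per-sensor/ZeroMQ setup (the class, callback, and delay values are my assumptions):

    import threading

    class SensorWatch(object):
        """Fires callback(sensor_id) once "value 2" has been true for
        `delay` seconds with no newer data arriving in between."""

        def __init__(self, sensor_id, delay, callback):
            self.sensor_id = sensor_id
            self.delay = delay
            self.callback = callback
            self._timer = None

        def on_new_data(self, value_2):
            # any new data resets the pending countdown
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            if value_2:
                self._timer = threading.Timer(
                    self.delay, self.callback, args=[self.sensor_id])
                self._timer.start()

    # usage: alert if sensor 42 stays true on "value 2" for 50 minutes
    watch = SensorWatch(42, 50 * 60,
                        lambda sid: print("sensor %s open too long" % sid))
    watch.on_new_data(True)   # starts the 50-minute countdown
    watch.on_new_data(True)   # resets it
    watch.on_new_data(False)  # cancels it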
Hints?
Update.
Currently, the solution that seems simple and effective to me is:
- after data arrives for sensor A, run all tests, and
- store the test results in some "tests" table (or Redis) in a way that says:
  - today at 1:15 pm run action "sensor open longer than"
  - today at 1:30 pm run action "sensor open longer than in 24h period" ...
- continuously scan the above "tests" table, and when the current time reaches today's 1:15 pm, run the desired action, etc.
- when a new signal arrives for sensor A, run all tests again and also reset that sensor's entries in the "tests" table.
This would require me to fire the tests each time a request arrives for a specific sensor, but on the other hand I'll only have to scan the "tests" table, every 1 second (a sketch of this follows below).
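Here is a minimal sketch of that "tests" table built on a Redis sorted set, with the due time as the score; the key name, the member encoding, and run_action are my assumptions (zadd uses the redis-py >= 3.0 signature):

    import time
    import redis

    r = redis.StrictRedis()
    KEY = "scheduled_actions"  # assumed key name

    def schedule(sensor_id, action, due_ts):
        # member encodes sensor + action; the score is the unix time it's due
        r.zadd(KEY, {"%s:%s" % (sensor_id, action): due_ts})

    def reset(sensor_id, actions):
        # called when new data arrives for the sensor (last bullet above)
        members = ["%s:%s" % (sensor_id, a) for a in actions]
        if members:
            r.zrem(KEY, *members)

    def run_action(member):
        print("firing", member)  # placeholder for the real action

    def scan_loop():
        # the "scan every 1 second" part
        while True:
            now = time.time()
            for member in r.zrangebyscore(KEY, 0, now):
                run_action(member)
                r.zrem(KEY, member)
            time.sleep(1)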
Update 2
I've discovered PyTables (http://www.pytables.org/moin/PyTables); it looks quite well suited for my use case as a data store.
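For reference, a minimal sketch of what the storage could look like with PyTables (the table description, column types, and file name are my assumptions; this uses the PyTables 3.x API):

    import time
    import tables

    class Reading(tables.IsDescription):
        sensor_id = tables.Int32Col()
        timestamp = tables.Float64Col()  # unix time
        value_1 = tables.BoolCol()
        value_2 = tables.BoolCol()
        value_3 = tables.Int32Col()
        value_4 = tables.Float32Col()

    h5 = tables.open_file("sensor_data.h5", mode="w")
    readings = h5.create_table("/", "readings", Reading)

    # append one record
    row = readings.row
    row["sensor_id"] = 42
    row["timestamp"] = time.time()
    row["value_1"] = False
    row["value_2"] = True
    row["value_3"] = 7
    row["value_4"] = 3.14
    row.append()
    readings.flush()

    # query: timestamps where sensor 42 had "value 2" true
    hits = [x["timestamp"] for x in readings.where(
        "(sensor_id == 42) & (value_2 == True)")]
    h5.close()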