I've got a project where physical sensors send data to the server. Data is sent irregularly - after something activates a sensor, but not less often than every 20 minutes. On the server the data is stored in a PostgreSQL database.
Data structure looks like:
Sensor Table
sensor name - string
sensor serial no - string
sensor type - foreign key to types table
Sensor Data Table
sensor - foreign key
timestamp
value 1 - boolean
value 2 - boolean
value 3 - integer
value 4 - float
...
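The server side is Django (mentioned below), so a minimal sketch of how these tables might map to models could look like this; the concrete field names (value_1, value_2, ...) and the SensorType model are my assumptions, not the real schema:

    from django.db import models

    class SensorType(models.Model):
        # the "types" table referenced by the foreign key above
        name = models.CharField(max_length=100)

    class Sensor(models.Model):
        name = models.CharField(max_length=100)
        serial_no = models.CharField(max_length=100)
        type = models.ForeignKey(SensorType, on_delete=models.CASCADE)

    class SensorData(models.Model):
        sensor = models.ForeignKey(Sensor, on_delete=models.CASCADE)
        timestamp = models.DateTimeField(db_index=True)
        value_1 = models.BooleanField()
        value_2 = models.BooleanField()
        value_3 = models.IntegerField()
        value_4 = models.FloatField()
        # ... further value columns as in the schema above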
It's expected to be no more than 100 requests/second in total. Data records in the database should be persisted for 90 days, and even longer in some cases (not only 2 weeks as I thought earlier). So the total amount of records would be no more than 120 960 000 per 14 days (100 req/s × 86 400 s/day × 14 days). This is a "safe" estimate; in reality it might be 10 times less (10 req/s, 12 096 000 records per 14 days).
I need to do some analysis on the data, like:
- Do something when a new record comes in and its "value 2" is true
- Do something when sensor X's "value 2" has been true for longer than some declared time (50 minutes, 1 hour, or other durations)
- Do something when sensor X's total true time for "value 2" within 24 hours exceeds some declared time (see the sketch after this list)
- Do something when sensor X's "value 3" is true for longer than some declared time and no other sensor of type XYZ was active in this period ...
The "declared time" above is greater than or equal to 1 second.
The whole server part is developed in Django (and django-rest-framework to gather data).
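For context, the ingestion endpoint could then have roughly this shape; the serializer and view names are hypothetical, and run_tests_for is a placeholder hook (see the Update below for what it might do):

    from rest_framework import serializers, viewsets
    from myapp.models import SensorData  # the model sketched above; path assumed

    def run_tests_for(sensor):
        """Hypothetical hook: re-run all analysis rules for this sensor."""
        pass

    class SensorDataSerializer(serializers.ModelSerializer):
        class Meta:
            model = SensorData
            fields = "__all__"

    class SensorDataViewSet(viewsets.ModelViewSet):
        queryset = SensorData.objects.all()
        serializer_class = SensorDataSerializer

        def perform_create(self, serializer):
            record = serializer.save()
            run_tests_for(record.sensor)  # trigger analysis on arrival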
The question is how to do such data analysis efficiently, assuming that there should be real-time or close to real-time (within 1 second) monitoring of the data and of the time periods, in order to trigger the desired actions.
My thoughts:
- Run a process that would query the database every second for records that meet the criteria and call the specific actions (it would probably take more than 1 second).
- Run separate processes (eventlet?), one per analysis type, each querying the database every second and firing the specific actions.
- Run one process per sensor that continuously reports to its subscribers: "I've been true on 'value 2' for longer than x seconds", etc. The process is reset when new data for that sensor arrives. Some publish-subscribe solution like ZeroMQ might be used here (see the sketch after this list).
- Use some other/faster solution:
  - Mongodb - the problem might be that MongoDB's data files are not compacted after old data is removed (at the end of the retention period).
  - Hadoop - isn't it too big and too complex for this class of problems?
  - Pandas and some HDF5 storage - the question is whether it's capable of the analysis I've described above, and there may also be problems with concurrent writes into the files. But... it might work in combination with Mongo too.
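As a sketch of the third idea above, but with a simple timer per sensor instead of a full process-per-sensor/ZeroMQ setup (the class, callback, and delay values are my assumptions):

    import threading

    class SensorWatch(object):
        """Fires callback(sensor_id) once "value 2" has been true for
        `delay` seconds with no newer data arriving in between."""

        def __init__(self, sensor_id, delay, callback):
            self.sensor_id = sensor_id
            self.delay = delay
            self.callback = callback
            self._timer = None

        def on_new_data(self, value_2):
            # any new data resets the pending countdown
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            if value_2:
                self._timer = threading.Timer(
                    self.delay, self.callback, args=[self.sensor_id])
                self._timer.start()

    # usage: alert if sensor 42 stays true on "value 2" for 50 minutes
    watch = SensorWatch(42, 50 * 60,
                        lambda sid: print("sensor %s open too long" % sid))
    watch.on_new_data(True)   # starts the 50-minute countdown
    watch.on_new_data(True)   # resets it
    watch.on_new_data(False)  # cancels it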
Hints?
Update.
Currently, the solution that seems simple and effective to me is:
- after data arrives for sensor A, run all tests, and
- store the test results in some "tests" table (or Redis) in a way that says:
  - today at 1:15 pm run action "sensor open longer than"
  - today at 1:30 pm run action "sensor open longer than in 24h period" ...
- continuously scan the above "tests" table, and when the current time reaches today's 1:15 pm, run the desired action, etc.
- when a new signal arrives for sensor A, run all tests again and also reset that sensor's entries in the "tests" table.
This would require me to fire the tests each time a request arrives for a specific sensor, but on the other hand I'll only have to scan the "tests" table, every 1 second (a sketch of this follows below).
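Here is a minimal sketch of that "tests" table built on a Redis sorted set, with the due time as the score; the key name, the member encoding, and run_action are my assumptions (zadd uses the redis-py >= 3.0 signature):

    import time
    import redis

    r = redis.StrictRedis()
    KEY = "scheduled_actions"  # assumed key name

    def schedule(sensor_id, action, due_ts):
        # member encodes sensor + action; the score is the unix time it's due
        r.zadd(KEY, {"%s:%s" % (sensor_id, action): due_ts})

    def reset(sensor_id, actions):
        # called when new data arrives for the sensor (last bullet above)
        members = ["%s:%s" % (sensor_id, a) for a in actions]
        if members:
            r.zrem(KEY, *members)

    def run_action(member):
        print("firing", member)  # placeholder for the real action

    def scan_loop():
        # the "scan every 1 second" part
        while True:
            now = time.time()
            for member in r.zrangebyscore(KEY, 0, now):
                run_action(member)
                r.zrem(KEY, member)
            time.sleep(1)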
Update 2
I've discovered PyTables (http://www.pytables.org/moin/PyTables); it looks quite well suited for my use case as a data store.
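For reference, a minimal sketch of what the storage could look like with PyTables (the table description, column types, and file name are my assumptions; this uses the PyTables 3.x API):

    import time
    import tables

    class Reading(tables.IsDescription):
        sensor_id = tables.Int32Col()
        timestamp = tables.Float64Col()  # unix time
        value_1 = tables.BoolCol()
        value_2 = tables.BoolCol()
        value_3 = tables.Int32Col()
        value_4 = tables.Float32Col()

    h5 = tables.open_file("sensor_data.h5", mode="w")
    readings = h5.create_table("/", "readings", Reading)

    # append one record
    row = readings.row
    row["sensor_id"] = 42
    row["timestamp"] = time.time()
    row["value_1"] = False
    row["value_2"] = True
    row["value_3"] = 7
    row["value_4"] = 3.14
    row.append()
    readings.flush()

    # query: timestamps where sensor 42 had "value 2" true
    hits = [x["timestamp"] for x in readings.where(
        "(sensor_id == 42) & (value_2 == True)")]
    h5.close()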