
I have a device whose only job is to read data from multiple sensors every second. Every hour (or whenever a pull is requested) it has to push the collected data to a database. Doing this for 30 days makes the data very large. I would like to compress the data before sending it to the database (on a different machine) over the network, since space and computation time on the database are precious resources.

The data will look something like this:

TimeStamp | Sensor1 | Sensor2 | Sensor3 | Sensor4 | . . . . . | Sensor64
 00:00:01 |    1    |    0    |    0    |    0    |           |    3
 00:00:02 |    1    |    8    |    0    |    0    |           |    3
 00:00:03 |    1    |    8    |    0    |    0    |           |    3
 00:00:04 |    1    |    2    |    0    |    0    |           |    3
 00:00:05 |    0    |    8    |    0    |    0    |           |    3
 00:00:06 |    0    |    8    |    0    |    0    |           |    3
 00:00:07 |    0    |    0    |    0    |    0    |           |    3
 00:00:08 |    0    |    0    |    0    |    0    |           |    3
 00:00:09 |    0    |    0    |    0    |    0    |           |    3
 00:00:10 |    1    |    2    |    0    |    0    |           |    3

There will most definitely be times when the data gets repetitive (e.g. timestamps 00:00:07-00:00:09 and 00:00:02-00:00:03 above), and I would like a way to compress that portion before the database stores it. When a webpage/app pulls the data, the webpage/app will then decompress it to be graphed for the user.

The planned database is MongoDB (but I am open to using other databases).

What I came up with is to delete the items that repeat; when the front end sees that there are missing timestamps, it is understood that the row before the missing timestamps has simply repeated.

TimeStamp | Sensor1 | Sensor2 | Sensor3 | Sensor4 | . . . . . | Sensor64
 00:00:01 |    1    |    0    |    0    |    0    |           |    3
 00:00:02 |    1    |    8    |    0    |    0    |           |    3

 00:00:04 |    1    |    2    |    0    |    0    |           |    3
 00:00:05 |    0    |    8    |    0    |    0    |           |    3

 00:00:07 |    0    |    0    |    0    |    0    |           |    3



 00:00:10 |    1    |    2    |    0    |    0    |           |    3

But this solution is not reliable, and it only pays off when there are long runs of repeating data (a sketch of the idea follows below).
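For reference, here is a minimal sketch of that idea in Python; the function names and the (second, values) row shape are my own placeholders, not an existing schema. The sender drops any row whose 64 values match the previously kept row, and the reader re-expands missing seconds by repeating the last stored row.

    def drop_repeats(rows):
        # rows: list of (second, values) where values is the list of 64 readings.
        # Keep a row only if its values differ from the previously kept row.
        kept, prev = [], None
        for second, values in rows:
            if values != prev:
                kept.append((second, values))
                prev = values
        return kept

    def expand(kept, first_second, last_second):
        # Rebuild one row per second; a missing second repeats the last kept row.
        by_second = dict(kept)
        out, current = [], None
        for second in range(first_second, last_second + 1):
            current = by_second.get(second, current)
            out.append((second, current))
        return out

Because the whole row has to match, a change in any single sensor breaks the run, which is why this only helps when many consecutive rows are identical.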

Is there a way to take the raw data and compress it at the bit level, similar to how zip/rar works, such that the front end will also be able to decompress it?
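As a concrete sketch of what I have in mind (Python on the device side with the standard struct and zlib modules; the names and framing are placeholders): pack each second's 64 int16 readings into fixed-width binary and compress the whole hour, then reverse both steps on the pulling side.

    import struct
    import zlib

    def pack_hour(rows):
        # rows: list of 64-element lists of int16 readings, one entry per second.
        # Pack as little-endian int16 (128 bytes per second) and zlib-compress.
        raw = b"".join(struct.pack("<64h", *values) for values in rows)
        return zlib.compress(raw, 9)

    def unpack_hour(blob):
        # Inverse of pack_hour: decompress, then split back into 64-value rows.
        raw = zlib.decompress(blob)
        return [list(struct.unpack("<64h", raw[i:i + 128]))
                for i in range(0, len(raw), 128)]

On the browser side the same blob could be inflated with something like pako before graphing, so the front end never needs to know how the database stored it.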

Just a raw calculation: each of those sensors spits out a 16-bit integer.

16 bits x 64 sensors x 2 628 000 seconds in a month = 2 691 072 000 bits ≈ 336 MB

A few hundred MB on a single pull is still crazy big.
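A quick sanity check of that raw figure in plain Python (no assumptions beyond the numbers above):

    bytes_per_second = 64 * 2        # 64 sensors, 2 bytes per 16-bit reading
    seconds_per_month = 2_628_000    # roughly 30.4 days
    raw_bytes = bytes_per_second * seconds_per_month
    print(raw_bytes)                 # 336384000 bytes, about 336 MB of raw values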

  • Unless you are using a very specialized database (i.e. not MongoDB as you're alluding to), you're _not_ going to be able to achieve that sort of storage density to begin with. – AKX Nov 01 '19 at 22:19
  • You could write it to a stream, and compress the stream on the way through using something like gzip – Mikkel Nov 01 '19 at 22:20
  • The real question is what do you need to do with the data? If you really need to have second-by-second values from all sensors, there's not much you can do but to send and store everything. However, if all you need is, say, min/max/average per minute, there are databases designed for aggregation like that. – AKX Nov 01 '19 at 22:22
  • There are a lot of possibilities. I've dealt with something similar when storing Ignition sensor data. In my case, there isn't a need to retain each and every sensor value. In some cases, an average per minute is sufficient. In other cases, I only need to know when a sensor value changes. What you choose depends on how you need to use the data. (and there are other options besides the two I mentioned) – devlin carnate Nov 01 '19 at 22:23
  • Unfortunately I need the second-by-second data from the sensors since it will be used for recreating the data later on. What I am really trying to avoid is that large chunk of data being pulled; downloading that much data just takes time when you are not on the same network as your server – Scarlet Nov 01 '19 at 22:37
  • @Mikkel you might be on to something, I will look into how efficiently it compresses the data. – Scarlet Nov 01 '19 at 22:40
  • What's the value range of the sensors? – Jonas Wilms Nov 01 '19 at 23:15
  • @JonasWilms 16-bit signed integer, so -32 768 to +32 767 (I might be off by 1 number) – Scarlet Nov 02 '19 at 12:13
  • You appear to be interested in two data-reduction components: the first is reducing the volume of streaming data and the second is reducing the volume of data that gets stored in the database. What are your query requirements on the data once it's stored: do you need to be able to query for the detail of a specific second in time, or is it sufficient to find all of the data for the minute in which a second occurs? Also, this link talks about compressing streams: [link](https://stackoverflow.com/questions/44478254/does-any-mainstream-compression-algorithm-natively-support-streaming-data). – djhallx Nov 14 '19 at 03:32

0 Answers