I need to downsample, or "thin", a large data set: a 275 MB CSV file containing 10 million data points. Each data point is a time,voltage pair formatted (for example) as follows:
3.387903904E-04,2.68E-02
or, 338.79 usec, 0.0268 volts
The data were generated during a measurement on an oscilloscope. Following is a screen shot of the full measurement:
The duration of the full measurement pictured above is 2 msec. Time is the horizontal axis, voltage is the vertical axis. The oscilloscope recorded a measurement every 0.2 nsec (2 × 10⁻¹⁰ sec); thus, 10⁷ data points in the full measurement. The voltage ranged from a nominal 0 volts to a nominal +5 volts.
I need to reduce the size of this data set "substantially" (99.9% or more) without losing the prominent features. Rather than obfuscate this question with a lot of information that may be irrelevant to a proposed downsampling/thinning approach, I will let the comments determine what information is required.
However, I will add this: Referring to the Figure below, the data points leading up to the trigger may be represented by only two (2) data points: one at the origin, the other (approx) 200 usec later at the "trigger point". The data set contains 10⁶ data points during this interval. Similarly, there are other "long" periods of near-zero change in voltage that may be represented by only two (2) data points.
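As a quick check on the numbers quoted above, the arithmetic can be written out as follows. The symbols are labels introduced here for illustration only: $T$ is the record length, $\Delta t$ the sample interval, $N$ the total sample count, $N_{\text{pre}}$ the pre-trigger sample count, and $N_{\text{keep}}$ the number of points that may remain after a 99.9% reduction.

$$
N = \frac{T}{\Delta t} = \frac{2\times10^{-3}\ \text{s}}{2\times10^{-10}\ \text{s}} = 10^{7},
\qquad
N_{\text{pre}} = \frac{200\times10^{-6}\ \text{s}}{2\times10^{-10}\ \text{s}} = 10^{6},
\qquad
N_{\text{keep}} \le (1 - 0.999)\times 10^{7} = 10^{4}
$$

So a reduction of 99.9% or more means keeping at most about 10,000 of the 10 million points, and the pre-trigger interval alone accounts for one tenth of the file.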
As my original question was CLOSED because it was judged to be requesting an opinion, I have completely overhauled and re-written this question in a bid to have it RE-OPENED. And in a further effort to avoid the opinion label, my question is now as follows:
How would I go about using awk or mlr to downsample or "thin" the dataset described above? Some notes follow that may or may not be pertinent to a proposed answer.
NOTES:
I looked at two algorithms for downsampling: RDP (Ramer-Douglas-Peucker) and VW (Visvalingam-Whyatt). It's not clear to me that these algorithms will "thin" the data set sufficiently, but I'll happily consider an answer based on either of these algorithms, or any others.
It will be apparent to most, but just in case: A useful answer must capture not only "low-to-high" (0 volt-to-5 volt) transitions, but "high-to-low" (5 volt-to-0 volt) transitions as well.
A simplistic threshold approach may not be a useful answer because, during the transition from "low-to-high" or "high-to-low", there may be many intermediate steps of varying voltage values. Some sort of "look-ahead" function may be required to determine the next data point to be included in the "thinned" data set.
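To make the "two points per flat interval" idea concrete, here is a minimal awk sketch of one possible starting point. It is a plain dead-band filter, not RDP or VW, and it has not been run against the actual scope data. The function name `thin`, the tolerance variable `tol`, and its default of 0.5 volts are all assumptions chosen for illustration. Instead of looking ahead, it remembers the previous line, so that when the voltage finally moves by more than `tol` it can emit the last point of the flat run followed by the new point.

```shell
#!/bin/sh
# Hypothetical helper: thin [tol] < in.csv > out.csv
# tol is an assumed dead-band in volts (default 0.5), not a value from the data.
thin() {
  awk -F, -v tol="${1:-0.5}" '
    function abs(x) { return x < 0 ? -x : x }

    # Always keep the first sample as the reference point.
    NR == 1 { print; kept_v = $2; prev = $0; prev_printed = 1; next }

    {
      if (abs($2 - kept_v) > tol) {
        # Voltage left the dead-band: close the flat run with its last
        # sample (if not already printed), then keep the new sample.
        if (!prev_printed) print prev
        print
        kept_v = $2
        prev_printed = 1
      } else {
        # Still inside the dead-band: skip, but remember the line.
        prev_printed = 0
      }
      prev = $0
    }

    # Keep the final sample so the last flat run has an end point.
    END { if (NR > 0 && !prev_printed) print prev }
  '
}

# Example call (file names are placeholders):
#   thin 0.5 < scope.csv > scope_thinned.csv
```

With this approach a long flat interval collapses to its first and last samples, while a slow ramp yields one kept point each time the voltage drifts more than `tol` from the last kept value. Whether that reaches a 99.9% reduction depends on the noise level relative to `tol`, which would need to be tuned against the real file.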