
I need to downsample, or "thin", a large data set: a 275 MB CSV file consisting of 10 million data points. Each data point is a time,voltage pair formatted (for example) as follows:

3.387903904E-04,2.68E-02

or, 338.79 usec, 0.0268 volts

The data were generated during a measurement on an oscilloscope. Following is a screen shot of the full measurement:

[screenshot of the full measurement]

The duration of the full measurement pictured above is 2 msec. Time is the horizontal axis, voltage is the vertical axis. The oscilloscope recorded a measurement every 0.2 nsec (2 × 10^-10 sec); thus, 10^7 data points in the full measurement. The voltage ranged from a nominal 0 volts to a nominal +5 volts.

I need to reduce the size of this data set "substantially" (99.9% or more) without losing the prominent features. Rather than obfuscate this question with a lot of information that may be irrelevant to a proposed downsampling/thinning approach, I will let the comments determine what information is required.

However, I will add this: referring to the figure below, the data points leading up to the trigger may be represented by only two (2) data points: one at the origin, the other approximately 200 usec later at the "trigger point". The data set contains 10^6 data points during this interval. Similarly, there are other "long" periods of near-zero change in voltage that may be represented by only two (2) data points; a rough sketch of this idea follows the figure.

[figure showing the interval leading up to the trigger point]
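To make the idea concrete, here is a rough, untested awk sketch of what I mean by collapsing a near-constant run of points to its first and last points. The 0.05 volt band is an arbitrary illustrative tolerance, not a number taken from the measurement:

  # Collapse each run where the voltage stays within +/- band (volts) of a
  # reference value to its first and last points; band=0.05 is an arbitrary
  # illustrative tolerance.
  awk -v band=0.05 -F, '
  NR == 1 { print; ref = $2; prev = ""; next }
  {
      d = $2 - ref
      if (d > band || d < -band) {     # voltage left the current band
          if (prev != "") print prev   # last point of the flat run
          print                        # the point that broke out
          ref = $2; prev = ""          # start a new band at the breakout voltage
      } else {
          prev = $0                    # remember the latest in-band point
      }
  }
  END { if (prev != "") print prev }   # close out the final flat run
  ' data.csv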

As my original question was CLOSED because it was judged to be requesting an opinion, I have completely overhauled and re-written this question in a bid to have it RE-OPENED. And in a further effort to avoid the opinion label, my question is now as follows:

How would I go about using awk or mlr to downsample or "thin" the dataset described above? Some notes follow that may or may not be pertinent to a proposed answer.


NOTES:

  1. I looked at two algorithms for downsampling: RDP (Ramer-Douglas-Peucker) and VW (Visvalingam-Whyatt). It's not clear to me that these algorithms will "thin" the data set sufficiently, but I'll happily consider an answer based on either of these algorithms - or any others.

  2. It will be apparent to most, but just in case: A useful answer must capture not only "low-to-high" (0 volt-to-5 volt) transitions, but "high-to-low" (5 volt-to-0 volt) transitions as well.

  3. A simplistic threshold approach may not be a useful answer because:

During the transition process from "low-to-high" or "high-to-low", there may be many intermediate steps of varying voltage values. Some sort of "look-ahead" function may be required to determine the next data point to be included in the "thinned" data set.

  • Looks like you can choose the parameters for RDP or VW to get the desired number of points – llllvvuu Dec 31 '20 at 05:59
  • @LawrenceWu: Are you referring to the *epsilon* parameter? It seems that would *influence* the number of data points, but not set it to an absolute value. Or am I missing something? –  Dec 31 '20 at 17:59
  • There are many ways to take random samples from a CSV file. You can use the Linux shell command `shuf`, or you can read your data into a Pandas dataframe and call `sample()`. Is there anything specific about your data that necessitates a specialized algorithm? – stackoverflowuser2010 Dec 31 '20 at 19:32
  • @stackoverflowuser2010: Nothing except that *random* sampling strikes me as inappropriate for this data set. I need to reduce 10 million samples of a non-linear waveform to about 5,000 - a 99.95% reduction. Maybe I could get something useful, but maybe not. A *threshold-based* scheme makes sense. And the RDP & VW algorithms make sense. "Random" would, by definition, discard some features that I need to retain. –  Dec 31 '20 at 23:34
  • Yes, for VW you can pick exactly the number of points, and with RDP you can tune epsilon (for example by binary search) – llllvvuu Jan 01 '21 at 19:54
  • FWIW: I've made 3 attempts to have the erroneous "opinion-based" flag removed, so I'll give up. But for anyone who's interested, [I did get an answer here - on U&L SE.](https://unix.stackexchange.com/questions/627800/can-awk-sum-a-column-over-a-specified-number-of-lines) – Jan 12 '21 at 19:46
  • This is sad. It seems like this question should be open. I didn't look for the original question that may have solicited an opinion, but this one does not seem to do so. SO needs to have an "appeal" button. – Mr. Lance E Sloan Jun 25 '21 at 00:55
  • @Seamus: I tried to make an appeal to SO moderators by flagging the question. They declined to help. I think you cannot get this question reopened, even though you changed it to no longer be opinion-based. If you're still interested in getting answers, I think the only way to do it would be to open a new question with the same content. Good luck! – Mr. Lance E Sloan Jul 01 '21 at 01:20
  • @Mr.LanceESloan: Thanks for your effort, but I decided months ago this was a waste of time. As I mentioned in my earlier comment, I got a good answer in the U&L forum. I also appreciated Lawrence Wu's comments, but not being a mathematician, I worried about VW maintaining the voltages as a **function** of time - as opposed to a line on a map - which obviously needn't be a function. –  Jul 01 '21 at 05:53

1 Answer


Here is one way to implement a random sample:

awk -v r=0.005 '/.*/ { if(rand() < r) { print } }' data.csv
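Note that most awk implementations return the same pseudo-random sequence on every run unless the generator is seeded; calling srand() with no argument (which seeds from the time of day) should give a different sample each run:

awk -v r=0.005 'BEGIN { srand() } rand() < r { print }' data.csv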

The following command prints the first line, then any subsequent line whose voltage differs from the previous line's voltage by at least delta volts (capturing both low-to-high and high-to-low transitions):

 awk -v delta=0.5 -v FS=, 'NR == 1 { print; last=$2 } NR > 1 { if(last - $2 <= -delta || last - $2 >= delta) { print }; last=$2 }' data.csv

With the example data:

  1,1 # low
  2,1 # filter out
  3,2 # high
  4,2 # filter out
  5,1 # low
  6,1 # filter out

Here is the result that I am getting (low to high, and high to low transitions):

   awk -v delta=1 ... data.csv
   1,1 # low
   3,2 # high
   5,1 # low
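If you would rather compare each voltage against the last printed voltage (so that a slow transition made of many small intermediate steps still produces a point once the cumulative change reaches delta), an untested variant would be:

awk -v delta=0.5 -v FS=, 'NR == 1 { print; kept = $2; next } { d = $2 - kept; if (d >= delta || d <= -delta) { print; kept = $2 } }' data.csv

With the example data above and delta=1 it prints the same three lines.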
Allan Wind
  • There's nothing wrong with asking for an opinion. If you don't have one, then just say you don't know. – stackoverflowuser2010 Dec 31 '20 at 19:27
  • There are explicit close statuses for requests for opinions. – Allan Wind Dec 31 '20 at 21:35
  • Everything is an opinion. The answer you gave with the `awk` command is your opinion. – stackoverflowuser2010 Dec 31 '20 at 22:05
  • You are entitled to believe whatever you want. These are probably the most relevant links to support my case that it's off-topic: https://stackoverflow.com/help/dont-ask and https://stackoverflow.com/help/on-topic – Allan Wind Dec 31 '20 at 22:29
  • "Off-topic" - or "requesting an opinion"? –  Jan 01 '21 at 21:47
  • I appreciate the effort that went into this answer, but it seems to have a significant flaw: It recognizes "low-to-high" transitions in the data, but cannot recognize a "high-to-low" transition. –  Jan 02 '21 at 23:56
  • I tweaked the code to use <= and >= instead of < and > to make the example values easier to read. Then I ran the example data with a low-high-low transition and it seems to give the expected result. – Allan Wind Jan 03 '21 at 05:33