
I have a complex CSV file (here is an external link, because even a small part of it wouldn't look nice on SO) where a particular column may be composed of several values separated by spaces.

reset,angle,sine,multiStepPredictions.actual,multiStepPredictions.1,anomalyScore,multiStepBestPredictions.actual,multiStepBestPredictions.1,anomalyLabel,multiStepBestPredictions:multiStep:errorMetric='altMAPE':steps=[1]:window=1000:field=sine,multiStepBestPredictions:multiStep:errorMetric='aae':steps=[1]:window=1000:field=sine
int,string,string,string,string,string,string,string,string,float,float
R,,,,,,,,,,
0,0.0,0.0,0.0,None,1.0,0.0,None,[],0,0
0,0.0314159265359,0.0314107590781,0.0314107590781,{0.0: 1.0},1.0,0.0314107590781,0.0,[],100.0,0.0314107590781
0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202    0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647
0,0.0942477796077,0.0941083133185,0.0941083133185,{0.03141075907812829: 1.0},1.0,0.0941083133185,0.0314107590781,[],66.63923621,0.0418293579232
0,0.125663706144,0.125333233564,0.125333233564,{0.06279051952931337: 0.98942669172932329     0.03141075907812829: 0.010573308270676691},1.0,0.125333233564,0.0627905195293,[],59.9506102238,0.0470076969512
0,0.157079632679,0.15643446504,0.15643446504,{0.03141075907812829: 0.0040463956041429626     0.09410831331851431: 0.94917381047888194    0.06279051952931337: 0.046779793916975114},1.0,0.15643446504,0.0941083133185,[],53.2586756624,0.0500713879053
0,0.188495559215,0.187381314586,0.187381314586,{0.12533323356430426: 0.85789473684210527     0.09410831331851431: 0.14210526315789476},1.0,0.187381314586,0.125333233564,[],47.5170631454,0.0520675034246

For viewing I am using this trick, an upgraded version of the one borrowed from Command line CSV viewer:

column -s,$'\t' -t < *.csv | less -#2 -N -S

With this trick it is explicitly clear which is the 1st, 2nd, 3rd, ... column, and which data in a particular column are composed of several space-separated values.

My question is whether there is any trick for manipulating such a complex CSV. I know that I can use awk to extract the 5th column, and then from that extracted column filter the 2nd sub-column to get the desired portion of the composed data. But I would have to check whether there is another composed column before the 5th (in which case I would actually need the 6th column, not the 5th, and so on), and some columns may also contain a mix of composed and non-composed data. So awk alone is probably not the right tool.

The CSV viewer link mentions a tool called csvlook, which adds pipes to its output as separators. This could be easier to filter, because the pipes would delimit columns while whitespace would delimit the composed data within a single column. But I cannot run csvlook with multiple delimiters (comma and tab) as I did with column, so it did not render the data properly. What is the most comfortable way of handling this?
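A sketch of the closest I can get with csvlook (which comes from the csvkit suite, and whose -d option accepts only a single delimiter character; file.csv stands in for the linked file):

csvlook -d, file.csv | less -#2 -N -S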

Wakan Tanka
  • It's not exactly clear how you want to analyze it. But if I do `cut -f5 -d,` on your file, I see the 5th column (multiStepPredictions.1) for this data. – Kedar Mhaswade May 18 '15 at 01:42
  • As an aside: the `$'\t'` in `column -s,$'\t' -t < *.csv | less -#2 -N -S` has no effect, because there are no tabs in the input file. – mklement0 May 18 '15 at 23:06

2 Answers


As long as your input doesn't contain columns with escaped, embedded `,` characters, you should be able to parse it with `awk`, using `,` as the field separator; e.g.:

awk -F, '{ n = split($5, subField, "[[:blank:]]+"); for (i=1;i<=n;++i) print subField[i] }' file.csv

The above splits the 5th field into sub-fields by whitespace, using the split() function, and prints each sub-field on its own line.
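If you only need one particular sub-field rather than all of them, you can index the result of split() directly; a minimal sketch along the same lines (picking the 2nd sub-field is just an example):

awk -F, '{ n = split($5, subField, "[[:blank:]]+"); if (n >= 2) print subField[2] }' file.csv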

mklement0
  • And what if it did contain an escaped `,`? – Wakan Tanka May 18 '15 at 21:25
  • What would such an escaped `,` look like? Optional use of _double-quoted_ field values, which are then free to contain embedded commas? Or `\ `-escaping? The right approach depends on the answer. – mklement0 May 18 '15 at 21:32
  • I thought that there are some utils that can handle both cases. – Wakan Tanka May 18 '15 at 21:35
  • I know that various scripting languages have CSV-parsing modules, which should at least be able to handle _double-quoted_ fields; are you open to using Perl, Ruby, or Python? If so, do you have a preference? Do note, however, that your linked CSV file contains neither `"` nor `\ ` instances. – mklement0 May 18 '15 at 21:38
  • It was mentioned just as an example of how complex a CSV can be. Python and Perl are fine, but I would prefer some command-line tool, something like `csvgrep`, if such a thing exists. In other words, something that is aware of escaping, quoting, etc., so that I don't need to handle this manually, as I would with `awk` using `,` as the separator. – Wakan Tanka May 18 '15 at 22:25
  • Your CSV parsing needs involve _custom_ logic: further parsing the _internals_ of a CSV field, which is _incidental to CSV parsing itself_; I'm not sure any standard tool will help you with that. If there are no embedded `,` instances, you might as well extend my sample `awk` script. – mklement0 May 18 '15 at 23:04
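Following up on that comment thread: should double-quoted fields with embedded commas ever appear, GNU awk (gawk) can define fields by their content rather than by a separator, via its FPAT variable. A minimal sketch, assuming gawk 4.0+ and no newlines embedded inside quoted fields:

gawk -v FPAT='([^,]+)|("[^"]+")' '{ print $5 }' file.csv

Note that this simple pattern does not handle empty fields or escaped quotes within quoted fields; it is a sketch, not a full CSV parser.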

Take a look at the `cut` command. You can specify a list of fields, or a range of fields.
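For instance, a quick sketch against the file above (assuming it is named file.csv and contains no escaped embedded commas):

cut -d, -f5 file.csv       # the 5th column only
cut -d, -f1-3,5 file.csv   # fields 1 through 3, plus field 5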

Daniel777
  • Does cut offer something that awk doesn't? – Wakan Tanka May 18 '15 at 05:25
  • `cut` is simpler, but far less powerful. The most notable limitation of `cut` is that it only ever recognizes a _single_ instance of a delimiter (separator) character as separating fields (multiple contiguous instances are invariably interpreted as separating _empty_ fields). Thus, in the case at hand, you _could_ use it to extract CSV fields (assuming there are no escaped embedded commas), but you _couldn't_ use it (easily) to split the 5th column into tokens by whitespace, because the amount of whitespace varies. – mklement0 May 18 '15 at 21:35
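A quick way to see that limitation at a prompt, using a hypothetical input containing two consecutive spaces:

echo 'a  b' | cut -d' ' -f2         # prints an empty line: field 2 lies between the two spaces
echo 'a  b' | awk '{ print $2 }'    # prints "b": awk collapses runs of whitespace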