GNU parallel + sed to edit both CSV header and contents

Question

I'm trying to use command line tools to edit some CSV I have in the following format for several year folders:

dataset
- year_1 (i.e. 1929)
  - csv_filename_1.csv
  - csv_filename_2.csv
  - csv_filename_3.csv
  - ...
- year_2
  - ...

I'm trying to append each file's name to its contents by creating a new column called filename whose value (e.g. ./year_1/csv_filename_1.csv) is added to every row. After that, I would gzip the file.

Due to the number of year folders (almost 100) and the number of CSVs in each (totaling 100k+), I plan to use GNU parallel to run it. I was trying to use sed, doing something like

fname="1929/csv_filename_1.csv" &&          \ # to simulate parallel's parameterization
    sed -E -e '1s/$/,filename/'             \ # append ",filename" to CSV header
           -e '2,\$s/$/,${fname}/' ${fname} \ # append the filename string to the content

But I can't get the second sed expression to work: I either get "${fname}" written as-is to the file, or the sed error "sed: -e expression #1, char 6: unknown command: '\'", which seems to complain about a comma or the slash. I have also tried grouping the expressions, like -e '1{s/$/,filename/};2,\${s/$/,${fname}/}', to no avail.

For now I have given up on sed and started trying awk, but not knowing why the sed version fails is bothering me, so I came to ask why and how to make it work.

Just one more piece of info regarding how I intend to run this thing. It would be something like

find ~/dataset -iname "*csv" -print0 | parallel -0 -j0 '<the whole command here (sed + gz)>'

How could I do this? What am I forgetting? Thanks, folks!

PS: I just got it with awk

awk -v d="csv_filename_1.csv" -F"," 'FNR==1{a="filename"} FNR>1{a=d} {print $0","a}' csv_filename_1.csv | less

"Something like"? Two comments: 1) "expression #1, char 6" would indict the first `-e` command. 2) However, it is expression 2 that would seemingly generate a complaint about char 6, i.e. `sed -n '2,\$ p'` gets `sed: -e expression #1, char 6: unterminated address regex` (because there is no reason to escape the `$` there). Net: if you want the sed explained, you need to show exactly what causes the error. – stevesliva, Nov 01 '21 at 17:58
Yep, there might be some copy-paste issues, since I took it from an intermediate attempt. I was really frustrated at that time. – paulochf, Nov 02 '21 at 03:11

score 3 · Accepted Answer · answered Nov 01 '21 at 17:25

This might work for you (GNU parallel and sed):

find . -type f -name '*.csv' | parallel sed -i \''1s/$/,filename/;1!s#$#,{}#'\' {}

Use find to deliver the filename to the parallel command.

Use sed to append ,filename to the header (line 1) of each file, and the file name that parallel substitutes for {} to every other line (1! matches all lines except the first).
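To connect this back to the attempt in the question: inside single quotes the shell does not expand ${fname}, and the backslash in 2,\$ reaches sed unchanged, where a backslash followed by a character appears to be read as the start of a regex address with a custom delimiter. The sketch below is one way to run the same edit on a single file with the name held in a shell variable instead of parallel's {}. The scratch directory, the file name sample.csv, and its rows are made up for the demonstration, and GNU sed is assumed for -i.

```shell
#!/bin/sh
set -eu

# Hypothetical sample data in a scratch directory (not from the question).
workdir=$(mktemp -d)
mkdir -p "$workdir/1929"
printf 'a,b\n1,2\n3,4\n' > "$workdir/1929/sample.csv"

cd "$workdir"
fname="./1929/sample.csv"

# The sed text stays in single quotes so that $ reaches sed untouched;
# only the file name is spliced in via a double-quoted "$fname".
# '#' is the s delimiter, so the slashes in the path need no escaping.
sed -i -e '1s/$/,filename/' -e '2,$s#$#,'"$fname"'#' "$fname"

result=$(cat "$fname")
printf '%s\n' "$result"
```

The header line gains ,filename and each data line gains ,./1929/sample.csv. A name containing #, & or a backslash would still need escaping before being spliced into the replacement.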

N.B. The second sed command uses alternative delimiters, s#...#...#, so that the slashes in the file name do not clash with the delimiter. Also, find should be run from the dataset directory.
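The question also mentions gzip and null-delimited file names. A possible sketch of the whole pipeline follows. It moves the per-file work into a small helper script, so the file name arrives as a positional argument rather than being interpolated into nested quotes. The helper name add_filename.sh, the scratch dataset, and the choice to write a .gz copy while leaving the original in place are all assumptions for illustration, not part of the original answer. If GNU parallel is not installed, the sketch falls back to a sequential find -exec.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch dataset standing in for ~/dataset (two tiny CSVs).
dataset=$(mktemp -d)
mkdir -p "$dataset/1929" "$dataset/1930"
printf 'a,b\n1,2\n' > "$dataset/1929/one.csv"
printf 'a,b\n3,4\n' > "$dataset/1930/two.csv"

# Hypothetical helper: takes one CSV path as $1, appends the column, and
# writes a compressed copy next to it. The original file is left untouched.
helper="$dataset/add_filename.sh"
cat > "$helper" <<'EOF'
#!/bin/sh
f=$1
sed -e '1s/$/,filename/' -e '2,$s#$#,'"$f"'#' "$f" | gzip > "$f.gz"
EOF

cd "$dataset"
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # -print0 / -0 keep unusual file names intact; {} is passed as one argument.
    find . -type f -name '*.csv' -print0 | parallel -0 sh "$helper" {}
else
    # Sequential fallback when GNU parallel is not available.
    find . -type f -name '*.csv' -exec sh "$helper" {} \;
fi

first=$(gzip -dc ./1929/one.csv.gz)
second=$(gzip -dc ./1930/two.csv.gz)
printf '%s\n' "$first" "$second"
```

Streaming sed's output straight into gzip avoids rewriting each file in place before compressing it; whether to delete the uncompressed originals afterwards is left to the reader.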

potong


Wow! I was aware of the "!" operator but didn't think of using it that way. And regarding the alternative delimiter, that is just wow. Where can I read more about it? – paulochf Nov 02 '21 at 03:45
@paulochf enter `[sed]` in search box of stackoverflow and choose `learn more` – potong Nov 02 '21 at 11:43
