3

The other day I asked how to wrap values of the first line of a csv file with quotations. I was given this reply which worked great.

$ cat file.csv  
word1,word2,word3,word4,word5  
12345,12346,12347,12348,12349  

To put quotes around the items in the first line only:

$ sed '1 { s/^/"/; s/,/","/g; s/$/"/ }' file.csv  
"word1","word2","word3","word4","word5"  
12345,12346,12347,12348,12349 

I now need to test if the quotes exist around the values to eliminate chances of double quoting values.

Vituvo
  • 1,008
  • 1
  • 9
  • 29

5 Answers5

3

Change each of the substitutions to include optional quotes:

sed -E '1 { s/^"?/"/; s/"?,"?/","/g; s/"?$/"/ }' file.csv

I have added -E to enable extended mode, so that ? is understood to mean "0 or 1 match".

You could also keep on using basic mode (no -E) and replace each ? with either \{0,1\} (again, 0 or 1 match) or * (which matches 0 or more).

Tom Fenech
  • 72,334
  • 12
  • 107
  • 141
  • Tom this command suits me best. If the values were missing the quotes and if some of them had it, they were skipped so first lines like `word1,"word2","word3","word4",word5` or `word1,word2,word3,word4,word5` always became `"word1","word2","word3","word4","word5"`. You guys were very helpful. I surely appreciate all the help! – Vituvo Sep 18 '17 at 23:55
  • Alas, this fails if a quoted field has a comma. Eg. `one,"two,two",three` becomes `"one","two","two","three"`. – Heath Raftery Mar 10 '20 at 04:48
  • You're right, @Heath, and there are probably many other inputs that would cause it to fail, after all it's just a series of substitutions that aren't context-aware. The best way to process a CSV file is to use a tool that supports them, such as Python with its [`csv` module](https://docs.python.org/3/library/csv.html). – Tom Fenech Mar 10 '20 at 07:46
  • Funny, using Python's `csv` module (from within a shell script) is indeed what I ended up doing. I'll add my answer. – Heath Raftery Mar 10 '20 at 13:03
2

Regular expressions with sed and awk are subject to a seemingly never-ending series of edge cases that fail. Leveraging a csv library instead provides a great deal more robustness.

I found Python's library was the best choice because it's:

  1. widely available without onerous dependencies, with the exception of Python itself;
  2. not particular sensitive to the version of Python you use;
  3. lends itself to being embedded within a shell script; and
  4. is quite compact (a one-liner will do!).

Thus, my solution is along the lines of:

QUOTE_CSV_PY='import sys; import csv; csv.writer(sys.stdout, quoting=csv.QUOTE_ALL).writerows(csv.reader(sys.stdin))'
head -1 file.csv | python -c "$QUOTE_CSV_PY"; tail -n +2 file.csv

To break it down:

  • QUOTE_CSV_PY is a shell variable containing the Python one-liner commands
  • The Python commands simply import the standard sys and csv modules. It then creates a csv writer that writes to stdout with QUOTE_ALL set so all fields get quoted. It is fed a csv reader that reads from stdin.
  • head -1 sends the first line to the python interpreter for processing.
  • ; tail -n +2 waits until the processing is done and then just dumps out every line from number two onwards.
Heath Raftery
  • 3,643
  • 17
  • 34
1

This problem suits awk more than sed due to row/column processing:

awk 'BEGIN{FS=OFS=","} NR==1 {
   for (i=1; i<=NF; i++) {gsub(/^"|"$/, "", $i); $i = "\"" $i "\""}
} 1' file

"word1","word2","word3","word4","word5"
12345,12346,12347,12348,12349
  • Using gsub function we remove leading or trailing double quote, if it exists
  • Then we can safely wrap each cell in double quotes
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • 1
    This one worked great. I even placed quotes at random values like so: `word1,word2,"word3",word4,"word5"` and the final output was always `"word1","word2","word3","word4","word5"` – Vituvo Sep 18 '17 at 23:49
0

Keep your existing working sed command, by removing all possible double quotes first:

sed '1 { s/"//g; s/^/"/; s/,/","/g; s/$/"/ }' file.csv 
George Vasiliou
  • 6,130
  • 2
  • 20
  • 27
  • This was close. If the values had quotes already, everything was left the same except the last value which had a space before the closing quotation like so: `"word1","word2","word3","word4","word5 "` – Vituvo Sep 18 '17 at 23:38
0

To test each answer I created three files:

file.csv

word1,word2,word3,word4,word5  
12345,12346,12347,12348,12349 

file2.csv

"word1","word2","word3","word4","word5"  
12345,12346,12347,12348,12349

file3.csv

"word1",word2,word3,"word4",word5  
12345,12346,12347,12348,12349

Then I created a bash script

#!/bin/bash  

sed -E '1 { s/^"?/"/; s/"?,"?/","/g; s/"?$/"/ }' file.csv > final.csv  
sed -E '1 { s/^"?/"/; s/"?,"?/","/g; s/"?$/"/ }' file2.csv > final2.csv  
sed -E '1 { s/^"?/"/; s/"?,"?/","/g; s/"?$/"/ }' file3.csv > final3.csv 

Then I looked at the final files and the first lines were perfect.

# cat final*.csv  

"word1","word2","word3","word4","word5"  
12345,12346,12347,12348,12349  
"word1","word2","word3","word4","word5"  
12345,12346,12347,12348,12349  
"word1","word2","word3","word4","word5"  
12345,12346,12347,12348,12349  
Vituvo
  • 1,008
  • 1
  • 9
  • 29