0

Yesterday I discovered sed and it's amazing. I can handle simple regular expressions and literals, but I'm not sure how to remove only the spaces that are NOT between two letters (a-zA-Z).

For example:

Input:

"Mal                        ","","Mr    ","123","  ","   Lauren Hills","Dr  ","  ","      ","        ",

Output:

"Mal","","Mr","123","","Lauren Hills","Dr","","","",

So far I've tried adapting commands that I found here, here and here.

The closest I've got is:

sed 's/ \{1,\}//g' test.csv > test.bak

but that also removes the significant spaces between words, like the space between Lauren and Hills.

Altimus Prime

5 Answers

5

Easier in Perl than sed:

perl -pe 's/\B | \B//g' < input > output

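\B stands for "not at a word boundary", so the pattern removes a space only when at least one of its neighbors is not a word character; spaces with word characters on both sides are kept. Note that in Perl word characters are letters, digits and the underscore, not just letters.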

choroba
  • That is easier. I bet it's faster too because it's all in one step. The file this is getting applied to is 400GB. – Altimus Prime Sep 30 '17 at 17:58
  • I stand corrected. I have two instances, one sed and one perl, running beside each other, and sed just flew past the perl instance for speed. The perl instance was started about a minute and a half before the sed instance, but sed is now twice as far along judging from the file size. – Altimus Prime Sep 30 '17 at 18:10
  • @AuntJamaima: sed being simpler is probably faster - at least for simpler tasks. – choroba Sep 30 '17 at 18:11
  • wrt speed: perl regexps (PCREs) are slower to evaluate than BREs or EREs (see https://swtch.com/~rsc/regexp/regexp1.html) and when you use perl you get the PCRE regexp engine whether your regex is a PCRE or not so you should in general expect perl to be slower than sed or awk when evaluating regexps. – Ed Morton Sep 30 '17 at 21:57
  • @EdMorton: Perl doesn't use PCRE, it uses its own engine, PCRE are only "Perl Compatible". – choroba Sep 30 '17 at 22:16
  • @choroba OK thanks for the correction. Then what I should have said is that perl regexps are slower. Idk for sure about PCREs since the article I referenced only talks about perl but I would be surprised if PCREs are faster than perl regexps otherwise why wouldn't perl just use them? – Ed Morton Sep 30 '17 at 22:25
  • @EdMorton: Maybe because Perl's regex engine has more features than PCRE? See [perlre](http://p3rl.org/perlre). – choroba Sep 30 '17 at 22:34
  • Beats me. I wonder who decides what subset of perl regex is enough to declare that subset perl-compatible. Oh well... my point was really about perl regexps anyway, mentioning PCRE was apparently a mis-step on my part. – Ed Morton Sep 30 '17 at 22:43
  • I think using `perl -pe 's/\B +| +\B//g'` and `sed -E 's/\B +| +\B//g'` would be faster... – Sundeep Oct 01 '17 at 05:30
1

Include the " in the pattern as well:

sed -e 's/ \{1,\}"/"/g' -e 's/" \{1,\}/"/g' test.csv > test.bak

Explanation:

The -e option lets you apply more than one sed expression in a single run.

The first expression replaces one or more spaces followed by a " with a single ".

The second expression replaces a " followed by one or more spaces with a single ".

So it removes leading and trailing spaces within the quotes.
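
To see the two expressions at work, the following sketch applies them to the sample row from the question instead of a file. The variable names `line` and `result` are illustrative only.

```shell
# Sample row from the question.
line='"Mal                        ","","Mr    ","123","  ","   Lauren Hills","Dr  ","  ","      ","        ",'

# First expression: spaces directly before a quote. Second: spaces directly after a quote.
result=$(printf '%s\n' "$line" | sed -e 's/ \{1,\}"/"/g' -e 's/" \{1,\}/"/g')
printf '%s\n' "$result"
```

Spaces that do not touch a quote, such as the one between Lauren and Hills, are left alone because neither expression can match them.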

Kaushik Nayak
  • You are welcome. Here I've reused your pattern. You could also use `+` after the space (with `sed -E`) to match one or more occurrences. It is simpler to read. – Kaushik Nayak Sep 30 '17 at 18:07
  • Obviously that's working with spaces before/after quotes specifically, not spaces not between letters in general so if this is what you wanted then you should edit your question to be accurate so others with the same question in future can find the answer while people with your original question don't come here by mistake. Also, if this IS all you wanted then there's [a far simpler solution](https://stackoverflow.com/a/46507434/1745001). – Ed Morton Sep 30 '17 at 22:00
1

Do it in three steps. The first removes spaces when the character to the left is a letter and the character to the right is not, the second does the opposite, and the third removes spaces when neither neighbor is a letter. The only combination we don't remove is when both surrounding characters are letters.

sed -e 's/\([a-z]\) \{1,\}\([^a-z]\)/\1\2/ig' -e 's/\([^a-z]\) \{1,\}\([a-z]\)/\1\2/ig' -e 's/\([^a-z]\) \{1,\}\([^a-z]\)/\1\2/ig' test.csv > test.bak
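
The `i` flag on `s` is a GNU sed extension, so as a hedged alternative here is a sketch of the same three steps with the case spelled out as `[a-zA-Z]`, run on the sample row from the question. The variable names `line` and `result` are illustrative, and this variant is an assumption about what a more portable form could look like, not part of the original answer.

```shell
# Sample row from the question.
line='"Mal                        ","","Mr    ","123","  ","   Lauren Hills","Dr  ","  ","      ","        ",'

# Step 1: letter, spaces, non-letter. Step 2: non-letter, spaces, letter.
# Step 3: non-letter, spaces, non-letter. The spaces are dropped, the neighbors kept.
result=$(printf '%s\n' "$line" | sed \
  -e 's/\([a-zA-Z]\) \{1,\}\([^a-zA-Z]\)/\1\2/g' \
  -e 's/\([^a-zA-Z]\) \{1,\}\([a-zA-Z]\)/\1\2/g' \
  -e 's/\([^a-zA-Z]\) \{1,\}\([^a-zA-Z]\)/\1\2/g')
printf '%s\n' "$result"
```

Because every pattern needs a character on both sides of the spaces, a single space at the very start or end of a line would not be matched by this approach.
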
Barmar
  • Thank you. This gives an output like `"Mal","","Mr","123"," ","Lauren Hills","Dr"," "," "," ",` which is the answer to the actual question in the headline, while it doesn't match the desired input and output. – Altimus Prime Sep 30 '17 at 17:42
  • I added a step for when both characters are not letters. – Barmar Sep 30 '17 at 17:44
1

You can use this one too.

sed 's/" */"/g;s/ *"/"/g'
ctac_
0
$ sed 's/ *" */"/g' file
"Mal","","Mr","123","","Lauren Hills","Dr","","","",
Ed Morton