3

Say I have a string like this:

Output:   
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"

I want to remove only the non-alphanumeric characters inside the quotation marks, keeping commas, periods, and spaces:

Desired Output:    
I have some-non-alphanumeric % characters remain here, I "also, have some  .here"

I have tried the following sed command, which matches a line and tries to delete inside the quotes, but it deletes everything inside the quotes, including the quotes themselves:

sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'

Any help is appreciated, preferably using sed, to get the desired output. Thanks in advance!

  • sed is not the right tool for this. What about Perl? Did you wanna perl solution? – Avinash Raj Jan 26 '15 at 02:54
  • Well, I am adding this piece of code to an existing script that I will pass on to other users... #!/bin/bash is my shell, so I don't think Perl is beneficial here. – Nikolos Birks Jan 26 '15 at 13:59

2 Answers

2

sed is not the right tool for this. Here is one way to do it with Perl:

perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g' file

Example:

$ echo 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g'
I have some-non-alphanumeric % characters remain here, I "also, have some  .here"


Avinash Raj
2

You need to repeat your substitution multiple times to remove all non-alphanumeric characters. Doing such a loop in sed requires a label and use of the b and t commands:

sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'

(This will work even without the [^"a-zA-Z0-9,. ]*, but it'll be slower on lines that contain many non-alphanumeric characters in a row)

That said, the other answer is right that doing this in Perl is much easier.
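As a quick check, the same loop can be written with one `-e` per command and run against the sample string from the question. This is only a sketch: the variable names `input` and `result` are made up for the example, and `/characters/!b` is a shorter way of writing the two branch lines at the top of the script.

```shell
input='I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"'
result=$(printf '%s\n' "$input" | sed \
  -e '/characters/!b' \
  -e ':repremove' \
  -e 's/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/' \
  -e 't repremove')
printf '%s\n' "$result"
```

One caveat: the pattern does not know whether a quote mark opens or closes a string, so on a line with several quoted sections it can also match from a closing quote to the next opening quote and strip characters that sit between them. For input like that, the Perl approach is the safer choice.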

Daniel Martin