2

Can I use sed to replace a regex match with a transformation of a group back reference within the regex?

Problem

Suppose I want to replace strings of the form:

(@ -p <fqdn>)

with:

<fqdn with dots replaced by underscores>

There may be multiple such matches per line.

Example

com.xyz (@ -p com.abc.def) com.pqr.stu (@ -p com.ghi)

would become:

com.xyz com_abc_def com.pqr.stu com_ghi

Ideas

To start working towards a solution, consider:

$ sed 's|(@ -p \([^)]*\))|\1|g' <<<"com.xyz (@ -p com.abc) com.pqr (@ -p com.ghi)"
com.xyz com.abc com.pqr com.ghi

This does the appropriate selection; however, now I still need to have the \1 portion transformed with s|\.|_|g.

Can anyone show how this can be done using sed?

My environment is bash 4.2.46(1)-release, CentOS 7.3.1611.

Notes:

  • I am adding this to an existing sed script, so I would much prefer a sed solution over piping the output of my current sed script to another string processor such as awk. If there is no sed solution to this problem, I will consider awk solutions next.
  • My question is specific to the pattern shown in the above example.
Steve Amerige

3 Answers

3

If the target string only occurs once (per line of input), you can use the hold space to do the double replacement, like this:

Single replacement

#Copy input line to the hold space: A(@B)C -- A(@B)C
h

#Replace the target substring with (@) (a "marker" string): A(@)C -- A(@B)C 
s/(@ -p [^)]*)/(@)/

#Exchange the content of the pattern space and hold space: A(@B)C -- A(@)C
x

#Strip off anything except the target substring value: B -- A(@)C
s/.*(@ -p \([^)]*\)).*/\1/

#Modify the target substring as appropriate: B' -- A(@)C
y/./_/

#Append the content of the hold space to the pattern space: B'\nA(@)C -- A(@)C
G

#Merge the lines, replacing the "marker" string with the processed value: AB'C
s/\(.*\)\n\(.*\)(@)/\2\1/

Sample output:

%echo "com.xyz (@ -p com.abc) com.pqr" | sed -f doublereplace.sed 
com.xyz com_abc com.pqr
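
To see the states described in the comments, the same commands can be run under sed -n with a print after each step. This is only a tracing sketch: the function name trace_single is made up for illustration, and l is used after G so that the embedded newline shows up as \n.

```shell
# Hypothetical helper (not part of the answer): prints the pattern space
# after the strip, y, G and merge steps of the single-replacement script.
trace_single() {
  printf '%s\n' "$1" | sed -n '
    h
    s/(@ -p [^)]*)/(@)/
    x
    s/.*(@ -p \([^)]*\)).*/\1/
    # pattern space now holds only the captured value
    p
    y/./_/
    p
    G
    # l prints unambiguously: newline as \n, end of pattern space as $
    l
    s/\(.*\)\n\(.*\)(@)/\2\1/
    p
  '
}

trace_single 'com.xyz (@ -p com.abc) com.pqr'
# com.abc
# com_abc
# com_abc\ncom.xyz (@) com.pqr$
# com.xyz com_abc com.pqr
```

The third line of the trace shows why the final substitution works: the processed value sits before the newline and the line with the (@) marker sits after it.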

Multiple replacements

The looped version will look like this:

#Loop label (on its own line, so the script does not rely on sed accepting a command after a label)
:start
/(@/ {
    #Copy input line to the hold space: A(@B)C -- A(@B)C
    h

    #Replace the target substring with (@) (a "marker" string): A(@)C -- A(@B)C 
    s/(@ -p [^)]*)/(@)/

    #Exchange the content of the pattern space and hold space: A(@B)C -- A(@)C
    x

    #Strip off anything except the target substring value: B -- A(@)C
    s/[^(]*(@ -p \([^)]*\)).*/\1/

    #Modify the target substring as appropriate: B' -- A(@)C
    y/./_/

    #Append the content of the hold space to the pattern space: B'\nA(@)C -- A(@)C
    G

    #Merge the lines, replacing marker string with the processed value: AB'C
    s/\(.*\)\n\(.*\)(@)/\2\1/

    #Loop
    b start
}

Sample output:

%echo "com.xyz (@ -p com.abc.def) com.pqr.stu (@ -p com.ghi)" |
sed -f doublereplace.sed

com.xyz com_abc_def com.pqr.stu com_ghi

Hardened

A bit more reliable version might use newlines as separators/marker string:

#Loop label (on its own line, so the script does not rely on sed accepting a command after a label)
:start
/(@ -p [^)]*)/ {
    #Copy input line to the hold space: A(@B)C -- A(@B)C
    h

    #Replace the target substring with a newline (the "marker" string): A\nC -- A(@B)C
    s/(@ -p [^)]*)/\n/

    #Exchange the content of the pattern space and hold space: A(@B)C -- A\nC 
    x

    #Isolate the first instance of a target substring on a separate line: A\n(@B)\nC -- A\nC
    s/\((@ -p [^)]*)\)/\n\1\n/1

    #Strip off anything except the target substring value: B -- A\nC
    s/.*\n(@ -p \([^)]*\))\n.*/\1/

    #Modify the target substring as appropriate: B' -- A\nC
    y/./_/

    #Append the content of the hold space to the pattern space: B'\nA\nC -- A\nC
    G

    #Merge the lines, replacing marker string with the processed value: AB'C
    s/\(.*\)\n\(.*\)\n/\2\1/

    #Loop
    b start
}

This version tolerates constructs in the input that start with (@ but do not match the full (@ -p ...) pattern, such as (@ t.i.m.e.s):

%echo "com.xyz (@ -p com.abc.def) fails (@ t.i.m.e.s) com.pqr.stu (@ -p com.ghi)" |
sed -f doublereplace.sed

com.xyz com_abc_def fails (@ t.i.m.e.s) com.pqr.stu com_ghi
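
For comparison, the dots can also be rewritten in place, one per iteration, with a t loop and no hold space. This is a sketch of a different technique, not part of the scripts above. The function name dots_to_underscores is hypothetical, and it assumes the same constraints as before: no nested constructs and no ) inside the fqdn.

```shell
# Hypothetical helper: reads lines on stdin, writes transformed lines to stdout.
dots_to_underscores() {
  # 1. Replace one dot inside a complete "(@ -p ...)" construct.
  # 2. "ta" jumps back to label a for as long as step 1 changed something.
  # 3. Once no dots remain inside any construct, strip the wrappers.
  sed -e ':a' \
      -e 's/\((@ -p [^).]*\)\.\([^)]*)\)/\1_\2/' \
      -e 'ta' \
      -e 's/(@ -p \([^)]*\))/\1/g'
}

printf '%s\n' 'com.xyz (@ -p com.abc.def) fails (@ t.i.m.e.s) com.pqr.stu (@ -p com.ghi)' |
  dots_to_underscores
# com.xyz com_abc_def fails (@ t.i.m.e.s) com.pqr.stu com_ghi
```

Because the first substitution requires both the leading (@ -p and the closing ), text such as (@ t.i.m.e.s) is left alone here as well.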
zeppelin
  • Brilliant! I've never thought that hold space and pattern space could be used as this:). – Paul Aug 11 '17 at 11:04
  • @paul any sed script that involves hold space and/or pattern space can be written more clearly, efficiently, robustly, portably, etc. in awk so YMMV with the benefits of that knowledge :-). Good mental exercise but I'd hate to come across it if I was asked to debug a program! – Ed Morton Aug 11 '17 at 11:09
  • I like this, too. Yes, I would need to handle multiple matches per line. – Steve Amerige Aug 11 '17 at 11:09
  • @SteveAmerige if you need to handle multiple matches per line then you should state and show that in your question with at least 2 matches in your example. There's usually a difference when processing text between handling 1 vs N transformations per line. – Ed Morton Aug 11 '17 at 11:11
  • In my problem statement, you are correct in that I did not specifically mention the number of replacements per line that might be possible. I did not specify just 1. Nor did I specify more than one. I have now made it explicit. Sometimes it is very hard to ensure that what I thought of as something that would be understood is actually what is understood by others. So, your feedback is valued. I have updated the question to make my assumption explicit. – Steve Amerige Aug 11 '17 at 11:16
  • Thank you for the loop example. Upvoting. Note that I improved the above to allow for spaces before/after the -p by changing two lines: first change to `s/(@ *-p *[^)]*)/(@)/` and second change to: `s/[^(]*(@ *-p *\([^ )]*\) *).*/\1/`. Also, kudos for showing the sed script with comments! – Steve Amerige Aug 11 '17 at 12:28
  • Try adding some other string that starts with `(@` to your input, e.g. add `fails (@ t.i.m.e.s)` in the middle and you'll see it replaces those `.`s with an underscore and prints the line twice and leaves the `(` in the final real string `(com_ghi`. We could peel the onion for days and add 20 more single characters plus the batman symbol but this is simply not a job for sed. – Ed Morton Aug 11 '17 at 14:45
  • `@ t.i.m.e.s` (and other incomplete match) issues are easy to fix by using newlines as separators (see "Hardened" version). You can still break it by putting newlines in the input data, or "nesting" constructs, e.g. `"(@ -p D.A.T.A (@ -p a.b.c) D.A.T.A) r.t.y"`, but as long as you can assume your input data to not break these basic constraints, it should be pretty safe to use, even w/o resorting to [batman](https://unicode-table.com/en/1F987/) symbol. BTW the "nested constructs" example above will make the current AWK version break too. – zeppelin Aug 11 '17 at 17:24
2

You can use GNU awk (the RT variable and gensub() are gawk extensions):

s='com.xyz (@ -p com.abc.def) com.pqr.stu'
awk -v RS='\\(@ -p [^)]+\\)' '{
       ORS=gensub(/.* |\)/,"","g",gensub(/\./,"_","g",RT))} 1' <<< "$s"

com.xyz com_abc_def com.pqr.stu
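
To make the mechanism visible: with RS set to a regex, gawk splits the input at each match and stores the matched text in RT, which the one-liner then transforms and emits as ORS. The sketch below just prints each record next to its RT. The function name show_records is hypothetical, and it assumes gawk is installed under that name.

```shell
# Hypothetical helper: shows how the input is cut into records and what RT holds.
show_records() {
  printf '%s\n' "$1" | gawk -v RS='\\(@ -p [^)]+\\)' '{
    r = $0
    sub(/\n$/, "", r)   # the last record keeps the trailing newline of the input
    printf "record=[%s] RT=[%s]\n", r, RT
  }'
}

show_records 'com.xyz (@ -p com.abc.def) com.pqr.stu'
# record=[com.xyz ] RT=[(@ -p com.abc.def)]
# record=[ com.pqr.stu] RT=[]
```

For the final record RT is empty, so ORS becomes empty too and the newline that ends the output comes from the record itself.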
anubhava
0

gawk solution:

str="com.xyz (@ -p com.abc.def) com.pqr.stu"
awk 'match($0, /\(@ -p ([^)]+)\)/, a){ cmd = "echo "a[1]" | tr \".\" \"_\""; cmd | getline v; close(cmd);
     sub(/\(@ -p ([^)]+)\)/,v, $0); print }' <<< "$str"

The output:

com.xyz com_abc_def com.pqr.stu
RomanPerekhrest