Replace within capture group using sed

Question

I'm trying to use word2vec on some text that contains phrase delimiters like:

I <phrase>like green beans</phrase> in my tortillas.

Before feeding the text to word2vec I need the input to be:

I __like_green_beans__ in my tortillas.

I've been trying to use sed to do the replacement. By doing

sed -e 's@<phrase>\(.*\)</phrase>@__\1__@g' myfile.txt

I can get rid of the delimiters, but I haven't found a way to replace the spaces within the capture group.

Any idea whether this is possible with sed?

This might be useful: [Replace multiple occurrences between two strings](https://stackoverflow.com/questions/48105521/replace-multiple-occurrences-between-two-strings). – PesaThe Feb 12 '18 at 20:04
Thanks @PesaThe, I was able to get the result I wanted using the perl way described in there. – nbermudezs Feb 12 '18 at 20:36

score 3 · Accepted Answer · answered Feb 12 '18 at 20:16


You can try this sed command:

sed -E ':A;s/(>[^ ]*) ([^<]*<)/\1_\2/;tA;s/<[/]*phrase>/__/g'
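To see what each part of the script does, here is a minimal sketch that runs the same command on the sample line from the question. It assumes GNU sed, which the one-line script form relies on; the variable names `line` and `result` are only for illustration.

```shell
# Sample line from the question.
line='I <phrase>like green beans</phrase> in my tortillas.'

# :A                           define a label named A
# s/(>[^ ]*) ([^<]*<)/\1_\2/   replace one space lying between a ">" and a following "<"
# tA                           if that substitution succeeded, jump back to A and try again
# s/<[/]*phrase>/__/g          finally, turn the opening and closing tags into "__"
result=$(printf '%s\n' "$line" |
  sed -E ':A;s/(>[^ ]*) ([^<]*<)/\1_\2/;tA;s/<[/]*phrase>/__/g')

printf '%s\n' "$result"
# prints: I __like_green_beans__ in my tortillas.
```

Each pass through the loop replaces a single space, so a phrase with N spaces takes N+1 attempts before `tA` falls through to the final substitution.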

ctac_


Not sure how this is gonna scale when running it in my entire text corpus but it gets the job done. Thanks :) – nbermudezs Feb 12 '18 at 20:35
For reference, this usage depends on GNU sed. For other variants of sed (notably the ones in BSD, macOS), you may need to separate this into multiple script segments, like this: `sed -E -e ':A' -e 's/(>[^ ]*) ([^<]*<)/\1_\2/;tA' -e 's/<[/]*phrase>/__/g'` – ghoti Feb 13 '18 at 16:07
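On the scaling question: the pattern `(>[^ ]*) ([^<]*<)` is not tied to the `phrase` tags, so on a line with two phrases it also appears to match the spaces in the plain text between `</phrase>` and the next `<phrase>`. The following is a sketch of one possible variant, not the answer's command. It anchors the loop on the literal tags and uses separate `-e` segments, so it should also run on BSD/macOS sed. It assumes phrases do not nest, do not span lines, and contain no `<`; the sample line and the variable names are made up for illustration.

```shell
# Hypothetical input with two phrases on the same line.
line='I <phrase>like green beans</phrase> in my <phrase>corn tortillas</phrase> today.'

# The looped substitution only matches a space that has "<phrase>" before it
# and "</phrase>" after it with no "<" in between, so text outside the tags
# is left alone. The last expression replaces the tags themselves.
result=$(printf '%s\n' "$line" |
  sed -E \
    -e ':A' \
    -e 's/(<phrase>[^<]*) ([^<]*<\/phrase>)/\1_\2/' \
    -e 'tA' \
    -e 's/<\/?phrase>/__/g')

printf '%s\n' "$result"
# prints: I __like_green_beans__ in my __corn_tortillas__ today.
```

The loop still performs one substitution per space, so very long lines with many phrases may be slow; that part of the trade-off is unchanged from the original command.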

score 2 · Answer 2 · answered Feb 12 '18 at 20:06


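Using gnu-awk (RS is a regex that matches one whole phrase, and RT holds the text that RS matched):

awk -v ORS= -v RS='<phrase>[^<]*</phrase>' '1;
RT{gsub(/<\/?phrase>/, "__", RT); gsub(/ /, "_", RT); print RT}' file

Using `[^<]*` rather than `.*` keeps each match inside a single phrase, assuming the phrase text contains no `<`. Output:

I __like_green_beans__ in my tortillas.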

anubhava
