
I'm trying to use word2vec on some text that contains phrase delimiters, like this:

I <phrase>like green beans</phrase> in my tortillas.

Before feeding the text to word2vec I need the input to be:

I __like_green_beans__ in my tortillas.

I've been trying to use sed to do the replacement. With

sed -e 's@<phrase>\(.*\)</phrase>@__\1__@g' myfile.txt

I can get rid of the delimiters, but I haven't found a way to replace the spaces within the capture group.

Is this possible with sed?

nbermudezs
  • This might be useful: [Replace multiple occurrences between two strings](https://stackoverflow.com/questions/48105521/replace-multiple-occurrences-between-two-strings). – PesaThe Feb 12 '18 at 20:04
  • Thanks @PesaThe, I was able to get the result I wanted using the perl way described in there. – nbermudezs Feb 12 '18 at 20:36

2 Answers


You can try this sed command:

sed -E ':A;s/(>[^ ]*) ([^<]*<)/\1_\2/;tA;s/<[/]*phrase>/__/g'
ctac_
  • Not sure how this is gonna scale when running it in my entire text corpus but it gets the job done. Thanks :) – nbermudezs Feb 12 '18 at 20:35
  • For reference, this usage depends on GNU sed. For other variants of sed (notably the ones in BSD, macOS), you may need to separate this into multiple script segments, like this: `sed -E -e ':A' -e 's/(>[^ ]*) ([^<]*<)/\1_\2/;tA' -e 's/<[/]*phrase>/__/g'` – ghoti Feb 13 '18 at 16:07
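One caveat with the loop above: it keys on any `>` and `<`, so when a line holds several phrases, the spaces between them can be rewritten as well. A minimal sketch of a stricter variant follows. It assumes phrases are not nested, contain no `<`, and do not span lines, and that your sed accepts `-E` (GNU and BSD sed both do). The function name `join_phrases` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: reads text on stdin, writes the converted text to stdout.
# The loop replaces one space at a time, but only between <phrase> and the
# nearest </phrase>; [^<]* cannot cross a tag, so text outside phrases is untouched.
# Once no phrase contains a space, the final expression turns both tags into "__".
# The label and branch are separate -e segments so BSD/macOS sed accepts them too.
join_phrases() {
  sed -E \
    -e ':a' \
    -e 's#(<phrase>[^<]*) ([^<]*</phrase>)#\1_\2#' \
    -e 'ta' \
    -e 's#</?phrase>#__#g'
}

printf '%s\n' 'I <phrase>like green beans</phrase> in my <phrase>corn tortillas</phrase>.' | join_phrases
# prints: I __like_green_beans__ in my __corn_tortillas__.
```

Because the loop performs one substitution per pass, long phrases cost several passes per line, so it may be worth timing this on a sample before running it over a whole corpus.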

Using GNU awk (gawk):

awk -v ORS= -v RS='<phrase>[^<]*</phrase>' '1;
RT{gsub(/<\/?phrase>/, "__", RT); gsub(/ /, "_", RT); print RT}' file

I __like_green_beans__ in my tortillas.
anubhava