3

I have a file that contains lines like this

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>

I need to replace all the spaces between <phrase> tags with an underscore. So basically I need to replace every space that falls between > and </ with an underscore. I've tried many different commands in sed, awk, and perl but haven't been able to get anything to work. Below are some of the commands I've tried.

sed 's@>\s+[</]@_@g'

perl -pe 'sub c{$s=shift;$s=~s/ /_/g;$s}s/>.*?[<\/]/c$&/ge'

sed 's@\(\[>^[<\/]]*\)\s+@\1_@g'

awk -v RS='\\[>^[<\]/]*\\]' '{ gsub(/\<(\s+)\>/, "_", RT); printf "%s%s", $0, RT }' infile

I've been looking at these 2 questions trying to modify the answers to use the characters I need.
sed substitute whitespace for dash only between specific character patterns

https://unix.stackexchange.com/questions/63335/how-to-remove-all-white-spaces-just-between-brackets-using-unix-tools

Can anyone please help?

gary69

6 Answers

5

Don't use regular expressions to parse XML/HTML.

use warnings;
use 5.014;  # for /r modifier
use Mojo::DOM;

my $text = <<'ENDTEXT';
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
ENDTEXT

my $dom = Mojo::DOM->new($text);
$dom->find('phrase')->each(sub { $_->content( $_->content=~tr/ /_/r ) });
print $dom;

Output:

some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Update: Mojolicious even includes some sugar that lets you squash that code into a one-liner:

$ perl -Mojo -pe '($_=x($_))->find("phrase")->each(sub{$_->content($_->content=~tr/ /_/r)})' input.txt
haukex
  • Thank you, I thought since it had a lot of free text mixed in with tags that a parser wouldn't work – gary69 Feb 09 '19 at 23:01
  • I was assuming the input file wasn't HTML because OP described it as being line based. – melpomene Feb 09 '19 at 23:01
  • 1
    it is line based, it's not an xml/html file – gary69 Feb 09 '19 at 23:01
  • `Mojo::DOM` is luckily pretty liberal in what it accepts, as the example shows – haukex Feb 09 '19 at 23:04
  • 2
    @gary69 "free text" is just a text node in XML. It can contain anything, except for XML tags which are separate nodes. Like most HTML/XML parsers, Mojo::DOM allows you to get at the text nodes as well. – Grinnz Feb 10 '19 at 05:49
2

I need to replace every space that falls between > and </ with an underscore.

That won't actually do what you want because e.g. in

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
                  ^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

the substrings "between > and </" cover more than you think (marked ^ above).

I think the most straightforward way to express your requirements in Perl is

perl -pe 's{>[^<>]*</}{ $& =~ tr/ /_/r }eg'

Here [^<>] is used to make sure that the matched substring cannot contain < or > (in particular, it cannot match other <phrase> tags).

If that's too readable, you can also do

perl '-pes;>[^<>]*</;$&=~y> >_>r;eg'
melpomene
  • 1
    See https://ideone.com/Oz6ckt. It should be `'s{<phrase>.*?</phrase>}{ $& =~ tr/ /_/r }eg'`. However, strings might have line breaks between tags, in the general case. – Wiktor Stribiżew Feb 09 '19 at 22:46
  • Thank you so much. For the input I'll be handling there won't be line breaks between tags. – gary69 Feb 09 '19 at 22:56
  • @melpomene.. I tried something like ````perl -lne ' s/(?=>)([^<>]+?)(?=<\/)/```` but it is not working... could you please help me?. – stack0114106 Feb 09 '19 at 23:39
1

Another Perl solution, replacing the spaces between the <phrase> tags:

$ export a="some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"

$ echo $a | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;sprintf("%s",$x)/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

$

EDIT

Thanks @haukex, shortening further

$ echo $a | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;$x/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

$
Mr Lister
stack0114106
  • IMHO `s/<phrase>\K(.+?)(?=<\/phrase>)/` would be better, so the `<phrase>` isn't part of the match. Also, what is `sprintf("%s",$x)` for? – haukex Feb 09 '19 at 23:39
  • @haukex, yes \K is fine, but ````(?=<phrase>)```` is more readable here.. using sprintf(), you can replace back the matched string (actually $1=$& here) – stack0114106 Feb 09 '19 at 23:44
  • @haukex.. btw melphomene has done it cleverly using ````$& =~ tr/ /_/r```` and I like that – stack0114106 Feb 09 '19 at 23:46
  • "More readable" is of course a matter of opinion, my point was that it acts differently than `\K` - the equivalent would be `(?<=<phrase>)`. `sprintf("%s",$x)` can be replaced by just `$x`. – haukex Feb 09 '19 at 23:54
  • @haukex.. yes you are right. I updated the answer.. and regarding sprintf()... I wonder how I missed that.. I have been doing lots of %05d replacements in my projects and totally forgot that a simple variable would do.. I'm really stuck!.. thanks for hammering it home.. – stack0114106 Feb 10 '19 at 00:02
  • @haukex.. I think you can help me.. do you know why ````/(?=>)(.*?)(?=<\/)/```` works but not ````(?=>)([^<>]+)(?=<\/)```` – stack0114106 Feb 10 '19 at 00:11
  • The problem with the second one is again the lookahead instead of a lookbehind. Remember that lookarounds are zero-width, they don't cause the position of the matcher to change. `(?=>)` means "only match here if the next thing is a `>`", i.e. it overlaps with the next part of the match, but then the next thing is `[^<>]`, so the match is always impossible. (This overlap issue is what I meant in my first comment regarding `(?=<phrase>)`.) `(?<=>)` means "only match here if the *previous* thing is `>`", so that lookbehind doesn't overlap with `[^<>]+` and it works. – haukex Feb 10 '19 at 09:18
  • @haukex.. thank you for the explanation..now I understand – stack0114106 Feb 10 '19 at 12:18
1

This might work for you (GNU sed):

sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g' file

Delimit tags by inserting newlines. Iteratively substitute spaces between pairs of newlines with underscores. When there are no more matches, remove the introduced newlines.
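
The steps described above can be looked at one at a time. The sketch below assumes GNU sed (for -E, \n in the replacement and inside bracket expressions, and labels followed by a semicolon) and uses the sample line from the question. The variable names line, split and result are only for illustration.

```shell
line='some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>'

# Step 1 only: put a newline in front of every opening and closing tag, so the
# text inside a phrase always sits in an odd-numbered newline-delimited segment.
split=$(printf '%s\n' "$line" | sed -E 's/<phrase>|<\/phrase>/\n&/g')
printf '%s\n' "$split"

# All steps: mark the tags, loop replacing one space per pass inside the marked
# segments until no substitution succeeds, then remove the inserted newlines.
result=$(printf '%s\n' "$line" |
  sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g')
printf '%s\n' "$result"
```

Printing the intermediate value shows the line broken into five pieces at the tags, which is what lets the loop tell inside from outside without naming the tags again.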

potong
1

With GNU awk for multi-char RS and RT:

$ awk -v RS='</?phrase>' '!(NR%2){gsub(/\s+/,"_")} {ORS=RT}1' file
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
Ed Morton
1

If your data is in a file named 'd', with GNU sed:

sed -E ':b s/<(\w+)>([^<]*)\s([^<]*)(<\/\1)/<\1>\2_\3\4/;tb' d