Skipping particular positions in a string using substitution operator in perl

Question

Yesterday I got stuck on a Perl script. To simplify: suppose there is a string (say ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD). First I have to break it at every position where "E" occurs, and secondly, break it specifically where the user wants to be at. The condition is that the program should not cut at sites where E is followed by P. For example, there are 6 Es in this sequence, so one should get 7 fragments, but as 2 Es are followed by P, one will get only 5 fragments in the output.

I need help with the second case. Suppose the user doesn't want to cut this sequence at, say, the 5th and 10th positions of E in the sequence. What should the script look like so that the program skips only these two sites? My script for the first case is:

my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';

$otext=~ s/([E])/$1=/g; #Main cut rule.

$otext=~ s/=P/P/g;

my @output = split( /=/, $otext);

print "@output";

Please do help!

You should use code markdown to make code more readable similar to how I edited - you can click the question mark next to the text editor window to get markup help. — DVK, Aug 11 '12 at 12:48

DVK · Accepted Answer · 2012-08-11T13:19:16.903

To split on "E" except where it's followed by "P", you should use a negative look-ahead assertion.

From perldoc perlre "Look-Around Assertions" section:

(?!pattern)
A zero-width negative look-ahead assertion.
For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".

my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD'; 
#                E    E    EP    E    EP    E
my @output=split(/E(?!P)/, $otext); 
use Data::Dumper; print Data::Dumper->Dump([\@output]);

$VAR1 = [
          'ABCD',
          'ABCD',
          'ABCDEPABCD',
          'ABCDEPABCD',
          'ABCD'
        ];
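Note that split discards the E it matches on. If each fragment should keep its trailing E, one possible variation (a sketch, not part of the original answer) is to split on a zero-width position instead: a look-behind for E combined with the same negative look-ahead. The name @kept is only for illustration.

```perl
use strict;
use warnings;

my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';

# Split at the empty position that is preceded by E and not followed by P.
# The pattern is zero-width, so split consumes no characters and every
# fragment keeps the E it ends with.
my @kept = split /(?<=E)(?!P)/, $otext;

print "$_\n" for @kept;
```

With the sample string this prints ABCDE, ABCDE, ABCDEPABCDE, ABCDEPABCDE and ABCD, one per line.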

Now, in order to NOT cut at occurrences #2 and #4, you can do one of two things:

Concoct a really fancy regex that automatically fails to match on a given occurrence. I will leave that to someone else to attempt in an answer, for completeness' sake.

Simply stitch together the correct fragments.

I'm too brain-dead to come up with a good idiomatic way of doing it, but the simple and dirty way is either:

  my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
  my @output_final;
  for(my $i=0; $i < @output; $i++) {
      if ($no_cuts{$i}) {
          $output_final[-1] .= $output[$i];
      } else {
          push @output_final, $output[$i];
      } 
  }
  print Data::Dumper->Dump([\@output_final]);

  $VAR1 = [
            'ABCD',
            'ABCDABCDEPABCD',
            'ABCDEPABCDABCD'
          ];

Or, simpler:

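  my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
  for(my $i=1; $i < @output; $i++) {
      next unless $no_cuts{$i}; # Only merge at the requested positions
      # Assumes no two adjacent positions are listed
      $output[$i-1] .= $output[$i]; # Append to the previous fragment
      $output[$i]=undef; # Make the slot empty
  }
  my @output_final = grep { defined } @output; # Skip empty slots
  print Data::Dumper->Dump([\@output_final]);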

  $VAR1 = [
            'ABCD',
            'ABCDABCDEPABCD',
            'ABCDEPABCDABCD'
          ];
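For completeness, here is a minimal sketch of the first option that avoids a fancy regex by counting matches instead. It assumes the skip list holds 1-based numbers of cut sites (an E not followed by P), so 2 and 4 correspond to the example above; %skip, $site and @fragments are illustrative names, not anything from the question. Unlike the stitching approach, it leaves every character of the original string in place, including the E at each skipped site.

```perl
use strict;
use warnings;

my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';

# Assumption: 1-based numbers of the cut sites that must NOT be cut.
my %skip = map { $_ => 1 } (2, 4);

my @fragments;
my $start = 0;   # offset where the current fragment begins
my $site  = 0;   # running count of cut sites seen so far

while ($otext =~ /E(?!P)/g) {
    $site++;
    next if $skip{$site};          # leave this site uncut
    my $end = pos($otext);         # offset just past the matched E
    push @fragments, substr($otext, $start, $end - $start);
    $start = $end;
}

# Whatever follows the last cut is the final fragment.
push @fragments, substr($otext, $start) if $start < length $otext;

print "$_\n" for @fragments;
```

With the sample string this yields ABCDE, ABCDEABCDEPABCDE and ABCDEPABCDEABCD.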

CodeClown42 · Answer 2 · 2012-08-11T14:05:58.840

Here's a dirty trick that exploits two facts:

Normal text strings never contain null bytes (if you don't know what a null byte is, you should as a programmer: http://en.wikipedia.org/wiki/Null_character; note that it is not the same thing as the number 0 or the character 0).
Perl strings can contain null bytes if you put them there, but be careful, as this may trip up some of Perl's internal functions.

The "be careful" is just a point to be aware of. Anyway, the idea is to substitute a null byte at the point where you don't want breaks:

my $s = "ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD";

my @nobreak = (4,9);

foreach (@nobreak) {
    substr($s, $_, 1) = "\0";
}

"\0" is an escape sequence representing a null byte like "\t" is a tab. Again: it is not the character 0. I used 4 and 9 because there were E's in those positions. If you print the string now it looks like:

ABCDABCDABCDEPABCDEABCDEPABCDEABCD

The null bytes don't display, but they are there, and we are going to swap them back out later. First the split:

my @a = split(/E(?!P)/, $s);

Then swap the null bytes back:

$_ =~ s/\0/E/g foreach (@a);

If you print @a now, you get:

ABCDEABCDEABCDEPABCD
ABCDEPABCD
ABCD

Which is exactly what you want. Note that split removes the delimiter (in this case, the E); if you intended to keep those you can tack them back on again afterward. If the delimiter is from a more dynamic regex it is slightly more complicated, see here:

http://perlmeme.org/howtos/perlfunc/split_function.html

"Example 9. Keeping the delimiter"

If there is some possibility that the @nobreak positions are not E's, then you must also keep track of those when you swap them out to make sure you replace with the correct character again.
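A rough sketch of that bookkeeping, under the assumption that @nobreak holds 0-based character offsets as above: remember each masked character in a hash, then restore it by tracking where each fragment started in the original string. The names %original, @parts and $offset are illustrative, and offset 7 (a C) is included only to show a non-E position being restored.

```perl
use strict;
use warnings;

my $s = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
my @nobreak = (4, 7);   # 0-based offsets; 4 is an E, 7 is a C

# Remember what was at each masked offset so it can be put back later.
my %original;
for my $pos (@nobreak) {
    $original{$pos} = substr($s, $pos, 1);
    substr($s, $pos, 1) = "\0";
}

my @parts = split /E(?!P)/, $s;

# Walk the fragments, tracking each one's offset in the original string.
# Every split consumed exactly one character (the E), hence the +1.
my $offset = 0;
for my $part (@parts) {
    my $len = length($part);
    for my $pos (grep { $_ >= $offset && $_ < $offset + $len } @nobreak) {
        substr($part, $pos - $offset, 1) = $original{$pos};
    }
    $offset += $len + 1;
}

print "$_\n" for @parts;
```

Here the first fragment comes back as ABCDEABCD, with both the E and the C restored.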

edited Aug 11 '12 at 14:05

answered Aug 11 '12 at 13:31

I don't think this is what OP wanted. You're avoiding splitting on **characters** # 4 and 9; OP wanted to avoid splitting on positions of "E" #4 and 9 – DVK Aug 11 '12 at 13:46
Well, in the first paragraph s/he says "break it specifically where the user wants to be at" but in the second "Suppose user doesn't wants to cut this sequence at, say 5th and 10th positions of E in the sequence". This will accomplish the latter: If you want to split on E except where indicated, replace where indicated with something that isn't E, then substitute back again after the split. I assumed the user would not indicate not to break somewhere unless there was actually a possible break there, and breaks were on E, so this meets those criteria, but I'll add a note... – CodeClown42 Aug 11 '12 at 14:04
@DVK: I tried your script, it was pretty useful. But I want user to enter those positions where the program skips to split, and secondly I want "E" at the end of output fragments where split actually happens. So, the output should look like: $VAR1 = [ 'ABCDE', 'ABCDABCDEPABCDE', 'ABCDEPABCDABCDE' ]; – prashant Aug 13 '12 at 05:17
@prashant - you can simply add "E" at the end in a `for` loop - not very neat but works easily. "I want user to enter those positions where the program skips to split" - Not sure what you mean? My program already does that. – DVK Aug 13 '12 at 10:34
@DVK: I added "E" in the for loop and it works fine now. But I want the user to enter the positions the program has to skip through the command line (e.g. positions: 1,2). – prashant Aug 14 '12 at 08:16
@goldilocks: I tried your script, with some modifications. Since I want the user to enter the positions where the program skips, I store all the input values in an array. The problem is that when I enter positions on the command line, say I enter 4 and press Enter, then 9 and Enter again, the program just keeps waiting for more input, and the values of array @a are never printed. Please suggest what I have to correct in my code. My script is in the next comment. – prashant Aug 17 '12 at 07:42
The script is: my $s = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD'; my @nobreak = <STDIN>; foreach (@nobreak) { substr($s, $_, 1) = "\0"; } my @a = split(/E(?!P)/, $s); $_ =~ s/\0/E/g foreach (@a); print "@a\n"; Please suggest some modifications. – prashant Aug 17 '12 at 07:46
@prashant: When I run your script, entering 4 and 9 on separate lines and then Ctrl-D to close `STDIN` (on Linux), it works. – CodeClown42 Aug 17 '12 at 15:03
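
On the command-line question raised in the comments, one possible approach (a sketch only) is to take the positions from @ARGV rather than reading STDIN, so no Ctrl-D is needed. parse_positions is a hypothetical helper name; it accepts arguments such as 4,9 or 4 9 and silently drops anything that is not a plain non-negative integer.

```perl
use strict;
use warnings;

# Hypothetical helper: turn arguments like "4,9" or "4", "9" into a flat
# list of integers, ignoring empty or non-numeric pieces.
sub parse_positions {
    my @args = @_;
    return grep { /^\d+$/ } map { split /[,\s]+/ } @args;
}

my @nobreak = parse_positions(@ARGV);
print "Skipping positions: @nobreak\n";
```

Saved under an assumed name such as script.pl, this could be run as perl script.pl 4,9 and the resulting @nobreak fed into either answer above.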
