3

Trying to wrap my head around look-ahead and look-behind in regex processing.

Let's assume I have a file listing PIDs and other things. I want to build a regex to match the PID format \d{1,5} but that also excludes a certain PID.

$myself = $$;
@file = `cat $FILE`;
@pids = grep /\d{1,5}(?<!$myself)/, @file;

In this regex I try to combine the digits match with the exclusion using a negative look-behind by using the (?<!TO_EXCLUDE) construct. This doesn't work.

Sample file:

456
789
4567
345
22743
root
bin
sys

Would appreciate if someone could point me in the right direction.

Also would be interested to find out if this negative look-behind would be the most efficient in this scenario.

emx
  • Can you post some sample data? Note that a look-behind must be fixed-width, or it won't work. – nhahtdh Jun 18 '12 at 12:53
  • It would be helpful to know more about the file content than "PIDs and other things". – TLP Jun 18 '12 at 12:54
  • Actually the "file" comes from a directory listing of /proc (the idea being to get a list of running PIDs without my own) – emx Jun 18 '12 at 12:58

5 Answers

6

"Look behind" really looks behind. So, you can check whether a PID is preceded by something, not whether it matches something. If you just want to exclude $$, you can be more straightforward:

@file = `cat $FILE`;
@pids = grep /(\d{1,5})/ && $1 ne $$, @file;
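
To make "preceded by" concrete, here is a minimal sketch. The sample strings and the `uid=` prefix are made up for illustration and are not from the question. The negative look-behind only inspects the text to the left of the current position; it says nothing about what the digits themselves are.

```perl
use strict;
use warnings;

# Hypothetical sample strings, made up for illustration.
my %seen;
for my $s ('pid=123', 'uid=123') {
    # (?<!uid=) only passes at positions NOT immediately preceded by "uid=".
    $seen{$s} = $s =~ /(?<!uid=)\b(\d+)/ ? $1 : undef;
    printf "%s: %s\n", $s, defined $seen{$s} ? "matched $seen{$s}" : 'no match';
}
```

The number is identical in both strings; only what comes before it decides whether the match succeeds.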
choroba
  • Interesting syntax, exactly what I wanted to do but didn't know this was possible like this. – emx Jun 18 '12 at 13:06
  • Reading your answer again, I realize I completely misunderstood the look-behind concept in regex. – emx Jun 18 '12 at 13:13
5

I've upvoted the choroba solution, just wanted to explain why your original approach didn't work.

See, the regex engine is a complicated beast: it is torn between trying to match as many characters as possible and trying to match at any cost. And the latter, well, usually wins. )

For example, let's analyze the following:

my $test_line = '22743';
my $pid = '22743';
print 'Matched?', "\n" if $test_line =~ /\d{1,5}(?<!$pid)/;
print $&, "\n";

Why did it print 'Matched?', you may ask? Because that's what happened: first the engine tried to consume all five digits, then match the next subexpression - and failed (that was the point of the negative lookbehind, wasn't it?).

If it were you, you would have stopped already - but not the engine! It still feels that dark desire to match no matter what! So it backtracks and tries the next possible repetition count - four digits instead of five - and now, of course, the negative lookbehind is destined to succeed, because the four characters '2274' cannot end in '22743'. ) That's quite easy to check by examining what's printed by print $&;
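
A minimal sketch of that check, using a capture group instead of `$&` (the hash `%matched` and the second test line are illustrative additions, and `$pid` stands in for `$$`):

```perl
use strict;
use warnings;

my $pid = '22743';   # stand-in for $$, as in the example above

my %matched;
for my $line ('22743', '4567') {
    # Capture what \d{1,5} ends up holding once the look-behind is satisfied.
    if ($line =~ /(\d{1,5})(?<!$pid)/) {
        $matched{$line} = $1;
        print "$line: matched '$1'\n";
    }
}
```

For the line that equals the PID, the engine gives back one digit and reports '2274', so the line still passes a grep. The unrelated line matches in full.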

Can it be solved yet within the realm of regular expressions? Yep, with so-called atomic groups, which never give back what they have consumed:

print 'No match for ya!', "\n" unless $test_line =~ /(?>\d{1,5})(?<!$pid)/;

But that's usually considered dark magic, I guess. )

raina77ow
  • This is the magic of SO that not only you get answers, you also get dedicated nice people who take the time to explain in details what you haven't understood properly. Thank you very very much for this mind-opening explanation. – emx Jun 18 '12 at 13:25
  • Thank you for your kind words. ) – raina77ow Jun 18 '12 at 13:54
4

And if you are curious how it could be done with regex here are some examples:

/\b\d{1,5}+(?<!\b$pid)/

/\b\d{1,5}\b(?<!\b$pid)/

/\b(?!$pid\b)\d+/

/^(?!$pid$)\d+$/
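
A quick way to compare the four patterns is to run each one over the sample data from the question. This is only a sketch: `$pid` is set to '456' as a stand-in for `$$`, chosen so that '4567' is a near miss, and the labels are made up for readability.

```perl
use strict;
use warnings;

my $pid  = '456';   # stand-in for $$, chosen so "4567" is a near miss
my @file = qw(456 789 4567 345 22743 root bin sys);

# Label/pattern pairs, kept in an array so the output order is stable.
my @patterns = (
    [ 'possessive + look-behind' => qr/\b\d{1,5}+(?<!\b$pid)/  ],
    [ 'boundary + look-behind'   => qr/\b\d{1,5}\b(?<!\b$pid)/ ],
    [ 'look-ahead, word bounds'  => qr/\b(?!$pid\b)\d+/        ],
    [ 'look-ahead, anchored'     => qr/^(?!$pid$)\d+$/         ],
);

my %result;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    $result{$label} = join ' ', grep { /$re/ } @file;
    print "$label: $result{$label}\n";
}
```

All four keep 789, 4567, 345 and 22743 and drop both 456 and the non-numeric lines. The first two rely on a possessive quantifier or a trailing `\b` to stop the digits from being given back; the last two avoid the issue by testing before any digit is consumed.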
Qtax
2

How about:

chomp(@file);      # remove newlines that will otherwise mess things up
my @pids = grep /\d{1,5}/, @file;
my %pids = map { $_ => 1 } @pids;  # no comma after the block, or it is a syntax error

delete $pids{$$};  # delete one specific pid

@pids = keys %pids;

I.e. funnel the list of PIDs through a hash and delete your own PID. The lines read from the file need to be chomped first, otherwise the trailing newline keeps the hash key from matching the PID.

I feel pretty sure there's a module on CPAN that handles processes though.

ETA:

If you are reading the values from readdir as you mentioned in comments, something like this might be your best option (untested):

opendir my $dh, "/proc" or die $!;
my @pids;
while ( my $line = readdir $dh ) {     # iterate through directory content
    next unless $line =~ /^\d{1,5}$/;  # skip non-numbers
    next if $line == $$;               # skip own PID
    push @pids, $line;
}
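
Since the question asked about doing the filtering in a single grep, the same loop can also be sketched as one grep over readdir. The helper name `other_pids` and its two parameters are hypothetical, introduced here only so the directory and the excluded PID can be passed in.

```perl
use strict;
use warnings;

# Hypothetical helper: list numeric entries of $dir, excluding $self.
sub other_pids {
    my ($dir, $self) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    # Keep purely numeric names, then compare numerically against $self.
    my @pids = grep { /^\d{1,5}\z/ && $_ != $self } readdir $dh;
    closedir $dh;
    return @pids;
}

# Only meaningful on systems that expose /proc.
print join(", ", other_pids('/proc', $$)), "\n" if -d '/proc';
```

The numeric test comes first, so the `!=` comparison never sees a non-numeric name.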
TLP
  • Thanks. Elegant but maybe not the most CPU-efficient. There is indeed a [Proc::ProcessTable](http://search.cpan.org/dist/Proc-ProcessTable/ProcessTable.pm) on CPAN that does what I need. – emx Jun 18 '12 at 13:20
  • If there is a cpan module you can use, I would use that instead of trying to hack something together. – TLP Jun 18 '12 at 13:24
  • In this particular case I am looking for the most portable solution to my problem, as this will run on several machines with various OS where I do not have the capability to install modules beforehand. I've even [considered](http://stackoverflow.com/questions/11070596/embedding-a-module-within-a-perl-program/) inlining the module with my code. – emx Jun 18 '12 at 13:28
0

A slightly different way (I try to avoid @file = `cat text.txt`)

my @pids;
open my $fi, "<", "pids.txt" or die "Cannot open pids.txt: $!";
while (<$fi>) {
   if (/(\d{1,5})/) {
      push @pids, $1 if $1 ne $$;
   }
}
close $fi;

print join(", ", @pids), "\n";

This is my second post to SO, I hope it's ok offering an alternate method.

  • Thanks for your contribution. I am trying to be as efficient as possible in this case so going through a while loop might not be as optimal as using the grep regex. – emx Jun 18 '12 at 13:21
  • @emx Reading a file line-by-line is actually more efficient than slurping the file into an array (like you did). – TLP Jun 18 '12 at 13:27
  • Absolutely right. The cat $file was just for simplicity's sake, I actually get the data using a readdir on /proc – emx Jun 18 '12 at 13:29
  • The back-ticks that run `cat` invoke an external process, which is senseless in this case. There's even an [award](http://partmaps.org/era/unix/award.html) for useless `cat`s. As TLP noted, let Perl read the file, line-by-line in this case. – JRFerguson Jun 18 '12 at 13:34
  • Right, this is the last time I simplify my code for the benefit of making my post easier to read. I don't have the `cat` in my actual code. I actually get the data by doing `open $dh, "/proc"` and `readdir $dh` – emx Jun 18 '12 at 13:47
  • @emx If someone does not understand `opendir` and `readdir`, I doubt their answer to your question would be of much use to you. When showing your [sscce](http://sscce.org/) you should not change its fundamental process. – TLP Jun 18 '12 at 14:23