2

I am trying to use a Perl regexp to normalize search strings in a search log against a library database. I need to remove all occurrences of digits, which would be:

s/\d*//g 

except when I have a birth date like 1964- or a lifetime like 1903-1970 or 1903-70. How do I do that?

Nimantha

3 Answers

1

You could use lookaround assertions.

For example, the following pattern

/\b(?<!-)\d+(?!-)\b/

would match a standalone number like 42 or 1970, but would not match the digits in:

  • 1964-
  • 1903-1970
  • 1903-70

Given the input:

42 foo 123 1964- 1903-1970 456 bar 1970

the substitution removes only the standalone numbers:

$ echo 42 foo 123 1964- 1903-1970 456 bar 1970 | perl -pe 's/\b(?<!-)\d+(?!-)\b//g'
 foo  1964- 1903-1970  bar
devnull
  • Exceptions: `-1979` or `1-2-3`. The spec is also unclear on whether numbers should be stripped from within words, so the fact that yours wouldn't strip the trailing numbers in `bob123` could be a bad or good thing. – Miller Mar 15 '14 at 21:59
1

A complicated regex could solve this, for sure. However, I believe the easiest solution is to take advantage of one of regular expressions' most powerful tools, namely greedy matching, and break this into two steps.

s{([-\d]+)}{my $num = $1; $num =~ /^(?:\d+-\d*|-+)$/ ? $num : ''}eg;

The LHS captures any run of digits and/or dashes. The RHS then puts the run back unchanged if it matches one of the specific exceptions that you requested, and replaces it with the empty string otherwise.

I like the two-step solution because it's quicker to see what's happening, and because the regex is less fragile, so it's easier to adjust later with less risk of introducing a bug. All you'd have to do is add any additional exceptions you want to the RHS.

It is possible to duplicate the above using just the LHS by adding boundary conditions that mirror the effect of greedy matching. The following demonstrates that:

s{
    (?<![-\d])     # Start Boundary Condition to Enforce Greedy Matching
    (?!
        (?:          # Old RHS: List of expressions we don't want to match
            \d+-\d*
        |
            -+
        )
        (?![-\d])   # End Boundary Condition to Enforce Greedy Matching
    )
    ([-\d]+)      # Old LHS: What we want to match
    (?![-\d])     # End Boundary Condition to Enforce Greedy Matching
}{}xg;
Miller
0

Do you mean you want to remove all digits except those in a format like 1000- or 1000-90?

Try this one:

(?<!\d)(?<!-)\d+(?!-\d*)(?!\d)
Sabuj Hassan
  • Your regex could be simplified a little by using a character class for the opening negative lookbehind assertions: `(?<![\d-])`. However, your regex will fail on cases such as `-1979` or `1-2-3`. – Miller Mar 15 '14 at 21:55