
I'm trying to create a filter for social security numbers and have the following regex:

\b(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b

The problem is that the regex also matches the following type of string in Spamassassin and I haven't been able to solve the problem.

18-007-08-9056-1462-2205

I would like it to match only if the SSN string is on its own. Examples:

18 007-08-9056 1462-2205
007-08-9056
xyz 007-08-9056
007-08-9056 xyz
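
To make the problem easy to reproduce, here is a minimal sketch that runs the pattern against the strings above. It uses Python's `re` module as a stand-in for SpamAssassin's Perl regex engine (an assumption; the two agree on `\b` and lookaheads for this pattern), and the names `SSN`, `samples`, and `results` are only illustrative.

```python
import re

# The pattern from the question, unchanged.
SSN = re.compile(r"\b(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b")

samples = [
    "18-007-08-9056-1462-2205",   # unwanted: should not match, but does
    "18 007-08-9056 1462-2205",   # wanted
    "007-08-9056",                # wanted
    "xyz 007-08-9056",            # wanted
    "007-08-9056 xyz",            # wanted
]

# Record whether the pattern finds anything in each sample.
results = {s: bool(SSN.search(s)) for s in samples}
for text, hit in results.items():
    print(f"{text!r}: {hit}")
```

Every sample reports a hit, including the first one, because `\b` is satisfied at the position between a hyphen and a digit.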
Julian

3 Answers


Your problem is that \b matches at any word boundary, and the position between a hyphen and a digit is one: - is a non-word character and digits are word characters. You can try something like this:

(?:^|[^-\d])((?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4})(?:$|[^-\d])

The match will then be available in $1. You might be able to find a more elegant solution based on your specific kind of input strings. (E.g., will the SSN always have whitespace around it? If so, you can use \s, etc.)
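
As a rough check of the pattern above, here is a small sketch in Python (an assumption; SpamAssassin itself uses Perl regexes, where the capture would be `$1`). The helper name `find_ssn` is hypothetical and only there for illustration.

```python
import re

# Same pattern as above; group 1 holds the SSN without the
# boundary character consumed on either side.
SSN = re.compile(
    r"(?:^|[^-\d])"
    r"((?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4})"
    r"(?:$|[^-\d])"
)

def find_ssn(text):
    """Return the first SSN-like substring, or None if there is none."""
    m = SSN.search(text)
    return m.group(1) if m else None

print(find_ssn("18-007-08-9056-1462-2205"))   # hyphen-joined run
print(find_ssn("18 007-08-9056 1462-2205"))   # space-separated
print(find_ssn("a007-08-9056"))               # letter directly in front
```

The hyphen-joined string is rejected and the space-separated one is accepted. Note that a letter directly before the number still counts as a valid left boundary here, since only dashes and digits are excluded.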

Ashton Wiersdorf
  • You might want to use `(?:^|[-\d])` for the beginning expression and `(?:$|[-\d])` at the end to also specifically allow empty strings at either end. Your current expression requires there to be at least one (non-dash, non-digit) character on either side. – tripleee Jun 25 '19 at 16:32
  • @tripleee You're exactly right. Thank you for catching that! – Ashton Wiersdorf Jun 25 '19 at 16:37
  • Of course, I lost the negation, sorry about that - you want `[^-\d]` in both places. – tripleee Jun 25 '19 at 16:39
  • *sheepish look* that's what I get for copy-paste. Thanks. – Ashton Wiersdorf Jun 25 '19 at 16:44
  • **This answer's regex will match `a007-08-9056` and I don't know if that's desired.** A word boundary is defined by two characters or a character and a start-of-file or end-of-file, so you can't say "`-` is considered a word boundary" since it is just one character. It counts as a non-word character, so it is a word boundary if and only if it abuts a word character (a letter, number, or underscore; see [my answer](https://stackoverflow.com/a/60674839/519360) for a more detailed definition of `\b`). – Adam Katz Mar 13 '20 at 17:28

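The \b assertion is a word boundary: it matches at any position where a word character meets a non-word character (in either order), or where a word character meets the start or end of the string. Digits are word characters and hyphens are not, so \b matches between a hyphen and a digit. To require a whitespace boundary instead, you can use lookarounds: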

(?<!\S)(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}(?!\S)

This specifies that there is no non-space character before the pattern, and no non-space character after. The lookaround allows you to specify this while still matching at the beginning or end of the string.
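
A quick way to see the effect is the sketch below, written in Python as an assumption (its `re` module supports the same single-character lookbehind and lookahead used here). The names `SSN` and `first` are illustrative only.

```python
import re

# Whitespace-delimited version: nothing but whitespace (or the
# string edge) may sit directly before or after the number.
SSN = re.compile(
    r"(?<!\S)(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}(?!\S)"
)

def first(text):
    """Return the first match as a string, or None."""
    m = SSN.search(text)
    return m.group() if m else None

print(first("18-007-08-9056-1462-2205"))      # rejected: hyphens on both sides
print(first("18 007-08-9056 1462-2205"))      # accepted
print(first("a007-08-9056"))                  # rejected: letter in front
print(first("007-08-9056."))                  # rejected: trailing period
print(SSN.findall("007-08-9056 123-45-6789")) # both found
```

Because the lookarounds consume no characters, two numbers separated by a single space are both found. The trade-off is that any adjacent punctuation, such as a sentence-ending period, also prevents a match.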

Grinnz
\b(?<![.-])(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b(?![.-])

This is the same as your regex, but it also excludes surrounding dashes and dots. Feel free to add to those character classes, but keep the dash (-) at the very start or end of the class, or else it will create a range.

\b matches a word break. You probably know this, but that means one side of it (either before or after but not both) must be a word character (a letter, number, or underscore) and the other side (either after or before but not both) must not be a word character (it may instead be a line break or nonexistent due to having reached the beginning/end of the string). You want this, but you want to exclude a few more things too. Therefore:

\b(?<![.-]) means that after the word break, check the previous character (if any). It must not match [.-] (a single character that is either dot or dash).

\b(?![.-]) means that after the word break, the next character (if any) must not match [.-].

When I say "if any" I am referring to the possibility that there is a line break, start of file, or end of file instead. Those will all satisfy these negative lookarounds.
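
Here is a minimal sketch of how this behaves on a few inputs, using Python's `re` module as a stand-in for the Perl engine (an assumption). The names `SSN` and `first` are hypothetical.

```python
import re

# Word boundaries plus lookarounds that reject a dot or dash
# immediately before or after the number.
SSN = re.compile(
    r"\b(?<![.-])(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}\b(?![.-])"
)

def first(text):
    """Return the first match as a string, or None."""
    m = SSN.search(text)
    return m.group() if m else None

print(first("18-007-08-9056-1462-2205"))  # rejected: dash on both sides
print(first("xyz 007-08-9056"))           # accepted
print(first("a007-08-9056"))              # rejected: no word break after "a"
print(first("(007-08-9056)"))             # accepted: parentheses are allowed
print(first("007-08-9056."))              # rejected: trailing dot
```

Compared with a whitespace-only boundary, this version still accepts numbers wrapped in punctuation such as parentheses, while a letter or digit directly next to the number fails the `\b` test.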

See also this full regex explanation, with examples, at regex101

Adam Katz