Here is a simple regular expression that will match alphanumeric sequences of 6 characters or more:
(?<![A-Za-z0-9])[A-Za-z0-9]{6,}
I used a negative lookbehind here instead of a word boundary (\b) in case there are underscores in your text: \b treats an underscore as a word character, so it would not see a boundary between an underscore and the start of an ID. If your regex flavor doesn't support lookbehind, you'll want to use the word boundary instead (but I see now that you mentioned Java in your question, and Java does support lookbehind).
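To see the difference between the two approaches, here is a minimal Java sketch. The sample text and the names BoundaryDemo and firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "id_abc123"; // hypothetical sample input
        // \b sees no boundary between '_' and 'a', since both are word characters
        System.out.println(firstMatch("\\b[A-Za-z0-9]{6,}", text));
        // the lookbehind only rejects a preceding letter or digit, so '_' is fine
        System.out.println(firstMatch("(?<![A-Za-z0-9])[A-Za-z0-9]{6,}", text));
    }
}
```

With this input the word-boundary version finds nothing, while the lookbehind version returns abc123.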
If the customer ID must contain at least one digit, then a regular expression to match these would look like this:
(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,}
See Regex101 demo.
Is there a limit to how long your customer IDs can be? If so, then putting that limit in would probably be helpful - any alphanumeric character sequence longer than that number obviously won't be a match. If the limit is 25 characters, for example, the regex would look like this:
(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])
(I added the lookahead at the end, otherwise this could simply match the first 25 characters of a long alphanumeric sequence!)
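A quick way to check the effect of the upper limit and the trailing lookahead is to run the pattern against a few sample strings. This is only a sketch: the 25-character limit is the example value used above, and the class name, method name and sample inputs are invented for illustration.

```java
import java.util.regex.Pattern;

class BoundedIdDemo {
    // 6 to 25 alphanumerics, at least one digit, not part of a longer run
    static final Pattern ID = Pattern.compile(
        "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])");

    // True if text contains at least one candidate ID.
    static boolean containsId(String text) {
        return ID.matcher(text).find();
    }

    public static void main(String[] args) {
        // a 7-character ID with digits, surrounded by spaces
        System.out.println(containsId("order for AB12345 today"));
        // long enough words, but none of them contains a digit
        System.out.println(containsId("no digits in customer"));
        // 30 alphanumerics in a row: longer than the limit
        System.out.println(containsId("A1B2C3D4E5F6G7H8I9J0K1L2M3N4O5"));
    }
}
```

Only the first call should report a match. Without the trailing lookahead, the third input would match on its first 25 characters.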
Once you have extracted the matches from your text, you could do a dictionary lookup on them. I know there are questions and answers on Stack Overflow on this subject.
To actually use this regex in Java, you would use the Pattern and Matcher classes. For example,
String mypattern = "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])";
Pattern tomatch = Pattern.compile(mypattern);
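From there, call tomatch.matcher(yourText) to get a Matcher and loop over find() to pull out each match. Hope this helps.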
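Putting the pieces together, a fuller sketch might look like the following. The class name, the extractIds helper and the sample text are hypothetical; only Pattern and Matcher come from the standard library (java.util.regex).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CustomerIdExtractor {
    static final String mypattern =
        "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])";
    // Compile once and reuse; Pattern objects are immutable and thread-safe.
    static final Pattern tomatch = Pattern.compile(mypattern);

    // Collects every candidate ID found in text, in order of appearance.
    static List<String> extractIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = tomatch.matcher(text);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        // hypothetical sample text
        String text = "Customers AB12345 and XYZ987654 called about invoice_200131.";
        System.out.println(extractIds(text));
    }
}
```

For this sample the result is [AB12345, XYZ987654, 200131]. Note that the digits after the underscore are picked up as a candidate too, which is why checking the extracted values against a list of known IDs afterwards is worthwhile.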
UPDATE
This just occurred to me: rather than trying a dictionary match, it might be better to store the extracted values in a database table and then compare them against your customers table.