
I need to extract customer IDs, which are unique alphanumeric character sequences, from text. An ID can consist of digits only, letters only, or a mix of both. We can assume that IDs are longer than 5 characters. They may or may not be capitalized.

I thought about using a dictionary: if a character sequence is longer than 5 characters and is not a word in the dictionary, it is a good candidate.

Any ideas or sample Java code would help. Thanks.

alan turing
  • What does your text look like? What have you tried? – David Faber Jan 10 '15 at 15:51
  • They are short chat utterances. I tried regex-based methods, but it is hard to capture the variability of all customer IDs with them. I would need to write many regex patterns if I went that way. – alan turing Jan 10 '15 at 15:54
  • I don't think it's possible to answer your question without seeing example data. – David Faber Jan 10 '15 at 16:14
  • As I understand it, you need a method to distinguish normal words from letter-only identifiers like DRWOS? I think you can detect IDs by improbable letter sequences, e.g. ZMY or SPM. You could use a table of three-letter sequence probabilities instead of a dictionary of words. – Mark Shevchenko Jan 10 '15 at 16:36
  • I took a shot at the regex part for what it's worth. – David Faber Jan 11 '15 at 04:34
  • You should provide an excerpt of the text. – Regular Jo Jan 11 '15 at 04:56
  • For example, ACC35454656D, 34566899, KHL343943986, KL235456695NYU, FBRYUSJ. There are 100k IDs with different styles. There is no single pattern. – alan turing Jan 11 '15 at 09:59
  • But what other data does the text contain that is interfering? – Regular Jo Jan 11 '15 at 17:30

1 Answer


Here is a simple regular expression that will match alphanumeric sequences of 6 characters or more:

(?<![A-Za-z0-9])[A-Za-z0-9]{6,}

I used a negative lookbehind here instead of a word boundary (\b) in case there are underscores in your text. If your regex flavor doesn't have lookbehind, you'll want to use the word boundary instead (but I note now that you mentioned Java in your question, and Java does have lookbehind).

If the customer ID must contain a number, then a regular expression to match these would look like this:

(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,}

See Regex101 demo.

Is there a limit to how long your customer IDs can be? If so, then putting that limit in would probably be helpful - any alphanumeric character sequence longer than that number obviously won't be a match. If the limit is 25 characters, for example, the regex would look like this:

(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])

(I added the lookahead at the end, otherwise this could simply match the first 25 characters of a long alphanumeric sequence!)

Once you have extracted the matches from your text, you could then do a dictionary lookup. I know there are questions and answers on Stack Overflow on this subject.
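As a rough sketch of that lookup step, assuming the candidates were collected with the first pattern (the one that also allows letters-only sequences): anything containing a digit is kept, and letters-only candidates are kept only if they are not in a word list. The class name DictionaryFilter, the method dropDictionaryWords, and the tiny word set in main are hypothetical; a real word list would have to be loaded from somewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DictionaryFilter {
    // Keeps a candidate if it contains a digit, or if it is not a known word.
    // lowercaseWords is assumed to hold dictionary words in lower case.
    static List<String> dropDictionaryWords(List<String> candidates, Set<String> lowercaseWords) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            boolean hasDigit = candidate.chars().anyMatch(Character::isDigit);
            boolean isWord = lowercaseWords.contains(candidate.toLowerCase(Locale.ROOT));
            if (hasDigit || !isWord) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical three-word dictionary, for illustration only.
        Set<String> words = Set.of("account", "please", "customer");
        System.out.println(dropDictionaryWords(
                List.of("FBRYUSJ", "Account", "KHL343943986"), words));
    }
}
```

Note that a letters-only ID that happens to be a real word would be dropped by this filter, so it trades some recall for precision.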

To actually use this regex in Java, you would use the Pattern and Matcher classes. For example,

String mypattern = "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])";
Pattern tomatch = Pattern.compile(mypattern);

Etc. Hope this helps.
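For completeness, here is a minimal sketch of how the Pattern and Matcher usage might be finished. It uses the same 6-to-25-character pattern as above, so it assumes a 25-character limit and at least one digit per ID; the class name CustomerIdExtractor and the method extract are just illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CustomerIdExtractor {
    // 6 to 25 alphanumerics containing at least one digit, not preceded
    // or followed by another alphanumeric character.
    private static final Pattern ID_PATTERN = Pattern.compile(
            "(?<![A-Za-z0-9])(?=[A-Za-z]*[0-9][A-Za-z0-9]*)[A-Za-z0-9]{6,25}(?![A-Za-z0-9])");

    // Returns every match in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = ID_PATTERN.matcher(text);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extract("my id is ACC35454656D, or maybe 34566899"));
    }
}
```

Because this pattern requires a digit, a letters-only ID such as FBRYUSJ is not returned; the first pattern would be needed to catch those.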

UPDATE

This just occurred to me: rather than trying a dictionary match, it might be better to store the extracted values in a database table and then compare them against your customers table.
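To illustrate the idea without a database, here is a small sketch in which an in-memory Set stands in for the customers table. That substitution is an assumption for the example only; with 100k IDs a real table join or an indexed lookup would do the same job. The names KnownIdMatcher and keepKnown are hypothetical, and the comparison is done in upper case because the IDs may or may not be capitalized in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class KnownIdMatcher {
    // Keeps only candidates that are known customer IDs.
    // knownUppercaseIds is assumed to hold the IDs in upper case.
    static List<String> keepKnown(List<String> candidates, Set<String> knownUppercaseIds) {
        List<String> confirmed = new ArrayList<>();
        for (String candidate : candidates) {
            String normalized = candidate.toUpperCase(Locale.ROOT);
            if (knownUppercaseIds.contains(normalized)) {
                confirmed.add(normalized);
            }
        }
        return confirmed;
    }

    public static void main(String[] args) {
        // Stand-in for the customers table.
        Set<String> customers = Set.of("ACC35454656D", "34566899", "FBRYUSJ");
        System.out.println(keepKnown(List.of("acc35454656d", "holiday", "34566899"), customers));
    }
}
```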

David Faber