1

Working on something similar to Solr's WordDelimiterFilter, but not in Java.

Want to split words into tokens like this:

P90X                 = P, 90, X (split on word/number boundary)

TotallyCromulentWord = Totally, Cromulent, Word (split on lowercase/uppercase boundary)

TransAM              = Trans, AM

Looking for a general solution, not specific to the above examples. Preferably in a regex flavour that doesn't support lookbehind, but I can use PL/Perl if necessary, which can do lookbehind.

Found a few answers on SO, but they all seemed to use lookbehind.

Things to split on:

  1. Transition from lowercase letter to uppercase letter
  2. Transition from letter to number or number to letter
  3. (Optional) split on a few other characters (- _)

My main concern is 1 and 2.

Neil McGuigan
  • Can you show your expected output? – anubhava Aug 22 '14 at 18:09
  • Without looking at your link, it would be better to list _all_ the rules that apply to the segregation of your continuous letters. –  Aug 22 '14 at 18:09
  • Thank you for telling us. Good luck with your search. – tcooc Aug 22 '14 at 18:13
  • A major issue with this question is that you're asking for one regex that does four different things and can somehow differentiate between an acronym (IBM) and simply an uppercased word "TransAM". This is a broad problem space to be in. – George Stocker Aug 22 '14 at 18:16
  • The first 3 can be done in a single regex. The last one (acronyms) would have to be done after, or if a known list is available, can be done at once. –  Aug 22 '14 at 18:17

2 Answers

2

That's not something I'd like to do without lookbehind, but for the challenge, here is a JavaScript solution that you should be able to convert easily to whatever language you're using:

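function split(s) {
    var match;
    var result = [];
    // Repeatedly peel the last token off the end of the string; anchoring at $
    // is what removes the need for lookbehind.
    while ((match = s.match(/([A-Z]+|[A-Z]?[a-z]+|[0-9]+|([^a-zA-Z0-9])+)$/)) !== null) {
        if (!match[2]) {
            // Group 2 is set only for a run of non-alphanumeric characters,
            // so skip those and keep everything else.
            result.unshift(match[1]);
        }
        // Drop the matched token and continue with the remaining prefix.
        s = s.substring(0, s.length - match[1].length);
    }
    return result;
}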

Demo:

P90X [ 'P', '90', 'X' ]
TotallyCromulentWord [ 'Totally', 'Cromulent', 'Word' ]
TransAM [ 'Trans', 'AM' ]
URLConverter [ 'URL', 'Converter' ]
Abc.DEF$012 [ 'Abc', 'DEF', '012' ]
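
For flavours that offer a global match and negative lookahead (an assumption; the question only rules out lookbehind), the same token classes can probably be collected in a single forward pass instead of peeling tokens off the end. A minimal sketch; `splitForward` is a hypothetical name, not part of the function above:

```javascript
// Hypothetical forward-scanning variant; uses lookahead only, no lookbehind.
function splitForward(s) {
    // 1. [A-Z]+(?![a-z])  an uppercase run that is not followed by a lowercase
    //                     letter, so "URLConverter" yields "URL", not "URLC"
    // 2. [A-Z]?[a-z]+     an optionally capitalised lowercase word
    // 3. [0-9]+           a run of digits
    // Anything else (-, _, punctuation) is never matched, so it acts as a delimiter.
    return s.match(/[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+/g) || [];
}

console.log(splitForward("P90X"));
console.log(splitForward("URLConverter"));
```

The order of the alternatives matters here: trying the uppercase run first, with the lookahead forcing it to back off by one letter, is what keeps the capital that starts the next word attached to that word.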
Volune
0

This regex should split all the words in a paragraph or string into tokens.
It also works for the simple cases in your example.

Match globally. Other specific delimiters can be added as well if you need them.

   # /(?:[A-Z]?[a-z]+(?=[A-Z\d]|[^a-zA-Z\d]|$)|[A-Z]+(?=[a-z\d]|[^a-zA-Z\d]|$)|\d+(?=[a-zA-Z]|[^a-zA-Z\d]|$))[^a-zA-Z\d]*|[^a-zA-Z\d]+/

   (?:
        [A-Z]? [a-z]+ 
        (?= [A-Z\d] | [^a-zA-Z\d] | $ )
     |  
        [A-Z]+ 
        (?= [a-z\d] | [^a-zA-Z\d] | $ )
     |  
        \d+ 
        (?= [a-zA-Z] | [^a-zA-Z\d] | $ )
   )
   [^a-zA-Z\d]* 
|  
   [^a-zA-Z\d]+
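
One way to apply the pattern above in JavaScript, as a sketch only. Each match also consumes any delimiters that trail the token (the `[^a-zA-Z\d]*` part), so the sketch trims them and drops delimiter-only matches; `tokenize` and `WORD_PARTS` are hypothetical names introduced for illustration.

```javascript
// The pattern from above on one line, with the global flag so that
// String.prototype.match returns every match instead of only the first.
var WORD_PARTS = /(?:[A-Z]?[a-z]+(?=[A-Z\d]|[^a-zA-Z\d]|$)|[A-Z]+(?=[a-z\d]|[^a-zA-Z\d]|$)|\d+(?=[a-zA-Z]|[^a-zA-Z\d]|$))[^a-zA-Z\d]*|[^a-zA-Z\d]+/g;

// Hypothetical helper: returns only the alphanumeric tokens.
function tokenize(s) {
    var matches = s.match(WORD_PARTS) || [];
    return matches
        // Strip the delimiters that the pattern swallowed after each token.
        .map(function (m) { return m.replace(/[^a-zA-Z\d]+$/, ""); })
        // A match made only of delimiters is now empty, so discard it.
        .filter(function (m) { return m.length > 0; });
}

console.log(tokenize("P90X"));
console.log(tokenize("foo-bar_baz"));
```

Note that the second alternative accepts an uppercase run followed by a lowercase letter, so an input such as `URLConverter` would be expected to come out as `URLC` and `onverter` with this pattern.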