63

I'm trying to work out a way of splitting up a string in Java that follows a pattern like so:

String a = "123abc345def";

The results from this should be the following:

x[0] = "123";
x[1] = "abc";
x[2] = "345";
x[3] = "def";

However I'm completely stumped as to how I can achieve this. Please can someone help me out? I have tried searching online for a similar problem, however it's very difficult to phrase it correctly in a search.

Please note: the number of letters and digits in each run may vary (e.g. the string could be '1234a5bcdef').

Tim Pietzcker
  • I haven't tried anything yet - I don't even know where to begin with the problem as it's the first time I've come across anything quite like it. –  Nov 25 '11 at 14:57
  • Users are asked to add a "homework" tag to all questions regarding homework problems. – Michael Nov 25 '11 at 14:58
  • 1
    @Michael this isn't a 'homework' question. I have just never come across this sort of problem before. –  Nov 25 '11 at 15:00
  • This is not a 'homework' question there are cases where you need to do this. – Dylan Dec 02 '19 at 00:58
  • I came here because I was looking for a similar solution. In my case I get back a long string of "rules" and have to split them before performing a lookup. – Captain Kenpachi Feb 24 '21 at 10:20

8 Answers

108

You could try to split on (?<=\D)(?=\d)|(?<=\d)(?=\D), like:

str.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");

It matches the positions between a digit and a non-digit (in either order).

  • (?<=\D)(?=\d) - matches a position between a non-digit (\D) and a digit (\d)
  • (?<=\d)(?=\D) - matches a position between a digit and a non-digit.
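
To make the behaviour of this pattern concrete, here is a minimal, self-contained sketch that applies it to both strings from the question. The class name LookaroundSplitDemo and the helper splitOnBoundaries are made up for illustration; only String.split and the regex itself come from the answer.

```java
import java.util.Arrays;

final class LookaroundSplitDemo {
    // Zero-width pattern: matches the gap between a non-digit and a digit,
    // or between a digit and a non-digit. No characters are consumed.
    static final String BOUNDARY = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)";

    // Hypothetical helper: splits the input at every digit/non-digit boundary.
    static String[] splitOnBoundaries(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        // prints [123, abc, 345, def]
        System.out.println(Arrays.toString(splitOnBoundaries("123abc345def")));
        // prints [1234, a, 5, bcdef]
        System.out.println(Arrays.toString(splitOnBoundaries("1234a5bcdef")));
    }
}
```

Because the pattern is zero-width, nothing is removed from the input: every character ends up in exactly one of the resulting pieces.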
Wiktor Stribiżew
Qtax
  • 4
    Just keep in mind that this solution will treat characters that are neither digits nor letters as letters, so you might want to verify your parts. – Mario Nov 25 '11 at 15:04
  • @TimPietzcker I wasn't the one down-voting this question - I've never seen this used in Java and was candidly asking for confirmation it works in Java. Now I'm even upvoting that. – Romain Nov 25 '11 at 15:18
  • 1
    Using `[a-zA-Z]` instead of `\\D` would guarantee matching actual text characters. Works for accented characters too (e.g. áü) – Agostino Mar 27 '15 at 22:32
  • 2
    `\D` works for accented characters, `[a-zA-Z]` does not. If you want to match unicode "letters" specifically, you could use `\p{L}` or `\p{L}\p{M}*`. – Qtax Feb 23 '16 at 12:38
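
The difference between `\D`, `[a-zA-Z]` and `\p{L}` raised in these comments can be checked with a small sketch. The class UnicodeLetterDemo, its extract helper and the sample input are assumptions made for illustration; the accented characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeLetterDemo {
    // Hypothetical helper: returns every match of regex in input, in order.
    static List<String> extract(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "12", then a-acute and u-umlaut, then "34"
        String input = "12\u00e1\u00fc34";

        // ASCII-only letter class: the accented run is skipped, leaving 12 and 34.
        System.out.println(extract("[a-zA-Z]+|\\d+", input));

        // Unicode letter class: the accented run is kept as its own token.
        System.out.println(extract("\\p{L}+|\\d+", input));

        // Non-digit boundaries: the accented run is kept because it is simply "not a digit".
        System.out.println(Arrays.toString(input.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)")));
    }
}
```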
11

How about:

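import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Collects each run of digits, lowercase letters, or uppercase letters as one token.
// Note: a mixed-case run such as "abCD" yields two tokens ("ab" and "CD").
private List<String> parse(String str) {
    List<String> output = new ArrayList<String>();
    Matcher match = Pattern.compile("[0-9]+|[a-z]+|[A-Z]+").matcher(str);
    while (match.find()) {
        output.add(match.group());
    }
    return output;
}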
nullpotent
10

You can try this:

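import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each match is either a run of lowercase letters or a run of digits.
Pattern p = Pattern.compile("[a-z]+|\\d+");
Matcher m = p.matcher("123abc345def");
ArrayList<String> allMatches = new ArrayList<>();
while (m.find()) {
    allMatches.add(m.group());
}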

The result (allMatches) will be:

["123", "abc", "345", "def"]
The Anh Nguyen
5

Use two different patterns: [0-9]* and [a-zA-Z]* and split twice by each of them.

mishadoff
  • Thanks for your help on this. I'm not sure I fully understand what you mean. Please could you explain in a bit more detail or provide a basic example so I can see what you mean? –  Nov 25 '11 at 14:59
  • Semantically, it'd be `[0-9]+` and `[a-zA-Z]+`... Though they'll do the same. – Romain Nov 25 '11 at 15:00
  • First you split your string on the digit pattern, which gives an array of letter strings; after that you split the string on the letter pattern, which gives an array of numbers. Concatenate the two arrays and you will get what you want. – mishadoff Nov 25 '11 at 15:11
  • @mishadoff: You'd have to interleave the arrays, otherwise you get the elements in the wrong order. This is a needless complication that could easily be avoided by using a regex like the one Qtax suggested. – Tim Pietzcker Nov 25 '11 at 16:19
  • agree, Qtax solution is better. – mishadoff Nov 25 '11 at 16:39
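
To illustrate the point made in these comments, here is a small sketch of what the two separate splits return for the string from the question. The class TwoSplitsDemo and its helper names are made up for illustration.

```java
import java.util.Arrays;

final class TwoSplitsDemo {
    // Splitting on digit runs leaves the letter runs (hypothetical helper).
    static String[] lettersOf(String s) {
        return s.split("[0-9]+");
    }

    // Splitting on letter runs leaves the digit runs (hypothetical helper).
    static String[] numbersOf(String s) {
        return s.split("[a-zA-Z]+");
    }

    public static void main(String[] args) {
        // prints [, abc, def] (note the empty first element: the input starts with a delimiter)
        System.out.println(Arrays.toString(lettersOf("123abc345def")));
        // prints [123, 345]
        System.out.println(Arrays.toString(numbersOf("123abc345def")));
    }
}
```

Neither array records where its elements sat in the original string, and the first one starts with an empty string, so simple concatenation gives the wrong order. The pieces would have to be interleaved, which is the complication the single-regex approach avoids.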
4

If you are looking for a solution that does not use Java's String functionality (i.e. split, match, etc.), then the following should help:

import java.util.ArrayList;
import java.util.List;

// Splits the input into alternating runs of digits and non-digits.
// An empty input yields an empty list.
List<String> splitString(String string) {
    List<String> list = new ArrayList<String>();
    int e = 0;
    while (e < string.length()) {
        // The first character of a run decides whether it is a digit run.
        boolean digits = isNumber(string.charAt(e));
        StringBuilder token = new StringBuilder();
        // Consume characters until the character class changes.
        while (e < string.length() && isNumber(string.charAt(e)) == digits) {
            token.append(string.charAt(e++));
        }
        list.add(token.toString());
    }
    return list;
}

// True for the ASCII digits '0' to '9' only.
boolean isNumber(char c) {
    return c >= '0' && c <= '9';
}

This solution splits the input into numbers and 'words', where a 'word' is any run of characters that contains no digits. If you want 'words' to contain only English letters, you can add more checks alongside isNumber, depending on your requirements (for example, you may wish to skip words that contain non-English letters). Also note that splitString returns a List<String> (backed by an ArrayList), which can later be converted to a String array, e.g. with list.toArray(new String[0]).
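
As one possible reading of "adding more conditions", the sketch below keeps only runs of ASCII digits and runs of English letters and silently skips every other character. The class LetterDigitSplitter, the isEnglishLetter check and the skip-everything-else policy are assumptions, not part of the answer above; it also uses substring for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class LetterDigitSplitter {
    static boolean isNumber(char c) {
        return c >= '0' && c <= '9';
    }

    // Hypothetical extra condition: ASCII letters only.
    static boolean isEnglishLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            if (isNumber(s.charAt(i))) {
                // Consume the whole run of digits.
                while (i < s.length() && isNumber(s.charAt(i))) i++;
                tokens.add(s.substring(start, i));
            } else if (isEnglishLetter(s.charAt(i))) {
                // Consume the whole run of English letters.
                while (i < s.length() && isEnglishLetter(s.charAt(i))) i++;
                tokens.add(s.substring(start, i));
            } else {
                // Assumed policy: drop anything that is neither digit nor letter.
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("12-ab_3")); // prints [12, ab, 3]
    }
}
```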

sergeyan
2

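I haven't used Java for ages, so this is only a rough sketch to help you get started. It repeatedly strips a leading run of digits or letters from the front of the string.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String a = "123abc345def";
List<String> result = new ArrayList<>();
Pattern digits = Pattern.compile("\\d+");
Pattern letters = Pattern.compile("[a-zA-Z]+");
while (!a.isEmpty()) {
    Matcher m = digits.matcher(a);      // match digits at the start
    if (!m.lookingAt()) {
        m = letters.matcher(a);         // match letters at the start
        if (!m.lookingAt()) {
            break;                      // something invalid - neither digit nor letter
        }
    }
    String part = m.group();
    result.add(part);
    a = a.substring(part.length());     // remove the part we've found
}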
Mario
1

I was doing this sort of thing in mission-critical code, where every fraction of a second counts because I need to process 180k entries in an unnoticeable amount of time. So I skipped regex and split altogether and processed each element inline (though adding them to an ArrayList<String> would be fine too). If you want to do this exact thing but need it to be something like 20x faster...

void parseGroups(String text) {
    int last = 0;
    int state = 0;
    for (int i = 0, s = text.length(); i < s; i++) {
        switch (text.charAt(i)) {
            case '0':
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7':
            case '8':
            case '9':
                if (state == 2) {
                    processElement(text.substring(last, i));
                    last = i;
                }
                state = 1;
                break;
            default:
                if (state == 1) {
                    processElement(text.substring(last, i));
                    last = i;
                }
                state = 2;
                break;
        }
    }
    processElement(text.substring(last));
}
Tatarize
1

Wouldn't this "\d+|\D+" do the job instead of the cumbersome: "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)" ?
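
Whether the shorter pattern works depends on how it is used. A minimal sketch, with a made-up class FindVersusSplitDemo and helper names chosen for illustration: with Matcher.find() each run is returned as a match, but with String.split() the same pattern consumes the whole input as delimiters, which is why the split-based answer needs the zero-width lookarounds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindVersusSplitDemo {
    // A run of digits or a run of non-digits.
    static final String RUNS = "\\d+|\\D+";

    // Hypothetical helper: collects every run as a token via Matcher.find().
    static List<String> findAll(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(RUNS).matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Hypothetical helper: uses the same pattern as a split delimiter.
    static String[] splitWith(String s) {
        return s.split(RUNS);
    }

    public static void main(String[] args) {
        // prints [123, abc, 345, def]
        System.out.println(findAll("123abc345def"));
        // prints [] because every character is part of a delimiter
        System.out.println(Arrays.toString(splitWith("123abc345def")));
    }
}
```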

Andrew Anderson