When I split a string in Python, adjacent space delimiters are merged:

>>> str = "hi              there"
>>> str.split()
['hi', 'there']

In Java, the delimiters are not merged:

$ cat Split.java
class Split {
    public static void main(String args[]) {
        String str = "hi              there";
        String result = "";
        for (String tok : str.split(" "))
            result += tok + ",";
        System.out.println(result);
    }
}
$ javac Split.java ; java Split
hi,,,,,,,,,,,,,,there,

Is there a straightforward way to get Python's space-split semantics in Java?

Andrew Prock

5 Answers

String.split accepts a regular expression, so provide it with one that matches adjacent whitespace:

str.split("\\s+")

If you want to emulate the exact behaviour of Python's str.split(), you'd need to trim as well:

str.trim().split("\\s+")

Quote from the Python docs on str.split():

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

So the above is still not an exact equivalent, because it will return [''] for the empty string, but it's probably okay for your purposes :)
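To make the difference between the variants above concrete, here is a minimal sketch. The class name `WhitespaceSplitDemo`, the helper names, and the sample strings are made up for illustration; the only library calls are `String.trim`, `String.split`, and `Arrays.toString`.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {

    // Split on runs of whitespace without trimming first.
    static String[] splitOnly(String s) {
        return s.split("\\s+");
    }

    // Trim first, then split on runs of whitespace.
    static String[] trimThenSplit(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String padded = "  hi   there  ";

        // Leading whitespace produces an empty first element;
        // trailing empty strings are dropped by split.
        System.out.println(Arrays.toString(splitOnly(padded)));     // [, hi, there]

        // Trimming first removes the empty leading element.
        System.out.println(Arrays.toString(trimThenSplit(padded))); // [hi, there]

        // The empty string still yields one (empty) element,
        // whereas Python's "".split() returns [].
        System.out.println(trimThenSplit("").length);               // 1
    }
}
```

If the empty-string case matters for your input, a guard such as `s.trim().isEmpty()` before splitting is one way to handle it.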

Niklas B.
  • Thank you. I was having a hard time figuring out the syntax of the regex in the documentation: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29 – Andrew Prock Apr 09 '12 at 21:38
  • @Andrew: A better place to learn about regexes is http://www.regular-expressions.info/ :) – Niklas B. Apr 09 '12 at 21:42
  • It wasn't regexes in general, but regexes in Java that were tripping me up. I had been trying " *", which was failing miserably. – Andrew Prock Apr 10 '12 at 02:50
  • @Andrew: That's in fact a regex problem, because that pattern matches the empty string. Splitting on the empty string gives you all the characters as separate matches in every regex implementation I know of :) – Niklas B. Apr 10 '12 at 12:12
  • Reading http://en.wikipedia.org/wiki/Regular_expression, it seems Java uses Perl-style REs, whereas most UNIX utilities use POSIX REs. – Andrew Prock Apr 10 '12 at 17:39
  • @Andrew: I can't quite follow you. This has *nothing* to do with PCRE vs. POSIX, the regex works exactly the same in all implementations. I also don't know what you mean by "it works in less and sed", because those don't have a notion of string splitting. The particular problem with ` *` instead of ` +` is also present there, of course. [Observe](http://pastie.org/3763953). The problem is that the former matches the empty string, while the latter requires *at least one space*. – Niklas B. Apr 10 '12 at 19:44
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9937/discussion-between-andrew-p-and-niklas-b) – Andrew Prock Apr 10 '12 at 22:20
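
The point made in the comments about `" *"` versus `" +"` can be checked with a short sketch. The class name `EmptyMatchDemo` and the helper `splitOn` are hypothetical; the exact tokens produced by the `" *"` split depend on the Java version, so no expected output is claimed for it here.

```java
import java.util.Arrays;

class EmptyMatchDemo {

    // Thin wrapper around String.split so both patterns are called the same way.
    static String[] splitOn(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = "hi  there";

        // " *" matches zero or more spaces, so it also matches the empty string.
        System.out.println("".matches(" *")); // true
        // " +" needs at least one space, so it cannot match the empty string.
        System.out.println("".matches(" +")); // false

        // With " +" only the run of spaces acts as a separator.
        System.out.println(Arrays.toString(splitOn(input, " +"))); // [hi, there]

        // With " *" the empty matches between letters also act as separators,
        // so the words themselves are broken apart.
        System.out.println(Arrays.toString(splitOn(input, " *")));
    }
}
```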

Use str.split("\\s+") instead. This will do what you need.

Eugene Retunsky

Java's String.split splits on a regex, so splitting on a single space gives you an empty array element for each additional space in a run.

When no separator is passed, Python's split() strips leading and trailing whitespace and treats each run of whitespace as a single separator.

So the closer equivalent would be

"my             string".trim().split("\\s+"); 
Mike McMahon
  • No, this is not quite the same for a string like `foo\tbar` – Niklas B. Apr 09 '12 at 21:19
  • Right, but that's not what has been presented :) What has been presented is foo bar, not foo\tbar. :) – Mike McMahon Apr 09 '12 at 21:20
  • What was asked is the equivalent to `split()` in Python, which your example is not. – Niklas B. Apr 09 '12 at 21:20
  • uhhh "If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed" so sorry, let's do this: rtrim, ltrim, and then " +" ;) which is exactly what has been stated. – Mike McMahon Apr 09 '12 at 21:23
  • Mike: Right, didn't think about the trimming part. Still, whitespace includes more than just the space character (ASCII `0x20`) – Niklas B. Apr 09 '12 at 21:24
  • and you are correct, it's ignorant to forget character encoding. Those are the kinds of bugs we want to avoid! :) – Mike McMahon Apr 09 '12 at 21:26

The problem with Niklas B.'s answer is that trim has its own definition of whitespace, i.e., anything with code up to '\u0020'. The following should get close enough to the Python version, including the fix for the empty string:

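class TestSplit {

    private static final String[] EMPTY = {};

    // Approximates Python's str.split() called with no arguments.
    private static String[] pySplit(String s) {
        // Strip leading and trailing whitespace using the regex class \s instead of trim().
        s = s.replaceAll("^\\s+", "").replaceAll("\\s+$", "");
        // Python returns [] for an empty or all-whitespace string; split alone would return [""].
        if (s.isEmpty()) return EMPTY;
        return s.split("\\s+");
    }
}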
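
As a rough illustration of where the two approaches diverge, the sketch below repeats the `pySplit` logic so that it compiles on its own and compares it with a `trim()`-based variant. The class name `TrimVersusRegexDemo`, the helper `trimSplit`, and the sample input containing the control character U+0001 are assumptions chosen for the demo. `trim()` removes that character because its code is below U+0020, while `\s` does not match it.

```java
import java.util.Arrays;

class TrimVersusRegexDemo {

    private static final String[] EMPTY = {};

    // Same approach as the pySplit method above.
    static String[] pySplit(String s) {
        s = s.replaceAll("^\\s+", "").replaceAll("\\s+$", "");
        if (s.isEmpty()) return EMPTY;
        return s.split("\\s+");
    }

    // The trim()-based variant, for comparison.
    static String[] trimSplit(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // A leading control character that the regex class \s does not match.
        String input = "\u0001hi there";

        // pySplit keeps the control character as part of the first token.
        System.out.println(pySplit(input)[0].length());   // 3
        // trim() strips it, so the first token is just "hi".
        System.out.println(trimSplit(input)[0].length()); // 2

        // Empty and all-whitespace inputs give an empty array with pySplit.
        System.out.println(pySplit("").length);    // 0
        System.out.println(pySplit("   ").length); // 0
        // The trim()-based variant gives one empty element instead.
        System.out.println(trimSplit("").length);  // 1
    }
}
```

Note that Java's `\s` covers only a handful of ASCII whitespace characters by default, so this is still an approximation of Python's behaviour on Unicode strings.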
simleo

In Java, String.split takes a regex. So you can do str.split(" +") to get Python semantics.

fqsxr
  • No, this is not quite the same for a string like `foo\tbar` – Niklas B. Apr 09 '12 at 21:16
  • @NiklasB. OK, I see. OP asked for split() not split(' ') – fqsxr Apr 09 '12 at 21:17
  • Well, the question specifically asked about "space delimiters" and for "space split semantics" as opposed to "whitespace delimiters" or "whitespace split semantics", so in an overly pedantic sense the whitespace comments are a bonus, not part of the original question. – Andrew Prock Apr 10 '12 at 02:53
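
For anyone wondering what the `foo\tbar` objection amounts to in practice, here is a small sketch comparing the two patterns. The class name `SpaceVersusWhitespaceDemo` and the helper names are made up for illustration.

```java
import java.util.Arrays;

class SpaceVersusWhitespaceDemo {

    // Splits on runs of the space character only.
    static String[] splitOnSpaces(String s) {
        return s.split(" +");
    }

    // Splits on runs of any whitespace matched by \s (space, tab, newline, ...).
    static String[] splitOnWhitespace(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String tabbed = "foo\tbar";

        // A tab is not a space, so " +" finds no separator and returns the whole string.
        System.out.println(splitOnSpaces(tabbed).length);                     // 1
        // "\\s+" treats the tab as a separator.
        System.out.println(Arrays.toString(splitOnWhitespace(tabbed)));       // [foo, bar]

        // For input that contains only spaces, both patterns agree.
        System.out.println(Arrays.toString(splitOnSpaces("hi     there")));     // [hi, there]
        System.out.println(Arrays.toString(splitOnWhitespace("hi     there"))); // [hi, there]
    }
}
```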