7

String class has some methods that i cannot understand why they were implemented like this... replace is one of them.

public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
            this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}

Are there some significant advantages over a simpler and more efficient (fast!) method?

public static String replace(String string, String searchFor, String replaceWith) {

    StringBuilder result=new StringBuilder();

    int index=0;
    int beginIndex=0;
    while((index=string.indexOf(searchFor, index))!=-1){
        result.append(string.substring(beginIndex, index)+replaceWith);
        index+=searchFor.length();
        beginIndex=index;
    }
    result.append(string.substring(beginIndex, string.length()));

    return result.toString();

}

Stats with Java 7:
1,000,000 iterations
replace "b" with "x" in "a.b.c"
result: "a.x.c"

Times:
string.replace: 485ms
string.replaceAll: 490ms
optimized replace = 180ms

Code like the Java 7 split method is heavily optimized to avoid pattern compile / regex processing when possible:

public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));

        // Construct result
        int resultSize = list.size();
        if (limit == 0)
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
                resultSize--;
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

Following the logic of the replace method:

public String replaceAll(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}

The split implementation should be:

public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}

The performance losses are not far from the ones found on the replace methods. For some reason Oracle gives a fastpath approach on some methods and not others.

marcolopes
  • 9,232
  • 14
  • 54
  • 65
  • 4
    "What are the reasons of the implementation of the Java native method?" <-- Ask the Java team? – George Stocker Jun 09 '14 at 13:57
  • their `replace()` uses their `replaceAll()`. What is wrong there? Why duplicating the code for replacement? – Arnaud Denoyelle Jun 09 '14 at 13:58
  • replace method does efficient replaceAll (that is, without involving regex) – RokL Jun 09 '14 at 14:16
  • Same performance issue (slow), but i guess there is a design reason to prefer a more generic approach at performance cost that will not be relevant in most of the cases... – marcolopes Jun 09 '14 at 14:25
  • Can we have some stats ? This might turn into a lesson of Automata theory ? How about the complexity of both the methods ? – bsd Jun 09 '14 at 16:28
  • My guess would be that *they wanted regex functionality*. `String.replace` allows developers to do regex replacement without all the boilerplate that comes with building a regular expression by hand (`Pattern`, `Matcher`, etc.). – Henry Keiter Jun 09 '14 at 16:58
  • 2
    @HenryKeiter - But `replace` doesn't do regex processing. – Hot Licks Jun 09 '14 at 17:00
  • @HotLicks ...huh. Right you are. – Henry Keiter Jun 09 '14 at 17:04
  • I bet Pattern.compile is optimized for the Pattern.LITERAL case. That, and you can essentially run Aho Corasick (great for repeated replacements) if you're using a Pattern. indexOf, not so much. – David Ehrmann Jun 09 '14 at 17:04
  • 2
    My guess is that at one point they were reworking all the hairy String functions to account for some change (possibly when replacing String with CharSequence), and it was easier/safer to use common code as much as possible rather than custom-coding each function. – Hot Licks Jun 09 '14 at 17:05

1 Answers1

8

Are you sure your proposed method is indeed faster than the regex-based one used by the String class - not just for your own test input, but for every possible input that a program might throw at it? It relies on String.indexOf to do substring matching, which is itself a naive implementation that is subject to bad worst-case performance. It's entirely possible that Pattern implements a more sophisticated matching algorithm such as KMP to avoid redundant comparisons.

In general, the Java team takes performance of the core libraries very seriously, and maintains lots of internal benchmarks using a wide range of real-world data. I've never encountered a situation where regex processing was a bottleneck. My standing advice is to start by writing the simplest possible code that works correctly, and don't even begin to think about rewriting the Java built-ins until profiling proves that it's a bottleneck, and you've exhausted all other avenues of optimization.

Regarding your latest edit - first, I would not describe the split method as heavily optimized. It handles one special case that happens to be extremely common and is guaranteed not to suffer from the poor worst-case complexity described above for the naive string matching algorithm - that of splitting on a single-character, literal token.

It may very well be that the same special case could be optimized for replace, and would provide some measurable improvement. But look what it took to achieve that simple optimization - about 50 lines of code. Those lines of code come at a cost, especially when they're a part of what's probably the most widely-used class in the Java library. Cost comes in many forms:

  • Resources - That's 50 lines of code that some developer must spend time writing, testing, documenting, and maintaining for the lifetime of the Java language.
  • Risk - That's 50 opportunities for subtle bugs that slip past the initial testing.
  • Complexity - That's 50 extra lines of code that any developer who wants to understand how the method works must now take time to read and understand.

Your question now boils down to "why was this one method optimized to handle a special case, but not the other one?" or even more generally "why was this particular feature not implemented?" Nobody but the original author can answer that definitively, but the answer is almost always that either there is not sufficient demand for that feature, or that the benefit derived from having the feature is deemed not worth the cost of adding it.

Alex
  • 13,811
  • 1
  • 37
  • 50
  • I've tested quite a few scenarios (large strings, many replace points, etc) and the difference is consistent, but as i said before, performance cost will not be relevant in most of the cases. It still puzzles me though, because i look at code from methods like SPLIT, and they are optimized to avoid Pattern compile / regex processing when not necessary... – marcolopes Jun 10 '14 at 01:53
  • 1
    @marcolopes Please see my edited answer - your question is very different now that it includes a comparison to the split method. – Alex Jun 10 '14 at 15:06
  • Your answer is conclusive. I will accept it. I believe a change in the question title is in order. – marcolopes Jun 10 '14 at 19:11