Until Java 6, we had a constant-time `substring` on `String`. In Java 7, why did they decide to copy the char array, degrading to linear time complexity, when something like `StringBuilder` was exactly meant for that?

- To avoid having a string with small length prevent garbage collection of an arbitrarily large `char[]`. – Mike Samuel Apr 20 '13 at 17:56
- Using `StringBuilder` should solve such a problem, shouldn't it? – anoopelias Apr 20 '13 at 18:00
- Using `StringBuilder` lets you work around the problem once you're aware that it exists. It doesn't fix memory leaks in existing code, though. This change fixes memory leaks in existing code and, since buffer copies are usually hardware-supported, ends up not costing linear time for any substring that fits within one virtual memory page. – Mike Samuel Apr 20 '13 at 19:03
5 Answers
Why they decided is discussed in Oracle bug #4513622, "(str) keeping a substring of a field prevents GC for object":
When you call String.substring as in the example, a new character array for storage is not allocated. It uses the character array of the original String. Thus, the character array backing the original String cannot be GC'd until the substring's references can also be GC'd. This is an intentional optimization to prevent excessive allocations when using substring in common scenarios. Unfortunately, the problematic code hits a case where the overhead of the original array is noticeable. It is difficult to optimize for both edge cases. Any optimization for space/size trade-offs is generally complex and can often be platform-specific.
There's also this note, observing that what was once an optimization had, according to tests, become a pessimization:
For a long time preparations and planning have been underway to remove the offset and count fields from java.lang.String. These two fields enable multiple String instances to share the same backing character buffer. Shared character buffers were an important optimization for old benchmarks but with current real world code and benchmarks it's actually better to not share backing buffers. Shared char array backing buffers only "win" with very heavy use of String.substring. The negatively impacted situations can include parsers and compilers however current testing shows that overall this change is beneficial.

If you have a long-lived small substring of a short-lived, large parent string, the large `char[]` backing the parent string will not be eligible for garbage collection until the small substring itself becomes unreachable. This means a substring can take up much more memory than people expect.
The only time the Java 6 way performed significantly better was when someone took a large substring from a large parent string, which is a very rare case.
Clearly they decided that the tiny performance cost of this change was outweighed by the hidden memory problems caused by the old way. The determining factor is that the problem was hidden, not that there is a workaround.
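The retention hazard described above, and its classic workaround, can be sketched like this (a minimal illustration; the built-up line here just stands in for a genuinely large parsed string):

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringRetention {
    public static void main(String[] args) {
        // Stand-in for a huge line read from a file (imagine megabytes).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("key1,payload-payload-payload;");
        }
        String hugeLine = sb.toString();

        // Pre-Java 7, this small string shared hugeLine's backing char[],
        // pinning the whole array in memory for as long as 'key' was reachable.
        String key = hugeLine.substring(0, 4);

        // The classic defensive copy forced a fresh, small char[]:
        String detachedKey = new String(hugeLine.substring(0, 4));

        List<String> keys = new ArrayList<>();
        keys.add(detachedKey);
        System.out.println(keys.get(0)); // prints "key1"
    }
}
```

In Java 7+ the copy constructor is redundant for this purpose, since `substring` itself now allocates a fresh array.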

- `trim()` takes a large substring from a large parent string and is used all the time. – Don Nov 25 '15 at 17:33
- Encountering degraded performance due to this (poor) design decision is a common occurrence, not a rare one. – WestCoastProjects Mar 31 '16 at 21:26
This will impact the complexity of data structures like suffix arrays by a fair margin. Java should provide some alternate method for getting a part of the original string without copying.
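One shape such an alternate method could take is a thin `CharSequence` view over the original string. `Slice` here is a hypothetical class, a sketch rather than anything the JDK provides:

```java
// A hypothetical zero-copy view over a String: O(1) to create, no array copy.
final class Slice implements CharSequence {
    private final String base;
    private final int from, to; // half-open range [from, to)

    Slice(String base, int from, int to) {
        if (from < 0 || to > base.length() || from > to)
            throw new IndexOutOfBoundsException();
        this.base = base;
        this.from = from;
        this.to = to;
    }

    @Override public int length() { return to - from; }
    @Override public char charAt(int i) { return base.charAt(from + i); }
    @Override public CharSequence subSequence(int s, int e) {
        return new Slice(base, from + s, from + e); // still no copy
    }
    @Override public String toString() { return base.substring(from, to); }
}

public class SliceDemo {
    public static void main(String[] args) {
        String text = "mississippi";
        // Suffix-array style use: each suffix is an O(1) view, no copying.
        CharSequence suffix = new Slice(text, 4, text.length());
        System.out.println(suffix); // prints "issippi"
    }
}
```

Creating a `Slice` copies nothing; the copy only happens if `toString()` is eventually called. For a suffix array, every suffix can be represented this way in constant time and space per suffix.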

It's just their crappy way of working around some JVM garbage-collection limitations.
Before Java 7, if we wanted to avoid the garbage-collection issue, we could always copy the substring instead of keeping a reference to it. It was just an extra call to the copy constructor:
String smallStr = new String(largeStr.substring(0, 2));
But now we can no longer have a constant-time substring. What a disaster.

- This is completely true. Many types of programs benefited from shared-substring usage. Compilers and parsers are a good illustration of the *type* of operations that are hurt most, but the damage extends well beyond those specific types of programs. – WestCoastProjects Mar 31 '16 at 21:23
- Is anyone aware of any 3rd-party lib/code with a custom implementation of CharSequence (or something similar) that replicates the "old" substring behavior? I often have to process large CSV-like files (500+ MB), and whenever I profile them I realize at least 10% of the processing time seems to be wasted in calls to Arrays.copyOfRange(). – Simon Berthiaume Feb 08 '18 at 02:56
- @SimonBerthiaume the performance mistake is to create `String` instances in the first place, which already incurs unnecessary copy operations even before calling `substring`. Since every `CharsetDecoder`, including those encapsulated in a `Reader`, operates on `CharBuffer`, that's your starting point. And it's already the solution, as it implements `CharSequence`, so you can pass it to tools like the regex pattern-matching engine, and it has copy-free `subSequence` and `slice` operations. You only need to create the final match-result strings. Even simple `java.util.Scanner` works that way. – Holger Mar 23 '20 at 11:00
- Now we could use `subSequence(startIndex: Int, endIndex: Int): CharSequence`. – SL5net Mar 16 '21 at 16:04
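Holger's `CharBuffer` suggestion can be sketched concretely: `CharBuffer` implements `CharSequence`, and its `subSequence` returns a view that shares the backing array instead of copying it:

```java
import java.nio.CharBuffer;

public class CharBufferView {
    public static void main(String[] args) {
        char[] data = "hello world".toCharArray();
        CharBuffer buf = CharBuffer.wrap(data);     // wraps, does not copy
        CharSequence view = buf.subSequence(6, 11); // a view of "world"

        System.out.println(view);                   // prints "world"

        // Mutating the underlying array is visible through the view,
        // demonstrating that no copy was made.
        data[6] = 'W';
        System.out.println(view.charAt(0));         // prints "W"
    }
}
```

Because the view aliases the array, changes to `data` show through it, which is exactly the sharing behavior `String.substring` gave up in Java 7.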
The main motivation, I believe, is the eventual "co-location" of `String` and its `char[]`. Right now they live at a distance from each other, which is a major penalty on cache lines. If every `String` owns its `char[]`, the JVM can merge them together, and reading will be much faster.
