Until Java 6, we had a constant-time `substring` on `String`. In Java 7, why did they decide to copy the char array, degrading to linear time complexity, when something like `StringBuilder` was exactly meant for that?

- To avoid having a string with small length prevent garbage collection of an arbitrarily large `char[]`. – Mike Samuel Apr 20 '13 at 17:56
- Using `StringBuilder` should solve such a problem, shouldn't it? – anoopelias Apr 20 '13 at 18:00
- Using `StringBuilder` lets you work around the problem once you're aware that it exists. It doesn't fix memory leaks in existing code, though. This change fixes memory leaks in existing code and, since buffer copies are usually hardware-supported, ends up not costing linear time for any substring that fits within one virtual memory page. – Mike Samuel Apr 20 '13 at 19:03
5 Answers
Why they decided is discussed in Oracle bug #4513622, "(str) keeping a substring of a field prevents GC for object":
When you call String.substring as in the example, a new character array for storage is not allocated. It uses the character array of the original String. Thus, the character array backing the original String cannot be GC'd until the substring's references can also be GC'd. This is an intentional optimization to prevent excessive allocations when using substring in common scenarios. Unfortunately, the problematic code hits a case where the overhead of the original array is noticeable. It is difficult to optimize for both edge cases. Any optimization for space/size trade-offs is generally complex and can often be platform-specific.
There's also this note, observing that what was once an optimization had, according to tests, become a pessimization:
For a long time preparations and planning have been underway to remove the offset and count fields from java.lang.String. These two fields enable multiple String instances to share the same backing character buffer. Shared character buffers were an important optimization for old benchmarks but with current real world code and benchmarks it's actually better to not share backing buffers. Shared char array backing buffers only "win" with very heavy use of String.substring. The negatively impacted situations can include parsers and compilers however current testing shows that overall this change is beneficial.

If you have a long-lived small substring of a short-lived, large parent string, the large `char[]` backing the parent string will not be eligible for garbage collection until the small substring itself becomes unreachable. This means a substring can take up much more memory than people expect.
The only time the Java 6 way performed significantly better was when someone took a large substring from a large parent string, which is a very rare case.
Clearly they decided that the tiny performance cost of this change was outweighed by the hidden memory problems caused by the old way. The determining factor is that the problem was hidden, not that there is a workaround.
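The retention hazard described above, and its classic workaround, can be sketched like this (a minimal illustration; the built-up line here just stands in for a genuinely large parsed string):

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringRetention {
    public static void main(String[] args) {
        // Stand-in for a huge line read from a file (imagine megabytes).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("key1,payload-payload-payload;");
        }
        String hugeLine = sb.toString();

        // Pre-Java 7, this small string shared hugeLine's backing char[],
        // pinning the whole array in memory for as long as 'key' was reachable.
        String key = hugeLine.substring(0, 4);

        // The classic defensive copy forced a fresh, small char[]:
        String detachedKey = new String(hugeLine.substring(0, 4));

        List<String> keys = new ArrayList<>();
        keys.add(detachedKey);
        System.out.println(keys.get(0)); // prints "key1"
    }
}
```

In Java 7+ the copy constructor is redundant for this purpose, since `substring` itself now allocates a fresh array.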

- `trim()` takes a large substring from a large parent string and is used all the time. – Don Nov 25 '15 at 17:33
- Encountering degraded performance due to this (poor) design decision is a common occurrence, not a rare one. – WestCoastProjects Mar 31 '16 at 21:26
This will impact the complexity of data structures like suffix arrays by a fair margin. Java should provide some alternate method for getting a part of the original string without copying.
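One shape such an alternate method could take is a thin `CharSequence` view over the original string. `Slice` here is a hypothetical class, a sketch rather than anything the JDK provides:

```java
// A hypothetical zero-copy view over a String: O(1) to create, no array copy.
final class Slice implements CharSequence {
    private final String base;
    private final int from, to; // half-open range [from, to)

    Slice(String base, int from, int to) {
        if (from < 0 || to > base.length() || from > to)
            throw new IndexOutOfBoundsException();
        this.base = base;
        this.from = from;
        this.to = to;
    }

    @Override public int length() { return to - from; }
    @Override public char charAt(int i) { return base.charAt(from + i); }
    @Override public CharSequence subSequence(int s, int e) {
        return new Slice(base, from + s, from + e); // still no copy
    }
    @Override public String toString() { return base.substring(from, to); }
}

public class SliceDemo {
    public static void main(String[] args) {
        String text = "mississippi";
        // Suffix-array style use: each suffix is an O(1) view, no copying.
        CharSequence suffix = new Slice(text, 4, text.length());
        System.out.println(suffix); // prints "issippi"
    }
}
```

Creating a `Slice` copies nothing; the copy only happens if `toString()` is eventually called. For a suffix array, every suffix can be represented this way in constant time and space per suffix.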

It's just their crappy way of working around some JVM garbage-collection limitations.
Before Java 7, if we wanted to avoid the garbage-collection issue, we could always copy the substring instead of keeping a reference to it. It was just an extra call to the copy constructor:
String smallStr = new String(largeStr.substring(0, 2));
But now we can no longer have a constant-time substring. What a disaster.

- This is completely true. Many types of programs benefited from shared-substring usage. Compilers and parsers are a good illustration of the *type* of operations that are hurt most, but the damage extends well beyond those specific types of programs. – WestCoastProjects Mar 31 '16 at 21:23
- Is anyone aware of any 3rd-party lib/code with a custom implementation of CharSequence (or something similar) that replicates the "old" substring behavior? I often have to process large CSV-like files (500+ MB), and whenever I profile them I realize at least 10% of the processing time seems to be wasted in calls to Arrays.copyOfRange(). – Simon Berthiaume Feb 08 '18 at 02:56
- @SimonBerthiaume the performance mistake is to create `String` instances in the first place, which already incurs unnecessary copy operations even before calling `substring`. Since every `CharsetDecoder`, including those encapsulated in a `Reader`, operates on `CharBuffer`, that's your starting point. And it's already the solution, as it implements `CharSequence`, so you can pass it to tools like the regex pattern-matching engine, and it has copy-free `subSequence` and `slice` operations. You only need to create the final match-result strings. Even simple `java.util.Scanner` works that way. – Holger Mar 23 '20 at 11:00
- Now we could use `subSequence(startIndex: Int, endIndex: Int): CharSequence`. – SL5net Mar 16 '21 at 16:04
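Holger's `CharBuffer` suggestion can be sketched concretely: `CharBuffer` implements `CharSequence`, and its `subSequence` returns a view that shares the backing array instead of copying it:

```java
import java.nio.CharBuffer;

public class CharBufferView {
    public static void main(String[] args) {
        char[] data = "hello world".toCharArray();
        CharBuffer buf = CharBuffer.wrap(data);     // wraps, does not copy
        CharSequence view = buf.subSequence(6, 11); // a view of "world"

        System.out.println(view);                   // prints "world"

        // Mutating the underlying array is visible through the view,
        // demonstrating that no copy was made.
        data[6] = 'W';
        System.out.println(view.charAt(0));         // prints "W"
    }
}
```

Because the view aliases the array, changes to `data` show through it, which is exactly the sharing behavior `String.substring` gave up in Java 7.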
The main motivation, I believe, is the eventual "co-location" of `String` and its `char[]`. Right now they live at a distance from each other, which is a major penalty on cache lines. If every `String` owns its `char[]`, the JVM can merge them together, and reading will be much faster.
