Why do the String, StringBuffer and StringBuilder classes use a byte array instead of a character array to store the characters of a string?

Question

A byte cannot accommodate the Unicode code points of characters from the world's various languages, so with a byte array we could not have strings in different languages. Why do these classes use a byte array instead of a character array?

UPDATE:

class First
{
        public static void main(String[] args)
        {
                System.out.println();
                // The same text twice: once as Unicode escapes, once as literal characters
                String s = "\u0935\u0902\u0926\u0947 \u092e\u093e\u0924\u0930\u092e\u094d";
                String s1 = "वंदे मातरम्";
                System.out.println(s);
                System.out.println(s1);
        }
}

I think the strings above take two bytes for each character. How can they be accommodated in one byte?

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

The use of byte[] is an optimization introduced in Java 9. The goals and motivation of this change are described in JEP 254: Compact Strings.

Summary

Adopt a more space-efficient internal representation for strings.

Goals

Improve the space efficiency of the String class and related classes while maintaining performance in most scenarios and preserving full compatibility for all related Java and native interfaces.

Non-Goals

It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings. A subsequent JEP may explore that approach.

Motivation

The current implementation of the String class stores characters in a char array, using two bytes (sixteen bits) for each character. Data gathered from many different applications indicates that strings are a major component of heap usage and, moreover, that most String objects contain only Latin-1 characters. Such characters require only one byte of storage, hence half of the space in the internal char arrays of such String objects is going unused.

Description

We propose to change the internal representation of the String class from a UTF-16 char array to a byte array plus an encoding-flag field. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer will be updated to use the same representation, as will the HotSpot VM's intrinsic string operations.

This is purely an implementation change, with no changes to existing public interfaces. There are no plans to add any new public APIs or other interfaces.

The prototyping work done to date confirms the expected reduction in memory footprint, substantial reductions of GC activity, and minor performance regressions in some corner cases.
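The selection rule described in the JEP can be illustrated with a small sketch. This is not the JDK's code: the class name CompactSketch, the methods chooseCoder and encode, and the constant values are made up for illustration, and big-endian UTF-16 is chosen arbitrarily here to get a fixed two bytes per char.

```java
import java.nio.charset.StandardCharsets;

public class CompactSketch {
    // Hypothetical flag values, named after the two encodings in the JEP.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Returns LATIN1 when every char fits in one byte, otherwise UTF16.
    static byte chooseCoder(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // Encodes the whole string with the single encoding chosen above.
    static byte[] encode(String s) {
        return chooseCoder(s) == LATIN1
                ? s.getBytes(StandardCharsets.ISO_8859_1) // one byte per char
                : s.getBytes(StandardCharsets.UTF_16BE);  // two bytes per char, no BOM
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String hindi = "\u0935\u0902\u0926\u0947";
        System.out.println(encode(ascii).length); // 5
        System.out.println(encode(hindi).length); // 8
    }
}
```

The point of the sketch is that the choice is made per string, not per character: a single character above U+00FF makes the whole string use two bytes per char.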

score 3 · Answer 2 · answered Jan 10 '19 at 09:42

As an optimization, some virtual machine implementations (such as OpenJDK 9 and up) store strings that consist only of Latin-1-encodable characters (which includes all of ASCII) in byte arrays, which saves roughly half of the space compared to using a char[].

And since String is often used for technical data (as opposed to natural language), the majority of String values in most programs fit that description, even when the code handles a language that is not ASCII-encodable, such as Arabic or Japanese. HTML tags, logger IDs, debug output and similar things can always use these compressed strings.

Since there's no (official, supported) way to actually access the raw data and all access needs to go through the methods, this does not usually cause any compatibility issues.

score 2 · Answer 3 · answered Jan 10 '19 at 09:43

A byte cannot accommodate unicodes of characters from various languages of the world. So using a byte array we cannot have a string of different languages.

Neither can a char, as it is only 16 bits wide. You'd need an int for that, but an int per character feels too wasteful.
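A short sketch of that point, using only standard String and Character methods. The class name CodePointDemo and its two helper methods are made up for illustration. It compares the number of 16-bit char units with the number of Unicode code points for a character inside and outside the Basic Multilingual Plane.

```java
public class CodePointDemo {
    // Number of 16-bit char units the string occupies.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String devanagari = "\u0935";                          // U+0935, fits in one char
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600, needs a surrogate pair

        System.out.println(charUnits(devanagari) + " " + codePoints(devanagari)); // 1 1
        System.out.println(charUnits(emoji) + " " + codePoints(emoji));           // 2 1

        // codePointAt returns an int because a code point can exceed 16 bits.
        int cp = emoji.codePointAt(0);
        System.out.println(Integer.toHexString(cp)); // 1f600
    }
}
```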

Why these classes use byte array instead of character array ?

Because very few Strings are about words taken from a spoken language. They're almost all computer code, using exclusively ASCII characters, which can be coded in 7 bits. Using 16 bits or more per character feels very wasteful of memory. So instead they are coded in bytes: Latin-1 (one byte per character, which covers ASCII) if all characters fit in it, or UTF-16 if some characters don't. That saves memory when it can and stays good enough when it cannot.

Data about persons very often have national characters – Thorbjørn Ravn Andersen, Dec 27 '21 at 14:34

score 0 · Answer 4 · answered Jan 23 '19 at 15:26

Actually, the String class in Java 9 and higher can use either 1 byte or 2 bytes of the byte array for each character, depending on the contents of the string. There is a field in String.java:

private final byte coder;

This field determines the encoding used for the characters in the String (LATIN1 or UTF16).
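To show how such a flag can drive reads from the byte array, here is a toy class. It is only a sketch, not the real java.lang.String: the name TinyString, the method storedBytes, the constant values and the big-endian byte order are all assumptions made for illustration.

```java
public class TinyString {
    // Hypothetical flag values for the two encodings.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value; // encoded characters
    private final byte coder;   // which encoding value uses

    TinyString(String s) {
        boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
        coder = latin1 ? LATIN1 : UTF16;
        value = new byte[latin1 ? s.length() : s.length() * 2];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (latin1) {
                value[i] = (byte) c;            // one byte per char
            } else {
                value[2 * i] = (byte) (c >> 8); // high byte first (sketch's choice)
                value[2 * i + 1] = (byte) c;    // then low byte
            }
        }
    }

    // Length in chars, independent of how many bytes are stored.
    int length() {
        return coder == LATIN1 ? value.length : value.length / 2;
    }

    // Rebuilds the char at index from one or two bytes, depending on coder.
    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    // Bytes actually held in the backing array.
    int storedBytes() {
        return value.length;
    }

    public static void main(String[] args) {
        TinyString a = new TinyString("hello");
        TinyString h = new TinyString("\u0935\u0902\u0926\u0947");
        System.out.println(a.length() + " chars in " + a.storedBytes() + " bytes"); // 5 chars in 5 bytes
        System.out.println(h.length() + " chars in " + h.storedBytes() + " bytes"); // 4 chars in 8 bytes
        System.out.println(h.charAt(0) == '\u0935'); // true
    }
}
```

Callers only see length() and charAt(), so they cannot tell which encoding is in use. That is why the change could be made without touching the public API.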
