8.4 The String Class
Handling character sequences is supported primarily by the String and StringBuilder classes. This section discusses the String class, which provides support for creating, initializing, and manipulating immutable character strings. The next section discusses the support for mutable strings provided by the StringBuilder class (p. 464).
Internal Representation of Strings
The following character encoding schemes are commonly used for encoding character data on computers:
- LATIN-1: Also known as ISO 8859-1. LATIN-1 is a fixed-length encoding that uses 1 byte to represent a character in the range 00 to FF—that is, characters that can be represented by 8 bits. This encoding suffices for most Western European languages.
- UTF-16: UTF-16 is a variable-length encoding that uses either 2 bytes or 4 bytes to represent a character in the range 0000 to 10FFFF. This encoding suffices for most languages in the world. However, the char type in Java only represents values in the range 0000 to FFFF (that is, values that can be represented by 2 bytes), so a character above FFFF is represented in a string by a pair of char values, called a surrogate pair.
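The difference between a 2-byte and a 4-byte UTF-16 character can be observed through the public String API. The following is a minimal sketch, not taken from the original text: the class name CodePointDemo and its two helper methods are hypothetical names chosen for illustration, while length(), codePointCount(), and Character.toChars() are standard library methods.

```java
public class CodePointDemo {

    // Number of char values (2-byte UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00E9 lies in the range 0000 to FFFF: one char value.
        String accented = "\u00E9";

        // U+1F600 lies above FFFF: it needs two char values (4 bytes in UTF-16).
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(accented) + " " + codePoints(accented)); // 1 1
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji));       // 2 1

        // The first of the two char values is a high surrogate.
        System.out.println(Character.isHighSurrogate(emoji.charAt(0)));       // true
    }
}
```

Note that length() counts char values, not characters, so the two counts differ for any string that contains a character above FFFF.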
Internally, the character sequence in a String object is stored in a byte array. If every character in the string fits in a single byte, all characters are encoded in the array with the LATIN-1 encoding scheme, using 1 byte per character. If any character in the sequence requires more than 1 byte, all characters are encoded in the array with the UTF-16 encoding scheme, using 2 bytes per char value. To keep track of which encoding is in use, the String class has a private final encoding-flag field named coder, which the string methods consult to interpret the bytes in the internal array correctly. The encoding to use is determined when the string is created. With this strategy, known as compact strings, no storage is wasted on strings that need only LATIN-1, as would be the case if all strings were encoded in UTF-16.
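Because the coder field is private, the encoding chosen for a particular string cannot be read directly. The decision described above can, however, be approximated from the outside. The following sketch assumes compact strings are enabled; the class CompactStringSketch and the helpers fitsLatin1() and estimatedBytes() are hypothetical names that mimic the rule, and they are not part of the String class.

```java
public class CompactStringSketch {

    // Hypothetical helper: true if every char value fits in 1 byte (00 to FF),
    // which is the condition under which LATIN-1 can be used.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: estimated size of the internal byte array,
    // assuming 1 byte per char for LATIN-1 and 2 bytes per char for UTF-16.
    static int estimatedBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String western = "caf\u00E9";        // all chars are at most FF
        String withEuro = "caf\u00E9 \u20AC"; // U+20AC (euro sign) is above FF

        System.out.println(fitsLatin1(western) + " " + estimatedBytes(western));   // true 4
        System.out.println(fitsLatin1(withEuro) + " " + estimatedBytes(withEuro)); // false 12
    }
}
```

In the second string, a single character outside the LATIN-1 range causes all six char values to be stored with 2 bytes each, which is why the estimate doubles rather than growing by one byte.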