InputStreams are for reading bytes; Readers are for reading characters. So you should use a Reader obtained from Files.newBufferedReader, or use a FileReader or InputStreamReader.
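To make the byte/character distinction concrete, here is a minimal sketch that reads the same UTF-8 data once through an InputStream and once through an InputStreamReader. The class name BytesVersusChars and its two helper methods are made up for illustration, and the explicit UTF-8 charset is an assumption about how the data is encoded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
final class BytesVersusChars {

    // Counts raw bytes by reading the data through an InputStream.
    static int countBytes(byte[] data) {
        int count = 0;
        try (InputStream in = new ByteArrayInputStream(data)) {
            while (in.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Counts UTF-16 chars by decoding the same data through a Reader.
    // Assumes the data is UTF-8 encoded.
    static int countChars(byte[] data) {
        int count = 0;
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            while (reader.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face), written as a surrogate pair escape.
        byte[] data = "\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(countBytes(data)); // prints 4 (UTF-8 bytes)
        System.out.println(countChars(data)); // prints 2 (one surrogate pair)
    }
}
```

Neither count is the number of characters a user would see, which is why the code point approach described next is useful.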
Although Java uses surrogate pairs inside a String to represent emojis and many other types of Unicode characters, you don’t need to deal with surrogate pairs directly. Surrogate values only exist because many character values are too large for a Java char type. If you read individual characters as int values (for example, with the CharSequence.codePoints method), you will get whole character values every time, and you will never see or have to deal with a surrogate value.
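As a small illustration of the difference, the sketch below prints the same string once as UTF-16 chars and once as code points. The class CodePointDemo and its helpers are hypothetical names chosen for this example; only String.chars, String.codePoints, and String.format are standard API.

```java
import java.util.stream.Collectors;

// Hypothetical demo class; the names here are illustrative only.
final class CodePointDemo {

    // Formats each UTF-16 char of the text, so surrogates show up individually.
    static String describeChars(String text) {
        return text.chars()
                .mapToObj(c -> String.format("U+%04X", c))
                .collect(Collectors.joining(" "));
    }

    // Formats each whole code point of the text; surrogate pairs are combined.
    static String describeCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The letter 'a' followed by U+1F600 written as a surrogate pair.
        String text = "a\uD83D\uDE00";
        System.out.println(describeChars(text));      // U+0061 U+D83D U+DE00
        System.out.println(describeCodePoints(text)); // U+0061 U+1F600
    }
}
```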
As of this writing, Unicode defines emojis as the Emoticons block (U+1F600 to U+1F64F), part of the Supplemental Symbols and Pictographs block (U+1F910 to U+1F92F), and three legacy characters in the Miscellaneous Symbols block (U+2639 to U+263B).
Thus, using a BufferedReader and traversing the character data with ints might look like this:
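import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.IntStream;

// filename and processEmoji(int) are assumed to be defined by the surrounding code.
try (BufferedReader reader =
        Files.newBufferedReader(Paths.get(filename), Charset.defaultCharset())) {
    // Flatten all lines into one stream of code points (ints, not chars).
    IntStream chars = reader.lines().flatMapToInt(String::codePoints);
    chars.forEachOrdered(c -> {
        if ((c >= 0x2639 && c <= 0x263b) ||    // three legacy Miscellaneous Symbols
            (c >= 0x1f600 && c < 0x1f650) ||   // Emoticons block
            (c >= 0x1f910 && c < 0x1f930)) {   // part of Supplemental Symbols and Pictographs
            processEmoji(c);
        }
    });
}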
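If the range test is needed in more than one place, it could be pulled into a named predicate. The following is only a sketch of one way to do that: EmojiFilter, isEmoji, and emojisIn are hypothetical names, and the Emoticons check is swapped for Character.UnicodeBlock.EMOTICONS (available since Java 7), while the other two ranges stay as explicit numbers taken from the code above.

```java
// Hypothetical helper class; the names here are illustrative only.
final class EmojiFilter {

    // Returns true for code points in the ranges listed in this answer.
    static boolean isEmoji(int codePoint) {
        // Emoticons block, U+1F600 to U+1F64F.
        if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.EMOTICONS) {
            return true;
        }
        return (codePoint >= 0x2639 && codePoint <= 0x263B)   // legacy Miscellaneous Symbols
            || (codePoint >= 0x1F910 && codePoint < 0x1F930); // part of Supplemental Symbols and Pictographs
    }

    // Collects the matching code points of a string, in order.
    static int[] emojisIn(String text) {
        return text.codePoints().filter(EmojiFilter::isEmoji).toArray();
    }

    public static void main(String[] args) {
        // Contains U+1F600 (as a surrogate pair) and U+263A.
        int[] found = emojisIn("hi \uD83D\uDE00 \u263A!");
        System.out.println(found.length); // prints 2
    }
}
```

With a predicate like this, the stream from the reader can be filtered with .filter(EmojiFilter::isEmoji) before calling forEachOrdered.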