I obviously am missing something here. I have a web app where the input for a form may be in English or, after a keyboard switch, Russian. The meta tag for the page is specifying that the page is UTF-8. That does not seem to matter.
If I type in "вв", two of the unicode character: CYRILLIC SMALL LETTER VE
What do I get? A string. I call getCodePoints().toArray() and I get:
[208, 178, 208, 178]
If I call chars().toArray[], I get the same.
What the heck?
I am completely in control of the web page, but of course there will be different browsers. But how can I get something back from the web page that will let me get the proper cyrillic characters?
This is on java 1.8.0_312. I can upgrade some, but not all the way to the latest java.
The page is this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Cards</title>
<link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity = "sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin = "anonymous" />
<link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap-theme.min.css" integrity = "sha384-rHyoN1iRsVXV4nD0JutlnGaslCJuC7uwjduW9SVrLvRYooPp2bWYgmgJQIXwl/Sp" crossorigin = "anonymous" />
<script src = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity = "sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin = "anonymous">
</script>
<meta http-equiv = "Content-Type" content = "text/html; charset=UTF-8" />
<style>.table-nonfluid { width: auto !important; }</style>
</head>
<body>
<div style = "padding: 25px 25px 25px 25px;">
<h2 align = "center">Cards</h2>
<div style = "white-space: nowrap;">
<a href="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.1">Home</a>
<div>
<form name="f_3_1" method="post" action="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.3.1">
<table class = "table" border = "1" style = "max-width: 50%; font-size: 300%; text-align: center;">
<tr>
<td>to go</td>
</tr>
<tr>
<td><input size="25" type="text" name="3.1.5.3.3" /></td>
</tr>
<td>
<input type="submit" value="Submit" name="3.1.5.3.5" /> <a href="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.3.1.5.3.7">Skip</a>
</td>
</table>
<input type="hidden" name="wosid" value="ee67KCNaHEiW1WdpdA8JIM" />
</form>
</div>
</div>
</div>
</body>
</html>
Hm. Well, here is at least part of the story.
I have this code:
System.out.println("start: " + start);
int[] points = start.chars().toArray();
byte[] next = new byte[points.length];
int idx = 0;
System.out.print("fixed: ");
for (int p : points) {
next[idx] = (byte)(p & 0xff);
System.out.print(Integer.toHexString(next[idx]) + " ");
idx++;
}
System.out.println("");
The output is:
start: вв
fixed: ffffffd0 ffffffb2 ffffffd0 ffffffb2
And the UTF-8 value for "В", in hex, is d0b2.
So, there it is. The question is, why is this not more easily accessible? Do I really have to put this together byte-pair by byte-pair?
If the string is already in UTF-8, as I think we can see it is, why does the codePoints() method not give us, you know, the codePoints?
Ok, so now I do:
new String(next, StandardCharsets.UTF_8);
and I get the proper string. But it still seems strange that codePoints() gives me an IntStream, but if you use these things as int values, it is broken.