
I obviously am missing something here. I have a web app where the input for a form may be in English or, after a keyboard switch, Russian. The meta tag for the page is specifying that the page is UTF-8. That does not seem to matter.

If I type in "вв", which is two of the Unicode character CYRILLIC SMALL LETTER VE (U+0432), what do I get?

A string. I call codePoints().toArray() and I get:

 [208, 178, 208, 178]

If I call chars().toArray(), I get the same.

What the heck?

I am completely in control of the web page, but of course there will be different browsers. How can I get something back from the web page that will give me the proper Cyrillic characters?

This is on Java 1.8.0_312. I can upgrade some, but not all the way to the latest Java.

The page is this:

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
 <html>
   <head>
     <title>Cards</title>
     <link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity = "sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin = "anonymous" />
     <link rel = "stylesheet" href = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap-theme.min.css" integrity = "sha384-rHyoN1iRsVXV4nD0JutlnGaslCJuC7uwjduW9SVrLvRYooPp2bWYgmgJQIXwl/Sp" crossorigin = "anonymous" />
     <script src = "https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity = "sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin = "anonymous">
     </script>
     <meta http-equiv = "Content-Type" content = "text/html; charset=UTF-8" />
     <style>.table-nonfluid { width: auto !important; }</style>
   </head>
   <body>
     <div style = "padding: 25px 25px 25px 25px;">
       <h2 align = "center">Cards</h2>
       <div style = "white-space: nowrap;">
         <a href="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.1">Home</a>
         <div>
   <form name="f_3_1" method="post" action="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.3.1">
     <table class = "table" border = "1" style = "max-width: 50%; font-size: 300%; text-align: center;">
           <tr>
             <td>to go</td>
           </tr>
           <tr>
             <td><input size="25" type="text" name="3.1.5.3.3" /></td>
           </tr>
           <td>
             <input type="submit" value="Submit" name="3.1.5.3.5" />&nbsp;&nbsp;<a href="/cgi-bin/WebObjects/app.woa/wo/ee67KCNaHEiW1WdpdA8JIM/2.3.1.5.3.7">Skip</a>
           </td>
     </table>
   <input type="hidden" name="wosid" value="ee67KCNaHEiW1WdpdA8JIM" />
 </form>
 </div>
       </div>
     </div>
   </body>
 </html>

Hm. Well, here is at least part of the story.

I have this code:

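    System.out.println("start: " + start);
    // Each char of the garbled string holds one UTF-8 byte in its low 8 bits.
    int[] points = start.chars().toArray();
    byte[] next = new byte[points.length];
    int idx = 0;
    System.out.print("fixed: ");
    for (int p : points) {
        next[idx] = (byte)(p & 0xff);
        // toHexString widens the byte to int, so negative bytes print with a leading "ffffff".
        System.out.print(Integer.toHexString(next[idx]) + " ");
        idx++;
    }
    System.out.println("");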

The output is:

 start: вв
 fixed: ffffffd0 ffffffb2 ffffffd0 ffffffb2 

And the UTF-8 encoding of "в" (U+0432), in hex, is d0 b2.

So, there it is. The question is, why is this not more easily accessible? Do I really have to put this together byte-pair by byte-pair?

If the string is already in UTF-8, as I think we can see it is, why does the codePoints() method not give us, you know, the codePoints?

Ok, so now I do:

 new String(next, StandardCharsets.UTF_8);

and I get the proper string. But it still seems strange that codePoints() gives me an IntStream, but if you use these things as int values, it is broken.
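
For reference, the same repair can be written without the manual loop. This is only a sketch, and it assumes the bytes were mis-decoded as ISO-8859-1 somewhere upstream (if it was windows-1252 or something else, some byte values will not survive the round trip). The class and method names are made up for illustration. The point it shows is that codePoints() is not broken: it reports the code points of the chars actually stored in the String, and here those chars were already the wrong ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MojibakeRepair {

    // Hypothetical helper: undo a mistaken ISO-8859-1 decode of UTF-8 bytes.
    static String repair(String garbled) {
        // ISO-8859-1 maps chars 0..255 one-to-one onto bytes, so this
        // recovers the original bytes. Chars above 255 would become '?'.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the assumed upstream mistake: UTF-8 bytes read as ISO-8859-1.
        byte[] wire = "\u0432\u0432".getBytes(StandardCharsets.UTF_8);
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);

        // Four chars, one per UTF-8 byte: prints [208, 178, 208, 178]
        System.out.println(Arrays.toString(garbled.codePoints().toArray()));

        // Two chars, both U+0432: prints [1074, 1074]
        String fixed = repair(garbled);
        System.out.println(Arrays.toString(fixed.codePoints().toArray()));
    }
}
```

This is a workaround, not a fix. If the value had been decoded as UTF-8 in the first place, no repair step would be needed.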

Ray Kiddy
  • I don't think it is a java or a backend issue. That being said you didn't provide much information - please show the form used, html code, js code that sends it as json and maybe does something to it, etc – J Asgarov Jan 10 '22 at 21:11
  • There is no JS that I am including. I am using Bootstrap for the UI. Adding the page HTML above. – Ray Kiddy Jan 10 '22 at 21:47
  • well then please show the html form and the part of code that gets it as controller. The information you have provided so far is not enough – J Asgarov Jan 10 '22 at 21:49
  • Whatever code or framework assigned a value to `start` is the problem. It should be a String with a length of 2. Not four UTF-8 bytes which each are turned into a char value. You should not have to do any manipulation of the values at all. – VGR Jan 11 '22 at 03:33

1 Answer


It was a problem with the frameworks I was using. I thought I was setting the request and response content type to UTF-8, but I was not.
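
To illustrate why the request encoding matters, here is a minimal sketch using only the JDK. It assumes the browser posts the field as percent-encoded UTF-8 bytes, which is what a page declared as UTF-8 normally produces. The `decodeField` method is a hypothetical stand-in for whatever decoding step the framework performs, not the framework's actual API. The two-argument `URLDecoder.decode` is used because it is available on Java 8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodingDemo {

    // Hypothetical stand-in for the framework's form-value decoding step.
    static String decodeField(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumed wire form of "вв" in a form POST from a UTF-8 page.
        String onTheWire = "%D0%B2%D0%B2";

        String wrong = decodeField(onTheWire, "ISO-8859-1");
        String right = decodeField(onTheWire, "UTF-8");

        System.out.println(wrong.length());               // 4
        System.out.println(right.length());               // 2
        System.out.println(right.equals("\u0432\u0432")); // true
    }
}
```

With the wrong charset, each UTF-8 byte becomes its own char, which matches the [208, 178, 208, 178] seen in the question. With the right one, the String has length 2 and nothing needs to be repaired afterwards.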

Ray Kiddy