1

My problem is fairly simple:

new InputStreamReader(is, "UTF-8");

This makes β and ・ appear as question marks.

Which encoder should I use to see those characters correctly?

Andrey Ermakov
Charlie-Blake

3 Answers

5

You should use whichever encoding your input data is really in. We can't determine that for you, although if you provide the bytes that are meant to represent those characters, we may be able to suggest some possibilities.

While you can sometimes apply some heuristics to guess at an encoding, you really should know it based on where the data is coming from. In this case you haven't given us any hint whatsoever what your input is - if it's from a web response, you should look at the Content-Type header of the response. If it's from a file, it really depends on what produced that file.
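For the web-response case, here is a minimal sketch of pulling the charset out of a Content-Type header value. The class name `ContentTypeCharset` and the helper `charsetFromContentType` are hypothetical names made up for illustration, and the parsing is deliberately simple (it assumes a header shaped like `text/html; charset=ISO-8859-1`), so treat it as a starting point rather than a complete header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type header
    // value, or the given fallback if none is present or it is not supported.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8;
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1", fallback));
        System.out.println(charsetFromContentType("text/html", fallback));
    }
}
```

With a `URLConnection`, you could pass `connection.getContentType()` to this helper and hand the resulting `Charset` to `new InputStreamReader(connection.getInputStream(), charset)`. Which fallback is appropriate when the header names no charset is an assumption you have to make yourself.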

EDIT: Now that we know it is a web response, you don't have to go header-diving yourself, of course. You can use an HTTP client library which will download the data for you and decode it to a string itself.

Jon Skeet
  • Well, the data is coming from a wiki page from the internet, so I don't really know what encoding they are using. – Charlie-Blake Jul 04 '12 at 06:28
  • @santirivera92: As per my answer, look at the Content-Type header. Or use an HTTP client library which does this for you... – Jon Skeet Jul 04 '12 at 06:32
  • @santirivera92: if you are using `URLConnection`, you can get the Content-Type with `URLConnection.getContentType()` or `URLConnection.getHeaderField("Content-Type")` – ρяσѕρєя K Jul 04 '12 at 06:34
4

Taken from the Java 5.0 Charset documentation:

Charset     Description
US-ASCII    Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1  ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8       Eight-bit UCS Transformation Format
UTF-16BE    Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE    Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16      Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

So try each of these strings as the second parameter until the characters decode correctly.
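A rough sketch of that trial-and-error approach: decode the same bytes with each charset from the table and compare the results. The class `TryCharsets` and the method `decodeWithEach` are hypothetical names, and the sample bytes are an assumption (β and ・, taken to be U+03B2 and U+30FB, encoded as UTF-8); in practice you would pass in the raw bytes you actually read.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class TryCharsets {

    // The six charsets every Java platform is required to support.
    static final String[] NAMES = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    // Hypothetical helper: decodes the same bytes once per charset.
    // new String(bytes, charset) never throws on malformed input; it
    // substitutes the charset's replacement string instead.
    static Map<String, String> decodeWithEach(byte[] data) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : NAMES) {
            results.put(name, new String(data, Charset.forName(name)));
        }
        return results;
    }

    // Renders a string as hex code points so the output does not depend
    // on the console's own encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed sample input: beta and katakana middle dot, as UTF-8 bytes.
        byte[] sample = "\u03B2\u30FB".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<String, String> e : decodeWithEach(sample).entrySet()) {
            System.out.println(e.getKey() + " -> " + codePoints(e.getValue()));
        }
    }
}
```

Only the charset the bytes were actually written in gives back the original code points; the others produce different or replacement characters, which is why guessing is a fallback rather than a substitute for knowing the source encoding.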

Brad
0

Just adding to what the others said: once decoded, the text is held in Java as Unicode (UTF-16 internally), and that can represent any characters you have. The question here, however, is how you read it, and that depends on the encoding the input was written in, which apparently is not UTF-8.

Miquel