InputStream/byte[] is binary, Reader/String is text. The bridging InputStreamReader has an optional parameter for the encoding to be used in the conversion. The encoding defaults to the platform encoding.
InputStream in = new FileInputStream(myFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(in, encoding));
Testing a file for its encoding is an art in itself. Violations of the UTF-8 multibyte encoding can be detected. UTF-16LE and UTF-16BE can often be detected by 0x00 bytes at odd or even positions when the text contains ASCII. I combined finding the encoding with identifying the language, as detecting words with non-ASCII characters can help narrow down the encoding used. Using the 100 most frequent words of every language, plus the encodings in common use per language, already helps.
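A minimal sketch of the two heuristics mentioned above: a strict UTF-8 validity check via a CharsetDecoder configured to report malformed input, and a zero-byte count for UTF-16LE. The class name, method names, and the one-quarter threshold are my own illustrative choices, not a fixed algorithm:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingHeuristics {

    /** Returns true when the bytes decode as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Heuristic: mostly-ASCII UTF-16LE text has 0x00 at odd positions. */
    static boolean looksLikeUtf16le(byte[] bytes) {
        int zerosAtOdd = 0;
        for (int i = 1; i < bytes.length; i += 2) {
            if (bytes[i] == 0) zerosAtOdd++;
        }
        // Assumed threshold: more than a quarter of the bytes are such zeros.
        return bytes.length >= 2 && zerosAtOdd > bytes.length / 4;
    }

    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "héllo".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf16le = "hello".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(isValidUtf8(utf8));         // true
        System.out.println(isValidUtf8(latin1));       // false: lone 0xE9 is malformed
        System.out.println(looksLikeUtf16le(utf16le)); // true
    }
}
```

Note the asymmetry this exploits: most ISO-8859-1 text containing non-ASCII bytes is invalid as UTF-8, while valid UTF-8 is rarely produced by accident.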
You need to work on bytes: call getChannel() for a FileChannel and then use a ByteBuffer. Simply reading the first 4 KB will cause problems at the last bytes: a UTF-8 sequence could be truncated, or a UTF-16 pair split.
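One way to handle that truncation problem is to read a byte sample and then drop a possibly incomplete trailing UTF-8 sequence before decoding. The class name, method names, and 4 KB sample size below are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class SampleReader {

    /** Reads up to maxBytes from the start of the file. */
    static byte[] readSample(Path file, int maxBytes) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(maxBytes);
            channel.read(buffer);
            buffer.flip();
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        }
    }

    /** Drops a trailing, possibly truncated UTF-8 multibyte sequence. */
    static byte[] trimIncompleteUtf8Tail(byte[] bytes) {
        int end = bytes.length;
        // Walk back over continuation bytes (10xxxxxx) to the last lead byte.
        int i = end - 1;
        while (i >= 0 && (bytes[i] & 0xC0) == 0x80) i--;
        if (i < 0) return bytes;
        int lead = bytes[i] & 0xFF;
        // Expected sequence length from the lead byte's high bits.
        int expected = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        return (end - i) < expected ? Arrays.copyOfRange(bytes, 0, i) : bytes;
    }

    public static void main(String[] args) {
        byte[] full = "abcé".getBytes(StandardCharsets.UTF_8); // 5 bytes
        byte[] cut = Arrays.copyOfRange(full, 0, 4);           // splits the 2-byte é
        System.out.println(trimIncompleteUtf8Tail(cut).length); // prints 3
    }
}
```

For example, readSample(path, 4096) followed by trimIncompleteUtf8Tail gives a sample that is safe to feed to a strict UTF-8 decoder. For UTF-16 the analogous fix is to truncate the sample to an even length.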
There are Charset constants in StandardCharsets, but only for the standard charsets that are guaranteed to be available with every Java SE installation (StandardCharsets.UTF_8 and StandardCharsets.ISO_8859_1, for instance). That is not very useful in your case, but you can test which charsets are available:
Charset.availableCharsets()
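A short sketch of both ways to check: Charset.isSupported(name) for a single charset, and Charset.availableCharsets(), which returns a map from canonical names to Charset instances. The specific names queried are just examples:

```java
import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // Single lookup; typically true on desktop JDKs, but not guaranteed.
        System.out.println(Charset.isSupported("windows-1252"));

        // Enumerate all installed charsets whose canonical name starts with "UTF".
        Charset.availableCharsets().keySet().stream()
                .filter(name -> name.startsWith("UTF"))
                .forEach(System.out::println);
    }
}
```

Prefer Charset.isSupported for a single check; building the full map in availableCharsets() is comparatively expensive.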