So I already have a class that reads 8 bits from a file each time I call its read() method. Every character's corresponding decimal number is in the ASCII table. Now I have encountered the character 'É', whose ASCII code in binary is 11001001. And the result is correct when I call

System.out.println(Integer.toBinaryString('É'));

However, when I open the file in binary format, the actual bits are 11000011 10001001 00001010. I understand that 00001010 is a line feed. But 11000011 and 10001001 definitely don't match 11001001. I changed the file so that it contains only 'a', and now the file contains only 01100001 for a, which is correct. The character encoding is UTF-8. Here's my code to put each character and its frequency into a map:

while ((bit = readInputStream()) != -1) {
    if (!bitOccurrence.containsKey(bit))
        bitOccurrence.put(bit, 1);
    else
        bitOccurrence.put(bit, bitOccurrence.get(bit) + 1);
}

Here is the private readInputStream method:

private int readInputStream() throws IOException {
    InputStreamReader r = new InputStreamReader(i); // i is the InputStream
    return r.read();
}

So my question is: how does this problem occur, and what is the workaround if I can only read 8 bits at a time?
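
To illustrate the mismatch described above, here is a minimal sketch that prints the bits of the UTF-8 encoding next to the bits of the character's code. The class name Utf8BytesDemo and the helper utf8Bits are hypothetical names introduced only for this illustration; they are not part of the asker's code.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Hypothetical helper: returns the UTF-8 bytes of s as
    // space-separated, zero-padded 8-bit binary strings.
    static String utf8Bits(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes are not sign-extended.
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a width of 8.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '\u00C9' is 'É'; the escape avoids depending on the source file encoding.
        System.out.println(Integer.toBinaryString('\u00C9')); // 11001001 (the char value, U+00C9)
        System.out.println(utf8Bits("\u00C9\n"));             // 11000011 10001001 00001010
        System.out.println(utf8Bits("a"));                    // 01100001
    }
}
```

The first line is the numeric value of the Java char, while the second is what a UTF-8 file actually stores: characters above U+007F take two or more bytes, so reading 8 bits at a time yields pieces of a character rather than the character itself.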

edhu
  • What is the character encoding of the text file? Which tool do you use to edit it? Why do you read text as bytes in the first place, rather than reading it as characters, using a Reader configured with the appropriate character encoding? – JB Nizet Nov 26 '16 at 10:18
  • I believe it's UTF8 and I'm editing it in vim. I read it as bytes because I'm only provided with a modified InputStream class to read the file and it only has read bits method. – edhu Nov 26 '16 at 10:22
  • Use an InputStreamReader wrapping your InputStream. And it is indeed UTF8. – JB Nizet Nov 26 '16 at 10:23
  • This is a classical X for Y problem. What do you _really_ want to achieve? Reading bits from a stream cannot be your main goal. – Roland Illig Nov 26 '16 at 10:38
  • This is simply a way to read characters from a file and calculate the number of occurrences for each character. I know I can simply read characters instead of bits but I'm only provided the modified InputStream class. – edhu Nov 26 '16 at 10:41
  • But you still haven't explained what a "modified InputStream class" is. If it's an InputStream, just use new InputStreamReader(modifiedInputStreamInstance, StandardCharsets.UTF_8), and read characters from this reader. If it's not an InputStream, then post its code, or at least its javadoc or interface. – JB Nizet Nov 26 '16 at 10:50
  • @JBNizet Okay, I changed my code to use the original InputStream class from the Java standard library and then made an InputStreamReader, but the problem still exists. That character is still divided into 2 irrelevant characters. – edhu Nov 26 '16 at 11:13
  • And your code is? Post your code if you want us to explain why it doesn't work as expected. We're not extra-lucid wizards. – JB Nizet Nov 26 '16 at 11:16
  • You're not specifying the correct encoding. As I said in my previous comment: `new InputStreamReader(modifiedInputStreamInstance, StandardCharsets.UTF_8)`. Also, what you're reading is characters, not bits. Choose good variable names. And finally, create the Reader once and only once, not every time you read a character. – JB Nizet Nov 26 '16 at 12:13
  • There is no text but encoded text. É is not a member of ASCII. Forget ASCII. Learn a bit about [Unicode](http://www.unicode.org/faq/basic_q.html) and text in general; especially, the difference between bytes, code units, codepoints, combining codepoints, and graphemes. You'll soon see how [É](http://www.fileformat.info/info/unicode/char/00c9/index.htm) is encoded in UTF-8 as 11000011 10001001. You should also know that a Java char is a UTF-16 code unit and a string is a counted sequence of UTF-16 code units. – Tom Blodget Nov 26 '16 at 15:18
  • After that, you can review and revise your requirements to clearly identify what you are trying to count the occurrence of. – Tom Blodget Nov 26 '16 at 15:33
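
The advice in the comments above (wrap the stream in an InputStreamReader with an explicit UTF-8 charset, and create that reader only once) could be sketched roughly as follows. This is an illustration under stated assumptions, not the asker's actual code: the class CharFrequency and the methods count and countBytes are hypothetical names, and a ByteArrayInputStream stands in for the asker's unspecified "modified InputStream".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class CharFrequency {
    // Counts how often each char (UTF-16 code unit) occurs in a
    // UTF-8 encoded stream.
    static Map<Character, Integer> count(InputStream in) throws IOException {
        Map<Character, Integer> occurrences = new HashMap<>();
        // Create the reader once, with an explicit charset, and reuse it
        // for every read instead of constructing a new one per character.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                occurrences.merge((char) c, 1, Integer::sum);
            }
        }
        return occurrences;
    }

    // Convenience wrapper for in-memory data (an assumption made for
    // this demo); rethrows any IOException as an unchecked exception.
    static Map<Character, Integer> countBytes(byte[] data) {
        try {
            return count(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The three bytes from the question: 'É' in UTF-8, then a line feed.
        byte[] fileBytes = {(byte) 0xC3, (byte) 0x89, 0x0A};
        System.out.println(countBytes(fileBytes));
    }
}
```

With this approach the two bytes 11000011 10001001 are decoded into the single character É before counting. Note that the keys are UTF-16 code units, so a character outside the Basic Multilingual Plane would still be counted as two surrogate chars; counting code points instead would need a different loop.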

0 Answers