
I have a Cp1252 file that I want to read as binary.

ls -al from the terminal shows its size is 10 bytes.

This Java snippet, however, reports 18 bytes:

Path path = Paths.get(lfile);
SeekableByteChannel sbc = Files.newByteChannel(path, StandardOpenOption.READ);
long size = sbc.size();

The file contains 6 ASCII characters and 4 non-ASCII Cp1252 characters. My understanding is that 10 bytes is the correct size of this file on the file system. One more detail: when I read the content of the file using:

byte[] fileContents = Files.readAllBytes(path);

I get 18 bytes, because each Cp1252 character is loaded as 3 bytes. The file contains different Cp1252 characters, but the buffer shows them all as the same 3-byte sequence, which is certainly incorrect.

Two questions bother me:

  1. How many bytes does this file actually take up on the file system?

  2. Presuming that it is 10 bytes long, how do I read it "raw"?
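For reference, a minimal Java sketch of a raw read. It assumes a temporary file filled with the ten bytes from the hexdump further down as a stand-in for test.x10; the class and helper names are hypothetical. Files.size and Files.readAllBytes work on bytes only and apply no charset. Decoding is a separate, explicit step, and the sketch assumes the JDK ships the windows-1252 charset, which is common but not among the charsets every Java platform is required to support.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class RawReadSketch {
    // The ten bytes shown by hexdump in the question (a stand-in for test.x10).
    static final byte[] SAMPLE = {
        0x61, (byte) 0xF6, 0x61, (byte) 0xE4, 0x61,
        (byte) 0xFC, 0x61, (byte) 0xDF, 0x62, 0x62
    };

    // Writes SAMPLE to a temporary file and returns its path.
    static Path writeSample() {
        try {
            Path tmp = Files.createTempFile("test", ".x10");
            Files.write(tmp, SAMPLE);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size in bytes as reported by the file system.
    static long sizeOnDisk(Path path) {
        try {
            return Files.size(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file as raw bytes and maps each to 0..255, like fgetc in C.
    static int[] readUnsigned(Path path) {
        try {
            byte[] raw = Files.readAllBytes(path);
            int[] out = new int[raw.length];
            for (int i = 0; i < raw.length; i++) {
                out[i] = raw[i] & 0xFF; // Java bytes are signed
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = writeSample();
        try {
            System.out.println("size on disk: " + sizeOnDisk(path));
            System.out.println(Arrays.toString(readUnsigned(path)));
            // Decoding to text only happens when a charset is named explicitly.
            String text = new String(Files.readAllBytes(path), Charset.forName("windows-1252"));
            System.out.println("decoded length: " + text.length());
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

If a file that really holds those ten bytes is read this way, the reported size and the array length should both be 10. A different result suggests that the bytes at the path Java opens are not the bytes hexdump shows.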

Update: I tried the same using a small C program and the results are as expected: 10 characters are read from the file, and the 4 Cp1252 ones all have different values.

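#include <stdio.h>

int main(void) {
    char fileName[200] = "test.x10";
    FILE *fp = fopen(fileName, "r");
    if (fp == NULL) {
        perror(fileName);   /* report why the file could not be opened */
        return 1;
    }
    int c;
    /* fgetc returns EOF at end of file or on a read error */
    while ((c = fgetc(fp)) != EOF) {
        printf("%i ", c);
    }
    fclose(fp);
    return 0;
}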

Update 2:

test.x10 contains Cp1252 characters: aöaäaüaßbb

The C code given above prints out: 97 246 97 228 97 252 97 223 98 98

Files.readAllBytes reads: 97 239 191 189 97 239 191 189 97 239 191 189 97 239 191 189 98 98
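A possible explanation, offered as a hedged sketch and not a confirmed diagnosis: 239 191 189 (EF BF BD) is the UTF-8 encoding of U+FFFD, the Unicode replacement character. Files.readAllBytes does no charset decoding, so that sequence would have to be present in the bytes actually read, for instance if the file Java opened is a copy that was at some point decoded as UTF-8 and saved again. The class and method names below are hypothetical; the sketch only shows that such a round trip turns the 10 bytes from the hexdump below into the 18 values listed above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {
    // The ten Cp1252 bytes of test.x10 as shown by hexdump.
    static final byte[] ORIGINAL = {
        0x61, (byte) 0xF6, 0x61, (byte) 0xE4, 0x61,
        (byte) 0xFC, 0x61, (byte) 0xDF, 0x62, 0x62
    };

    // Decodes the bytes as UTF-8 (each invalid byte becomes U+FFFD),
    // then encodes the resulting string back to UTF-8.
    static byte[] roundTripThroughUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Maps signed Java bytes to 0..255 for printing.
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] mangled = roundTripThroughUtf8(ORIGINAL);
        // Prints 18 values: each of the four non-ASCII bytes becomes 239, 191, 189.
        System.out.println(Arrays.toString(toUnsigned(mangled)));
    }
}
```

The conversion is lossy: all four distinct bytes collapse into the same three-byte sequence, which would match the observation that the buffer shows them as identical.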

Here is the hexdump:

hexdump -C test.x10
00000000  61 f6 61 e4 61 fc 61 df  62 62                   |a.a.a.a.bb|
stuhpa
  • CP1252 is not a multibyte character set. It is a Windows variant of the Latin-1 or ISO-8859-1 character set. But if you want relevant answers, you should show the hexadecimal values of the 10 bytes of the file. – Serge Ballesta Mar 13 '18 at 14:48
  • Try "rb" (binary) instead of "r" in C (as indeed something is fishy). Try a full absolute path in java. – Joop Eggen Mar 14 '18 at 08:03
  • "rb" does not change a thing. C works as expected, which is confirmed also by the result from hexdump. Only Java is misbehaving. I use absolute path, I just simplified it in the code snippet. – stuhpa Mar 14 '18 at 10:48
