Read a UTF8 file (created on notepad) and convert to CP850 string

Question

I'm trying to read a UTF-8 file and convert it to CP850 (to send to a printer device). My test string is "ATIVAÇÃO":

A    T    I    V    A    Ç         Ã         O
0x41 0x54 0x49 0x56 0x41 0xC3 0x87 0xC3 0x83 0x4F

My Java code:

private static void printBytes(String s, String st) {
    byte[] b_str = s.getBytes();
    System .out.print(String.format("%-7s >>> ", st));
    for (int i=0; i<s.length();i++)
        System.out.print(String.format("%-7s ", s.charAt(i)));
    System.out.println();

    System .out.print(String.format("%-7s >>> ", st));
    for (int i=0; i<b_str.length;i++)
        System.out.print(String.format("0x%-5x ", (int)b_str[i] & 0xff));
    System.out.println();
}

public static void main(String[] args) throws Exception {

    String F="file.txt";

    InputStreamReader input = new InputStreamReader(new FileInputStream(F));
    BufferedReader in = new BufferedReader(input);

    String strFILE;
    String strCP850;

    while ((strFILE = in.readLine()) != null) {

        strFILE = strFILE.substring(3);
        printBytes(strFILE, "ORI");
        strCP850 = new String(strFILE.getBytes(), "CP850");
        printBytes(strCP850, "CP850");
        System.exit(0);
    }

    in.close();

}

The output:

ORI     >>> A       T       I       V       A       Ã       ‡       Ã       ƒ       O       
ORI     >>> 0x41    0x54    0x49    0x56    0x41    0xc3    0x87    0xc3    0x83    0x4f    
CP850   >>> A       T       I       V       A       ?       ç       ?       â       O      
CP850   >>> 0x41    0x54    0x49    0x56    0x41    0x3f    0xe7    0x3f    0xe2    0x4f

I was expecting "Ç" to be 0xc7 and "Ã" to be 0xc3, but the conversion results in two-byte characters (like UTF-8).

What am I doing wrong?

Is there a way to do this on JDK 1.6?

Not that it solves anything but instead of `System .out.print(String.format(...))` you can use `System.out.format(...)` or `System.out.printf(...)` — Pshemo, Dec 22 '14 at 19:23

score 1 · Accepted Answer · answered Dec 22 '14 at 19:22

First of all: a String has no encoding. What matters, however, is that you specify the encoding when you read a file as text.

To read a file as UTF-8 and then dump it as cp850, you can do this (note that it relies on Java 7 APIs):

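import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final Path path = Paths.get("file.txt");

// Decode the file as UTF-8; the reader is closed automatically
try (
    final BufferedReader reader = Files.newBufferedReader(path,
        StandardCharsets.UTF_8);
) {
    String line;
    byte[] bytes;
    while ((line = reader.readLine()) != null) {
        // Encode the decoded text as cp850 bytes
        bytes = line.getBytes(Charset.forName("cp850"));
        // write this method
        dumpBytes(bytes);
    }
}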
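
For JDK 1.6, where `Files`, `StandardCharsets` and try-with-resources are not available, the same idea might look like the sketch below. The class name `Utf8ToCp850` and the helpers `toCp850` and `toHex` are hypothetical, introduced only for illustration. The sketch also assumes that Notepad wrote a byte order mark at the start of the file. Once the file is decoded as UTF-8, that mark is a single `\uFEFF` character rather than three, which is presumably what the `substring(3)` in the question was trying to skip.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class Utf8ToCp850 {

    // Encode already-decoded text as CP850 bytes.
    // A leading BOM character (assumed to come from Notepad) is dropped first.
    static byte[] toCp850(String line) {
        if (line.length() > 0 && line.charAt(0) == '\uFEFF') {
            line = line.substring(1);
        }
        return line.getBytes(Charset.forName("Cp850"));
    }

    // Format bytes as space-separated hex values, e.g. "0x41 0x54".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "file.txt";
        // Name the charset explicitly so the platform default is never used.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(toHex(toCp850(line)));
            }
        } finally {
            in.close();
        }
    }
}
```

For a printer, the `byte[]` returned by `toCp850` would be written to the device's `OutputStream` directly; wrapping it in another `String` would only reintroduce the problem from the question.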

answered Dec 22 '14 at 19:22

fge


That's correct, I was just writing the 'same' answer. – Michal Dec 22 '14 at 19:28
One might explain that the transformation to and from Java's own *String representation* from and to any encoding (a byte array) is possible, but that a byte array is - just a byte array. (Which appears to be the fallacy in OP's code.) – laune Dec 22 '14 at 19:31
can you do this on jdk 1.6? – fabriciols Dec 22 '14 at 19:32
@fge Internally, a String does have an encoding, like any string that is represented by bytes, shorts, words. You don't need to know it, but it is there ;-) – laune Dec 22 '14 at 19:36
@laune no it doesn't have one; the fact that a `char` is a UTF-16 code unit technically is just an artifact. A `String` could very well be a series of carrier pigeons for the difference that it would make. – fge Dec 22 '14 at 19:39
@fge Even the doves would be an encoding. ;-) – laune Dec 22 '14 at 19:44
@fge Note that "encoding" is either outward visible, referable - or not, but it is still there. I'm quibbling, of course. – laune Dec 22 '14 at 19:45
@fge In the same vein, a byte[] "doesn't have an encoding". It may have been created from a String using a certain encoding, but it isn't engraved in that byte[]. If you forget, all you have is - bytes. – laune Dec 22 '14 at 19:49
@fabriciols you don't understand; there is no such thing as a 1-to-1 char<->byte mapping. My guess is that you misunderstand the relationship between the two. – fge Dec 22 '14 at 20:04
@fge do you have some doc that explains how this works? In my head (after seeing the cp850 wiki) I'm expecting Ç to be 0xC7 and Ã 0xC3 (I mistyped 0xC7 for Ã in my last comment...) – fabriciols Dec 22 '14 at 20:11
@fabriciols CP 850 is a character coding; it defines a character<->byte mapping for whatever characters it supports. In the same manner, Unicode defines code points and character codings, for instance UTF-8; and Unicode supports many more code points than cp 850 supports characters. Also, cp 850 is a single-byte mapping, whereas UTF-8 is a multibyte mapping (from 1 to 4 bytes per code point). – fge Dec 22 '14 at 20:32
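
To make the character<->byte mapping discussed in these comments concrete, the sketch below encodes "ÇÃ" with three charsets and then reproduces the decoding mistake from the question. The class name `EncodingComparison` and its methods are hypothetical. The values 0xC7 and 0xC3 expected in the question match the ISO-8859-1 table; in CP850, Ç is 0x80 and Ã is 0xC7.

```java
import java.nio.charset.Charset;

class EncodingComparison {

    // Encode text with the named charset and format the bytes as hex.
    static String encodeHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    // The mistake from the question: new String(bytes, charset) DECODES,
    // so UTF-8 bytes get interpreted as if they were CP850 bytes.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));
        return new String(utf8, Charset.forName("Cp850"));
    }

    public static void main(String[] args) {
        String text = "\u00C7\u00C3"; // the two characters Ç and Ã

        System.out.println("UTF-8:      " + encodeHex(text, "UTF-8"));      // C3 87 C3 83
        System.out.println("ISO-8859-1: " + encodeHex(text, "ISO-8859-1")); // C7 C3
        System.out.println("Cp850:      " + encodeHex(text, "Cp850"));      // 80 C7

        // Four characters instead of two: box-drawing, ç, box-drawing, â
        System.out.println("Misdecoded length: " + misdecode(text).length()); // 4
    }
}
```

The misdecoded string is where the "ç" and "â" in the question's output come from: bytes 0x87 and 0x83 are ç and â in CP850, while 0xC3 is a box-drawing character that the console printed as "?".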
