I have a .rtf file. The file is in windows-1251 encoding.

I need to save the text it contains to another file in UTF-8 encoding, and the resulting file must be readable.

I have tried many variants, read the Javadocs and other sources, and spent two days searching for an answer, but I still can't convert it into a readable file.

Here is a file with that text, which you can download to run my tests.

This is a screenshot of the file's contents:

[image: contents of the file]

Here are my Java tests, which you can use to try converting the file.

These are short excerpts from my code:

@Test
public void windows1251toUtf8() throws IOException {
    //Prepare file
    File dir = new File("/tmp/TESTS/");
    if (!dir.exists() && !dir.mkdirs()) {
        throw new RuntimeException("Can't create destination dir");
    }
    File destination = new File(dir, "test.rtf");
    if (!destination.exists() && !destination.createNewFile()) {
        throw new RuntimeException("Can't create destination file");
    }

    //-----------------------------------------------------------------------------------------

    // Attempt 1: does not work
    {
        InputStream inputStream = getClass().getClassLoader().getResourceAsStream("utils/encoding/windows1521File.rtf");
        Scanner sc = new Scanner(inputStream, "WINDOWS-1251");
        StringJoiner stringBuilder = new StringJoiner("\n");
        while (sc.hasNextLine()) {
            stringBuilder.add(sc.nextLine());
        }

        String text = decode(stringBuilder.toString(), "WINDOWS-1251", "UTF-8");

        byte[] bytes = text.getBytes(Charset.forName("UTF-8"));

        Files.write(bytes, destination);
    }

    //-----------------------------------------------------------------------------------------

    // Attempt 2: does not work
    {
        URL resource = getClass().getClassLoader().getResource("utils/encoding/windows1521File.rtf");
        String string = FileUtils.readFileToString(new File(resource.getPath()), Charset.forName("WINDOWS-1251"));

        byte[] bytes = convertEncoding(string.getBytes(), "WINDOWS-1251", "UTF-8");

        FileUtils.writeByteArrayToFile(destination, bytes);
    }

    //-----------------------------------------------------------------------------------------

    // Attempt 3: does not work
    {
        InputStream inputStream = getClass().getClassLoader().getResourceAsStream("utils/encoding/windows1521File.rtf");

        byte[] bytes = IOUtils.toByteArray(inputStream);
        String s = new String(bytes);
        byte[] bytes2 = s.getBytes("WINDOWS-1251");

        FileUtils.writeByteArrayToFile(destination, bytes2);
    }
}

public static byte[] convertEncoding(byte[] bytes, String from, String to) throws UnsupportedEncodingException {
    return new String(bytes, from).getBytes(to);
}

public static String decode(String text, String textCharset, String resultCharset) {
    if (StringUtils.isEmpty(text)) {
        return text;
    }

    try {
        byte[] bytes = text.getBytes(textCharset);
        ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes);
        byte[] tmp = new byte[bytes.length];
        int n = inputStream.read(tmp);
        byte[] res = new byte[n];
        System.arraycopy(tmp, 0, res, 0, n);
        return new String(res, resultCharset);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

In all cases, the result looks something like this:

[image: garbled output]

Or like this:

[image: garbled output]

Is there any way to do this conversion?

  • The conversion you're trying to do looks like it's for plaintext files and not RTF files. According to https://en.wikipedia.org/wiki/Rich_Text_Format, it doesn't look like RTF supports UTF-8. It can encode unicode characters but you'd need to write them with its own escaping format. – fgb Apr 23 '21 at 16:45
  • A `String` is a sequence of Unicode characters. The `String decode(String text, String textCharset, String resultCharset)` method makes no sense at all. The two steps, 1) read into a `String` using the source encoding and 2) write the `String` using the target encoding, are enough to convert a file. Opening the file in a word processor will never work afterwards, as the RTF file contains a declaration that it is encoded in 1251 and the word processor will interpret the file as such. Which brings us to the question why you think you need to convert the file to a different encoding. – Holger Apr 23 '21 at 17:05
  • Your link seems to be dead… – JosefZ Apr 23 '21 at 18:48
  • @JosefZ only if you disallow the 50 JS scripts from 10 sub domains to execute. – AmigoJack Apr 23 '21 at 20:26
  • For all intents and purposes, you should consider RTF a binary format (although technically it isn't, not really), you can't just read it in one character set, and then try to write it out in another character set, because that is simply not how it works. – Mark Rotteveel Apr 24 '21 at 07:53
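
To make the comments above concrete, here is a minimal sketch, not a tested solution for the file in the question. `transcode` performs the two steps Holger describes (decode the bytes with the source charset, then encode the resulting `String` with the target charset), which is enough for plain text. `escapeForRtf` illustrates the escaping fgb mentions: as far as I understand the RTF format, a non-ASCII character is written as `\uN`, where N is a signed 16-bit decimal value, followed by a fallback character. The class and method names are hypothetical, and the sketch assumes the JDK in use ships the `windows-1251` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from the question.
final class EncodingSketch {

    // Step 1: decode the bytes with the source charset into a String.
    // Step 2: encode that String with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    // Writes text so that it can sit inside an RTF document as pure ASCII.
    // Assumption: non-ASCII UTF-16 code units become \uN (signed 16-bit
    // decimal) followed by '?' as the fallback character for old readers.
    static String escapeForRtf(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                // These three characters have special meaning in RTF.
                out.append('\\').append(c);
            } else if (c < 0x80) {
                out.append(c);
            } else {
                out.append("\\u").append((int) (short) c).append('?');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three Cyrillic letters encoded as windows-1251 bytes.
        byte[] cp1251 = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8};
        byte[] utf8 = transcode(cp1251, Charset.forName("windows-1251"), StandardCharsets.UTF_8);
        String text = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(utf8.length + " UTF-8 bytes");
        System.out.println(escapeForRtf(text));
    }
}
```

Note that `transcode` alone changes only the raw bytes. As the comments point out, an RTF file still declares its own code page, so a word processor may keep interpreting the converted file as 1251. Text that has to survive inside the RTF markup would need something like the escaping step instead.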

0 Answers