How can I get TTNTRichEdit unicode content in Delphi 7?

Question

How can I get/set the RTF content of a TTNTRichEdit in Unicode (UTF-8/UTF-16) format? I use the TRichEdit.LoadFromStream/SaveToStream methods with TStringStreams to get and set the RTF content, but the output uses only locale-dependent ANSI codes for non-ASCII characters (e.g. \'f5). I will be in trouble if a user moves their project to another computer with a different locale: the national characters will be lost. The SF_UNICODE flag of the EM_STREAMIN/EM_STREAMOUT messages can only be combined with SF_TEXT, not with SF_RTF.

This is my problem dear David! How can I get the Unicode contents. I wish the best to use unicode. But TRichEdit does not give it back to me (as I know). — The Bitman, Aug 11 '15 at 14:39
No, I mean stop using an ANSI control. Use a Unicode Delphi, or use Unicode controls in your non-Unicode Delphi. Fix the problem at source. Stop using ANSI `TRichEdit`. — David Heffernan, Aug 11 '15 at 14:41
I use TTNTRichEdit, which is Unicode compatible in principle. But its installed documentation does not mention anything about how I could get/set its RTF content in Unicode. Changing the compiler is not an option now. Could you recommend another component, maybe? — The Bitman, Aug 11 '15 at 14:50
Your question speaks of `TRichEdit` but now you speak of `TTNTRichEdit`. — Jerry Dodge, Aug 11 '15 at 14:52
Oh. You already use the TNT components. Then you are already in business. However, the question is pretty broken. I suggest that you start again. However, it seems that you are not very well versed in matters of Unicode. I recommend you pause and do some background reading. Read Marco's white paper on Unicode. I hope you follow this advice, although people seldom do! — David Heffernan, Aug 11 '15 at 14:53
Your question edit isn't much better. You want to read/write the content rather than get/set it. We don't know which version of TNT you are using. I don't think you are trying hard enough in your question asking. — David Heffernan, Aug 11 '15 at 15:00
OK, I've modified my question. My problem is the same: there is no TNT documentation in this regard. The Lines property is a good Unicode interface for plain text, but the RTF in/out streams still use ANSI. — The Bitman, Aug 11 '15 at 15:03
As I said, your question lacks effort. There are many versions of the TNT controls around. You need to find one that passes `SF_RTF` when sending the `EM_STREAMOUT` message. You don't need to combine with `SF_UNICODE`. That's for plain text. — David Heffernan, Aug 11 '15 at 15:07
I want to read/write it because of the header information (default values, currently the code page, etc.). Namely, I have to build an object tree which can show the formatted text. When the user double-clicks on it, the simplest way to set up the rich edit is to write back the RTF stream, and after the changes are done, to rebuild the object tree with my RTF parser. — The Bitman, Aug 11 '15 at 15:17
Oh never mind. I've now absolutely no idea what your problem is. As far as I can tell, you are not making enough effort to explain your problem. You are using a control with full support for Unicode. Please make it clear what your problem is. Please also take some time to make sure you understand how Unicode works. — David Heffernan, Aug 11 '15 at 15:20
Just relax. I have "some" practice with Unicode. I have no practice with underdocumented components :( I need Unicode RTF (not plain text) to marshal it into my own object tree (DOM). But it should be modifiable, so I have to reopen the dialog to modify it. I don't want to unmarshal the DOM, just write the stored RTF content back to the RichEdit. I thought it was the simplest way, but slowly... — The Bitman, Aug 11 '15 at 15:39
Did you consult the vendors of this component? That should be the first thing to do. — Jerry Dodge, Aug 11 '15 at 15:45
I wrote a (commercial) RTF to HTML converter which builds a DOM internally, however it does not provide any means to write it back to RTF. Maybe you can contact the author of [TRichView](http://www.trichview.com) and ask if they support what you need (RTF generation from modified DOM) — mjn, Aug 11 '15 at 17:22

David Heffernan · Accepted Answer · 2015-08-11T22:28:14.993

You have no problem. You are using a Unicode compliant component. You will not suffer data loss. From the Wikipedia article on RTF:

A standard RTF file can consist of only 7-bit ASCII characters, but can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and, starting with RTF 1.5, Unicode escapes. In a code page escape, two hexadecimal digits following a backslash and typewriter apostrophe are used for denoting a character taken from a Windows code page. For example, if the code page is set to Windows-1256, the sequence \'c8 will encode the Arabic letter bāʼ (ب).

For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.

You are observing a code page escape, and that's fine. That's what \'f5 is. The character is found in the document's code page, and hence a code page escape can be used. If you include characters outside the document's code page then the control will use a Unicode escape.
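
To make the Unicode escape form concrete, here is a minimal sketch in standard C++ of how UTF-16 code units map to `\uN?` sequences as described in the quoted passage. The function name `rtf_escape_utf16` is hypothetical, and the sketch assumes `\uc1` is in effect (one fallback character after each escape). It is not what the rich edit control does internally, only an illustration of the encoding rule.

```cpp
#include <string>

// Hypothetical helper: encode UTF-16 code units as RTF text.
// ASCII passes through, with RTF's three special characters escaped.
// Every other code unit becomes \uN? where N is the code unit written as a
// signed 16-bit decimal and '?' is the fallback for non-Unicode readers.
// Line breaks and other control characters would need \par or similar and
// are not handled here.
std::string rtf_escape_utf16(const std::u16string& text) {
    std::string out;
    for (char16_t unit : text) {
        if (unit == u'\\' || unit == u'{' || unit == u'}') {
            out += '\\';
            out += static_cast<char>(unit);
        } else if (unit < 0x80) {
            out += static_cast<char>(unit);
        } else {
            // Code units above 32767 are written as negative numbers.
            int value = unit;
            if (value > 32767) value -= 65536;
            out += "\\u" + std::to_string(value) + "?";
        }
    }
    return out;
}
```

Calling `rtf_escape_utf16(u"\u0628")` yields `\u1576?`, the same sequence as in the Wikipedia example. A character outside the Basic Multilingual Plane is emitted as two escapes, one per surrogate code unit.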

Thank you David! That's a fine explanation and some new information to me. But one thing is still unclear. Here is one RTF export of the component: '{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset238 BatangChe;}{\f1\fnil Arial;}}{\colortbl ;\red255\green0\blue0;}\viewkind4\uc1\pard\cf1\lang1038\f0\fs44\'f5\'fa\'fb\cf0\f1\fs40 \par }' The code page it declares is 1252, but my Windows uses 1250. And it contains the ANSI char code \'f5 with code page 1252. It is strange to me. Now I see. Its version is 1! — The Bitman, Aug 12 '15 at 10:42

score 0 · Answer 2 · answered Jun 25 '21 at 16:40

Solved (by necessity) using Borland C++ 6. Same code pattern applies for Borland Delphi. (NOTE: TTntRichEdit loads UTF-8 text as UTF-8 ONLY when it explicitly has the BOM header "\357\273\277" or [0xEF, 0xBB, 0xBF])

// This only works with BOM explicit files
// (it will fail on BOM-less UTF-8 files)
TTntRichEdit *myTntRichEdit = ...{some init code}...
myTntRichEdit->Lines->LoadFromFile(UTF8_filename);

So here is my working production code: (Note: TRESource declaration is TTntRichEdit *TRESource;)

void TFormMyExample::LoadJavascriptFromFile(AnsiString myFile) {
    // This method will load a UTF-8 text file (with or without BOM)

    // // // TRESource->Lines->LoadFromFile(myFile);

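    // Allocate before the try block so that __finally never deletes an
    // uninitialized pointer if construction or loading throws.
    TMemoryStream *JSMemoryStream = new TMemoryStream();
    TMemoryStream *JSBOM_MemoryStream = NULL;
    AnsiString BOM = "\357\273\277"; // [0xEF, 0xBB, 0xBF]

    try {
        JSMemoryStream->LoadFromFile(myFile);

        // check for BOM; Read (unlike ReadBuffer) does not throw when the
        // file is shorter than 3 bytes, and the zero-filled buffer then
        // simply fails the comparison
        char BOMHeader[4] = {0};
        JSMemoryStream->Seek(0, soFromBeginning);
        JSMemoryStream->Read(BOMHeader, 3);
        JSMemoryStream->Seek(0, soFromBeginning); // reset
        BOMHeader[3] = 0;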

        if (strcmp(BOM.c_str(), BOMHeader) == 0) {
            // We have BOM header, so load it.
            TRESource->Lines->LoadFromStream(JSMemoryStream);
        } else {
            // We need the BOM header, so add it.
            try {
                JSBOM_MemoryStream = new TMemoryStream;
                JSBOM_MemoryStream->Write(BOM.c_str(), BOM.Length());

                JSBOM_MemoryStream->Seek(0,soFromEnd);
                JSBOM_MemoryStream->CopyFrom(JSMemoryStream, 0);
                
                JSBOM_MemoryStream->Seek(0, soFromBeginning);
                TRESource->Lines->LoadFromStream(JSBOM_MemoryStream);
            }
            __finally
            {
                delete JSBOM_MemoryStream;
            }
        }

    }
    __finally
    {
        delete JSMemoryStream;
    }

}
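
For readers without the VCL at hand, the BOM check above can be sketched in portable C++ on a plain byte buffer. The names `kUtf8Bom`, `has_utf8_bom` and `with_utf8_bom` are made up for this illustration. The sketch assumes the file content has already been read into a `std::string`.

```cpp
#include <string>

// The three bytes of the UTF-8 byte order mark.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True when the buffer begins with the UTF-8 BOM. A buffer shorter than
// three bytes simply compares unequal, so no length check is needed.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Return the buffer unchanged if it already has a BOM, otherwise a copy
// with the BOM prepended (the same decision the method above makes).
std::string with_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes : kUtf8Bom + bytes;
}
```

The result of `with_utf8_bom` is what would then be handed to the control's loader. Applying it twice is harmless because the second call finds the BOM already present.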

When I write the processed file, it's done in this manner. (Note: TREProcessed declaration is TTntRichEdit *TREProcessed; also: AnsiString outputFileName;)

    ofstream SaveFile(outputFileName.c_str());
    TREProcessed->PlainText = true;
    SaveFile << "\357\273\277"; // Add UTF8 BOM [0xEF, 0xBB, 0xBF]

    for (int i = 0, max = TREProcessed->Lines->Count; i < max; i++) {
        SaveFile << UTF8Encode(TREProcessed->Lines->Strings[i]).c_str();
        if (i < max - 1) {
            SaveFile << UTF8Encode(_WS "\n").c_str();
        }
    }
    SaveFile.close();
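
The write side can be sketched the same way in portable C++. The function name `build_utf8_file_bytes` is hypothetical, and the sketch assumes the lines have already been converted to UTF-8 (the job `UTF8Encode` does above). It builds the exact bytes to write: BOM first, then the lines joined by a single `\n` with no trailing newline.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: BOM, then the UTF-8 lines separated by '\n'.
// No newline is appended after the last line, matching the loop above.
std::string build_utf8_file_bytes(const std::vector<std::string>& utf8_lines) {
    std::string out = "\xEF\xBB\xBF"; // UTF-8 BOM [0xEF, 0xBB, 0xBF]
    for (std::size_t i = 0; i < utf8_lines.size(); ++i) {
        out += utf8_lines[i];
        if (i + 1 < utf8_lines.size()) {
            out += '\n';
        }
    }
    return out;
}
```

One design point worth noting: an `ofstream` opened in text mode on Windows translates each `\n` into `\r\n` on output. If the exact bytes matter, open the stream with `std::ios::binary` before writing the result.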
