8

I searched the web a lot and didn't find a C++ function that replaces XML special characters with their escape sequences. Is there something like this?

I know about the following:

Special Character   Escape Sequence   Purpose
&                   &amp;             Ampersand sign
'                   &apos;            Single quote
"                   &quot;            Double quote
>                   &gt;              Greater than
<                   &lt;              Less than

Are there more? What about writing a hexadecimal value like 0x00; is this also a problem?

stefan bachert
Dor Cohen
  • Why doing it yourself? 5 string replaces for example – stefan bachert Mar 28 '12 at 08:53
  • @stefanbachert First, I know there are more special characters, like foreign languages and currency signs. Second, what about preventing double encoding? I don't want to double encode &. And why reinvent the wheel? Maybe someone has thought about things I'm not familiar with. – Dor Cohen Mar 28 '12 at 08:56
  • 2
    The above 5 default special entities are defined by XML itself. Other entities may be defined by the doctype or schema. In the end, anyone can define entities, so you won't find a standard function for that. – stefan bachert Mar 28 '12 at 08:59
  • @stefanbachert Why wouldn't there be a standard function that you can feed a list of entity names? The official HTML list is well-defined, by the way. [Part 1](http://www.w3.org/TR/html4/HTMLlat1.ent), [part 2](http://www.w3.org/TR/html4/HTMLsymbol.ent) and [part 3](http://www.w3.org/TR/html4/HTMLspecial.ent). – Mr Lister Mar 28 '12 at 09:11
  • 1
    @DorCohen I just noticed you want to put 0x00 in a xml file. You can't, period. Choose another way of storing your data. – Mr Lister Mar 28 '12 at 09:26
  • @MrLister - can't you use `` ? – Ferruccio May 11 '12 at 13:59
  • @Ferruccio No, XML is a way to store text. If you want to store other things than text, like a byte with value zero, you'll need to use another format. – Mr Lister May 11 '12 at 14:02

6 Answers

11

Writing your own is easy enough, but scanning the string multiple times to search/replace individual characters can be inefficient:

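#include <sstream>
#include <string>

// Escapes the five predefined XML entities in a single pass over src.
std::string escape(const std::string& src) {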
    std::stringstream dst;
    for (char ch : src) {
        switch (ch) {
            case '&': dst << "&amp;"; break;
            case '\'': dst << "&apos;"; break;
            case '"': dst << "&quot;"; break;
            case '<': dst << "&lt;"; break;
            case '>': dst << "&gt;"; break;
            default: dst << ch; break;
        }
    }
    return dst.str();
}

Note: I used a C++11 range-based for loop for convenience, but you can easily do the same thing with an iterator.
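
For anyone stuck on an older compiler, the iterator variant mentioned in the note could look roughly like this. It is only a sketch: the name escape_with_iterators is made up for this example, and it appends to a std::string instead of using a stringstream, which is a design choice of mine rather than part of the answer above.

```cpp
#include <string>

// Hypothetical name; same five replacements as escape() above,
// written with explicit iterators so it also compiles as C++03.
std::string escape_with_iterators(const std::string& src) {
    std::string dst;
    dst.reserve(src.size());  // lower bound only; dst grows if anything is escaped
    for (std::string::const_iterator it = src.begin(); it != src.end(); ++it) {
        switch (*it) {
            case '&':  dst += "&amp;";  break;
            case '\'': dst += "&apos;"; break;
            case '"':  dst += "&quot;"; break;
            case '<':  dst += "&lt;";   break;
            case '>':  dst += "&gt;";   break;
            default:   dst += *it;      break;
        }
    }
    return dst;
}
```

The input is still scanned exactly once, so the efficiency argument made above is unchanged.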

Ferruccio
8

These types of functions should be standard, and we should never have to rewrite them. If you are using Visual Studio, have a look at atlenc.h, which is part of the VS installation. It contains a function called EscapeXML, which is much more complete than any of the examples above.

Ah Poil
6

As has been stated, it would be possible to write your own. For example:

#include <iostream>
#include <string>
#include <map>

int main()
{
    std::string xml("a < > & ' \" string");
    std::cout << xml << "\n";

    // Characters to be transformed.
    //
    std::map<char, std::string> transformations;
    transformations['&']  = std::string("&amp;");
    transformations['\''] = std::string("&apos;");
    transformations['"']  = std::string("&quot;");
    transformations['>']  = std::string("&gt;");
    transformations['<']  = std::string("&lt;");

    // Build list of characters to be searched for.
    //
    std::string reserved_chars;
    for (auto ti = transformations.begin(); ti != transformations.end(); ti++)
    {
        reserved_chars += ti->first;
    }

    size_t pos = 0;
    while (std::string::npos != (pos = xml.find_first_of(reserved_chars, pos)))
    {
        xml.replace(pos, 1, transformations[xml[pos]]);
        pos++;
    }

    std::cout << xml << "\n";

    return 0;
}

Output:

a < > & ' " string
a &lt; &gt; &amp; &apos; &quot; string

Add an entry into transformations to introduce new transformations.

hmjd
2

There is a function; I just wrote it:

#include <string>

// Replaces every occurrence of old in str with repl, in place.
void replace_all(std::string& str, const std::string& old, const std::string& repl) {
    size_t pos = 0;
    while ((pos = str.find(old, pos)) != std::string::npos) {
        str.replace(pos, old.length(), repl);
        pos += repl.length();
    }
}

std::string escape_xml(std::string str) {
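    // '&' must be handled first; otherwise the '&' introduced by the
    // other replacements would itself be escaped again.
    replace_all(str, std::string("&"), std::string("&amp;"));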
    replace_all(str, std::string("'"), std::string("&apos;"));
    replace_all(str, std::string("\""), std::string("&quot;"));
    replace_all(str, std::string(">"), std::string("&gt;"));
    replace_all(str, std::string("<"), std::string("&lt;"));

    return str;
}
orlp
1

I slightly modified Ferruccio's solution to also eliminate the other characters that get in the way, such as anything below 0x20 (found somewhere on the Internet). Tested and working.

// Requires Boost.Regex: regex, regex_replace and boost::format_all come from there.
void strip_tags(string* s) {
    // Remove anything that looks like a tag. [^>]* keeps each match inside
    // a single tag; a greedy .* would swallow everything between the first
    // '<' and the last '>'.
    regex kj("</?[^>]*>");
    *s = regex_replace(*s, kj, "", boost::format_all);

    // Note that each replacement is followed by a space.
    std::map<char, std::string> transformations;
    transformations['&']  = std::string("&amp; ");
    transformations['\''] = std::string("&apos; ");
    transformations['"']  = std::string("&quot; ");
    transformations['>']  = std::string("&gt; ");
    transformations['<']  = std::string("&lt; ");

    // Build list of characters to be searched for.
    std::string reserved_chars;
    for (std::map<char, std::string>::iterator ti = transformations.begin(); ti != transformations.end(); ti++)
    {
        reserved_chars += ti->first;
    }

    size_t pos = 0;
    while (std::string::npos != (pos = s->find_first_of(reserved_chars, pos)))
    {
        s->replace(pos, 1, transformations[(*s)[pos]]);
        pos++;
    }
}


string removeTroublesomeCharacters(string inString)
{
    if (inString.empty()) return "";

    string newString;

    for (size_t i = 0; i < inString.length(); i++)
    {
        // Compare as unsigned so the result does not depend on whether
        // plain char is signed on this platform.
        const unsigned char ch = static_cast<unsigned char>(inString[i]);
        // Keep bytes 0x20..0xFC plus tab, line feed and carriage return;
        // drop all other control characters.
        if ((ch < 0xFD && ch > 0x1F) || ch == '\t' || ch == '\n' || ch == '\r')
        {
            newString.push_back(inString[i]);
        }
    }
    return newString;
}

So in this case, there are two functions. We can get the result with something like:

string StartingString("Some_value");
strip_tags(&StartingString);  // modifies StartingString in place and returns nothing
string FinalString = removeTroublesomeCharacters(StartingString);

Hope it helps!

(Oh yeah: credit for the other function goes to the author of the answer here: How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data? )

Tex
0

It appears that you want to generate XML yourself. I think you'll need to be a lot clearer, and read up on the XML specification, if you want to be successful. Those are the only XML special characters. You say "I know there are more special characters, like foreign languages and currency signs", but these are not defined in XML, unless you mean encoding them as code points (&#163; for example). Are you thinking of HTML, or some other DTD?
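
To illustrate the code point encoding mentioned above, here is a minimal sketch. The function name numeric_char_ref is made up for this example, it needs C++11 for std::to_string, and it does not check that the value is a character XML actually allows.

```cpp
#include <string>

// Hypothetical helper: formats a Unicode code point as a decimal
// numeric character reference, e.g. 163 (POUND SIGN) becomes "&#163;".
// No validation is done on the code point.
std::string numeric_char_ref(unsigned long code_point) {
    return "&#" + std::to_string(code_point) + ";";
}
```

With this, numeric_char_ref(163) gives the &#163; from the paragraph above. Whether you need it at all depends on the encoding of your output document: in a UTF-8 file the character can simply be written as is.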

The only way to avoid double encoding is to encode things only once. If you get the string "&gt;", how do you know whether it is already encoded and represents the string ">", or whether it is meant to represent the literal string "&gt;"?
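
A small sketch of why this is ambiguous. The names escape_once and escape_twice are invented for this example, and the escaper only covers the five predefined entities.

```cpp
#include <string>

// Minimal escaper (hypothetical name) for the five predefined entities.
std::string escape_once(const std::string& src) {
    std::string dst;
    for (char ch : src) {
        switch (ch) {
            case '&':  dst += "&amp;";  break;
            case '\'': dst += "&apos;"; break;
            case '"':  dst += "&quot;"; break;
            case '<':  dst += "&lt;";   break;
            case '>':  dst += "&gt;";   break;
            default:   dst += ch;       break;
        }
    }
    return dst;
}

// Escaping text that was already escaped escapes its '&' a second time.
std::string escape_twice(const std::string& src) {
    return escape_once(escape_once(src));
}
```

Here escape_twice(">") and escape_once("&gt;") both produce "&amp;gt;", so the output alone cannot tell you which input you started with. The caller has to know whether a string is raw or already escaped.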

The best way is to represent your XML as a DOM (with strings stored un-encoded) and use an XML serialiser like Xerces.

Oh, and remember there's no way to represent characters under 0x20 in XML (apart from &#x9;, &#xA; and &#xD;, which are whitespace).
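
If you want to detect such characters before writing the file, a check along these lines might do. It is a sketch under assumptions: the function name is made up, it works on bytes and therefore assumes an ASCII-compatible encoding such as UTF-8, and it only covers the range below 0x20 discussed here, not every character XML forbids.

```cpp
#include <string>

// Hypothetical helper: returns true if text contains a byte below 0x20
// other than tab (0x09), line feed (0x0A) or carriage return (0x0D).
bool has_forbidden_control_chars(const std::string& text) {
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(text[i]);
        if (ch < 0x20 && ch != 0x09 && ch != 0x0A && ch != 0x0D) {
            return true;
        }
    }
    return false;
}
```

A string containing a 0x00 byte makes this return true, which is the case the question asks about.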

davidsheldon
  • Most xml generators and xml readers are very generous with characters under 0x20; so that would be not that much of a problem. The xml 1.1 standard even formally accepts them (as character references, not the characters themselves). The exception is 0x00, which is not allowed in any shape or form. – Mr Lister Mar 28 '12 at 09:25
  • @MrLister read this http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/ – Dor Cohen Mar 28 '12 at 09:50
  • Yes, that article confirms that you can't store 0x00 chars in an XML file and demonstrates how to remove them. Does that help you? – Mr Lister Mar 28 '12 at 10:02