I'm trying to parse an XML file that contains Cyrillic letters, and I get a parse error: "unexpected end of data".

Here is the code I use for parsing, including the catch block that execution ends up in.

rapidxml::xml_document<TCHAR> doc;
rapidxml::xml_node<TCHAR>* rootNode;

// Helping in the debug
// std::cout << nElementIndex << std::endl;

const int SIZE = 300;
LPWSTR indirectString = new wchar_t[SIZE];

TCHAR* temp = m_vecContainer[nElementIndex].xml.GetBuffer();

try 
{
    doc.parse<0>(&temp[0]);
}
catch (const rapidxml::parse_error& e)
{
    // e.what() holds the parser's message ("unexpected end of data" in this case)
    return ERROR_INVALID_FUNCTION;
}

Here is an example of what the xml.GetBuffer() method can return:

<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
  <RegistrationInfo>
    <Version>1.3.33.5</Version>
    <Description>Поддържа актуален софтуера ви от Google. Ако тази задача е деактивирана или спряна, софтуерът ви от Google няма да е актуален, което означава, че ако в сигурността възникне уязвимост, тя няма да бъде коригирана и е възможно някои функции да не работят. Тази задача се деинсталира сама, когато няма софтуер от Google, който да я използва.</Description>
    <URI>\GoogleUpdateTaskMachineCore</URI>
  </RegistrationInfo>
...
</Task>

Can someone help me? I cannot find any useful information on the internet.

Thanks in advance.

Mario
  • Are you compiling with TCHAR = char or wchar_t? RapidXML works fine with UTF-8, but you're passing it wide chars. If they're UTF-16 then it may possibly work, but if they're something else, all bets are off. (And the encoding attribute in the XML isn't relevant!) – Roddy May 26 '17 at 19:41
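
To make the comment above concrete: one way to sidestep the wide-character question is to convert the UTF-16 text to UTF-8 and parse it as plain `char`. Below is a minimal sketch using only the standard library. `utf16_to_utf8` is a hypothetical helper, not part of RapidXML, and it assumes the buffer really holds UTF-16 code units (which is what `wchar_t` holds on Windows).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of RapidXML): converts UTF-16 code units
// to UTF-8. Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: only valid when a low surrogate follows.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
            else
            {
                cp = 0xFFFD;
            }
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // low surrogate without a preceding high surrogate
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    // Sanity check: every UTF-16 code unit yields at least one UTF-8 byte.
    assert(out.size() >= in.size());
    return out;
}
```

The resulting `std::string` could then be handed to a `rapidxml::xml_document<char>`. Because RapidXML parses in place, the string has to stay alive and writable for as long as the document is used.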

1 Answer

I am not familiar with rapidxml, but a quick search shows that it handles UTF-8 input by default, so the Cyrillic letters are probably not your problem. I would focus on the "unexpected end of data" message instead and confirm that the XML feed obeys the strict XML rules. Try these tools:

http://www.xmlvalidation.com/
http://www.utilities-online.info/xsdvalidation/#.WSgPG2iGOUk

If your XML is valid, I'm sorry, but I don't have any other clues to offer.

Good luck!

George Dimitriadis
  • Thank you for your quick comment. I tried both tools before writing this question, and both validate the XML file. I have also read the online manual and found that rapidxml should work with UTF-8 as well as with wchar_t and TCHAR strings. I also tried changing the Cyrillic text in the XML file to Latin, and everything works fine... That is why I guess the Cyrillic is the problem. – Mario May 26 '17 at 11:29
  • @Mario Ah, then I would agree with you that the contents of the XML are to blame. However, before we blame Cyrillic directly, can you try replacing all the Cyrillic with Latin and mixing just a single Cyrillic character into the text? If that doesn't fail, you have something trickier to deal with. – George Dimitriadis May 26 '17 at 11:34
  • Hm, very strange, and you were right. I mixed Cyrillic and Latin, and if I add one word or one letter, the parser works. But if I add a couple of sentences, it does not work. – Mario May 26 '17 at 12:27
  • Are the "couple of sentences" your own text or copy-pasted? I would try to see what kind of data is encoded in those sentences and figure out how to make the parser handle them. Sometimes there are "invisible" characters inside a copied string. You can spot them while backspacing through the string: at some point you hit backspace and nothing seems to happen, and that is when you have deleted an invisible character. No idea how to deal with those, though... – George Dimitriadis May 26 '17 at 16:45
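
One way to check for the "invisible" characters mentioned above is to scan the buffer programmatically instead of backspacing through it. The following is a minimal sketch, assuming the text is UTF-16. `find_invisible_units` is a hypothetical helper, and the set of characters it flags is an illustrative choice, not an exhaustive list. An embedded NUL is included because a parser that reads a zero-terminated string stops at the first NUL, which could plausibly show up as an early end of data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the offsets of UTF-16 code units that are
// usually invisible in an editor: embedded NULs, other control characters
// (except tab, LF and CR), DEL, zero-width/directional marks and the BOM.
std::vector<std::size_t> find_invisible_units(const char16_t* buf, std::size_t len)
{
    assert(buf != nullptr || len == 0);

    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < len; ++i)
    {
        const char16_t c = buf[i];
        const bool control =
            (c < 0x20 && c != u'\t' && c != u'\n' && c != u'\r') || c == 0x7F;
        const bool zero_width =
            (c >= 0x200B && c <= 0x200F) || c == 0x2060 || c == 0xFEFF;
        if (control || zero_width)
        {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

The function would be called with the raw buffer and the length reported by the string class (an assumption about the surrounding code, since only `GetBuffer()` appears in the question). An empty result means none of the flagged characters are present; any reported offset points at a code unit worth inspecting in a hex view.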