2

I am parsing the following XML using XmlPullParser with Jsoup:

<title>(??????) [????]0 BLACK LAGOON -???? &middot; ????- ?01-09?</title>
        <guid isPermaLink='true'>http://fenopy.eu/torrent/+black+lagoon+A+01+09+/OTcyOTA3Mw</guid>
        <pubDate>Wed, 27 Feb 2013 11:00:04 GMT</pubDate>
        <category>Anime</category>
        <link>http://fenopy.eu/torrent/+black+lagoon+A+01+09+/OTcyOTA3Mw</link>
        <enclosure url="http://fenopy.eu/torrent/-BLACK-LAGOON-01-09-/OTcyOTA3Mw==/download.torrent" length="569296173" type="application/x-bittorrent" />
        <description><![CDATA[ Category: Anime<br/>Size: 542.9 MB<br/>Ratio: 0 seeds, 3 leechers<br/> ]]></description>
        </item>

Here is my parsing code

int eventType = parser.getEventType();

while (eventType != XmlPullParser.END_DOCUMENT) {
    switch (eventType) {
    // at start of document: START_DOCUMENT
    case XmlPullParser.START_DOCUMENT:
        break;

    // at start of a tag: START_TAG
    case XmlPullParser.START_TAG:
        // get tag name
        String tagName = parser.getName();

        if (tagName.equalsIgnoreCase(TAG_TITLE)) {
            // nextText() is the call that throws on "&middot;"
            String t = parser.nextText();
        }
        break;
    }
    // advance to the next parsing event
    eventType = parser.next();
}

When I call nextText(), it throws the following exception:

org.xmlpull.v1.XmlPullParserException: unresolved: &middot; (position:TEXT (??????) [????] ...@36:59 in java.io.StringReader@40540698) 
at org.kxml2.io.KXmlParser.exception(KXmlParser.java:273)
at org.kxml2.io.KXmlParser.error(KXmlParser.java:269)
at org.kxml2.io.KXmlParser.pushEntity(KXmlParser.java:818)
at org.kxml2.io.KXmlParser.pushText(KXmlParser.java:849)
at org.kxml2.io.KXmlParser.nextImpl(KXmlParser.java:354)
at org.kxml2.io.KXmlParser.next(KXmlParser.java:1378)
at org.kxml2.io.KXmlParser.nextText(KXmlParser.java:1432)
AZ_

3 Answers

7

I was dealing with the same problem and found a very easy solution:

xmlPullParser.setFeature(Xml.FEATURE_RELAXED, true);
GuirNab
  • To be clear: import android.util.Xml; ... XmlPullParser parser = factory.newPullParser(); parser.setFeature(Xml.FEATURE_RELAXED, true); – SandroMarques Jan 13 '17 at 11:18
1

Your XML isn't valid: &middot; is not a predefined entity reference in XML.

There are 5 predefined entity references in XML:

&lt; < less than

&gt; > greater than

&amp; & ampersand

&apos; ' apostrophe

&quot; " quotation mark
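To see the difference in practice, here is a minimal sketch that uses the JDK's built-in DOM parser rather than the asker's KXmlParser. The class and method names (`EntityCheck`, `isWellFormed`) are made up for illustration. It only shows that a standard XML parser accepts a predefined entity and rejects an undeclared HTML one.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class EntityCheck {
    // Returns true if the JDK parser accepts the string as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised, for example, for an entity that was referenced but never declared.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<title>a &amp; b</title>"));    // prints true
        System.out.println(isWellFormed("<title>a &middot; b</title>")); // prints false
    }
}
```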

Updated

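Simply use a regex to strip all entity references from the XML:

XMLString = XMLString.replaceAll("(&[^\\s]+?;)", "");

This replaces &middot; with "". Note that it also removes the five predefined entities listed above, such as &amp;.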
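If the predefined entities should survive, one possible variant is a negative lookahead that skips them. This is only a sketch: `EntityStripper` and `stripUnknownEntities` are hypothetical names, and it assumes entity names consist of word characters only.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class EntityStripper {
    // Matches "&name;" unless name is one of the five predefined XML entities.
    // Numeric references such as "&#183;" are not matched, because '#' is not a word character.
    private static final Pattern NON_XML_ENTITY =
            Pattern.compile("&(?!(?:lt|gt|amp|apos|quot);)\\w+;");

    // Removes every named entity reference that XML does not predefine.
    static String stripUnknownEntities(String xml) {
        return NON_XML_ENTITY.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // &middot; is dropped, &amp; is kept.
        System.out.println(stripUnknownEntities("<title>A &middot; B &amp; C</title>"));
    }
}
```

The cleaned string can then be handed to the pull parser in place of the original input.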

AZ_
MrSantaK
  • Yeah, but there are so many invalid characters for XML. Is there anything that would handle this automatically, like setting a feature such as XmlPullParser.FEATURE_PROCESS_DOCDECL? Do you know of anything? – AZ_ Feb 27 '13 at 12:51
  • I'm afraid these references are outside the XML specification; they come from the HTML spec. As far as I can see, your file mixes XML and HTML. There is a method, defineEntityReplacementText, that defines a replacement for an entity reference, but you would have to define each reference you use. Here is a list of HTML entities: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references – MrSantaK Feb 27 '13 at 13:45
  • Would you write a regex for me that removes every word that starts with & and ends with ;? – AZ_ Feb 27 '13 at 13:48
  • There was a similar problem at http://stackoverflow.com/questions/10438694/java-convert-named-html-entities-to-numbered-xml-entities , maybe you can use that solution? – MrSantaK Feb 27 '13 at 13:59
  • Or you can use the Apache Commons Lang library. The method StringEscapeUtils.unescapeHtml4 converts entity references to Unicode characters, but you should take care with the encoding. – MrSantaK Feb 27 '13 at 14:04
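The idea from the comments of converting named HTML entities to numeric character references could look roughly like the sketch below. The class `NamedToNumeric` and its three-entry table are purely illustrative assumptions; a real feed would need the full HTML entity list or a library that provides it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class NamedToNumeric {
    // Deliberately tiny, illustrative table of HTML entity names to code points.
    private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
    static {
        CODE_POINTS.put("middot", 0x00B7);
        CODE_POINTS.put("nbsp", 0x00A0);
        CODE_POINTS.put("copy", 0x00A9);
    }

    private static final Pattern NAMED = Pattern.compile("&(\\w+);");

    // Rewrites known named entities as "&#NNN;", which any XML parser accepts.
    static String toNumeric(String xml) {
        Matcher m = NAMED.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Names missing from the table (including &amp;, &lt;, ...) are left unchanged.
            String replacement = (codePoint != null) ? "&#" + codePoint + ";" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumeric("<title>A &middot; B &amp; C</title>"));
    }
}
```

Unlike stripping, this keeps the character in the parsed text, at the cost of maintaining the lookup table.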
1

Maybe you can do:

parser.setInput(...);
parser.defineEntityReplacementText("middot", "·");

Since this does not work with your implementation:

Use the HTML unescaping from Apache commons-lang, as these appear to be HTML named entities:

String xml = "<foo>Hello &middot; World!</foo>";
xml = StringEscapeUtils.unescapeHtml(xml);

In response to the question in the comments:

Replacing all entities indiscriminately:

String xml = "<...";

// Place all entities like "&middot;" in square brackets: "[middot]":
xml = xml.replaceAll("\\&(\\w+);", "[$1]");

// But not for the xml entities:
xml = xml.replaceAll("\\[(lt|gt|amp|quot|apos)\\]", "&$1;");
Joop Eggen
  • I did this after setting the input, but it has no effect. Do you know anything about setting XmlPullParser.FEATURE_PROCESS_DOCDECL so that it does not throw an exception on such elements? – AZ_ Feb 27 '13 at 12:49
  • 1
    Sorry no. You are thinking of first prefixing ` ]>...` ? Maybe before setInput (not so with my implementation)? – Joop Eggen Feb 27 '13 at 13:05
  • I want to remove all invalid characters from the XML. Is there a regex or a library available to do so? I want to remove all of them. – AZ_ Feb 27 '13 at 13:14