Unable to parse element attribute with XOM

Question

I'm attempting to parse an RSS feed using the XOM Java library. Each entry's image URL is stored in the src attribute of an <img> element, as seen below.

<rss version="2.0">
  <channel>
    <item>
      <title>Decision Paralysis</title>
      <link>https://xkcd.com/1801/</link>
      <description>
        <img src="https://imgs.xkcd.com/comics/decision_paralysis.png"/>
      </description>
      <pubDate>Mon, 20 Feb 2017 05:00:00 -0000</pubDate>
      <guid>https://xkcd.com/1801/</guid>
    </item>
  </channel>
</rss>

Calling .getFirstChildElement("img") on the <description> element returns null, so my code crashes with a NullPointerException when I try to read the src attribute. Why is my program failing to read the <img> element, and how can I read it properly?

import nu.xom.*;

public class RSSParser {
    public static void main(String[] args) {
        try {
            Builder parser = new Builder();
            Document doc = parser.build ( "https://xkcd.com/rss.xml" );
            Element rootElement = doc.getRootElement();
            Element channelElement = rootElement.getFirstChildElement("channel");
            Elements itemList = channelElement.getChildElements("item");

            // Iterate through itemList
            for (int i = 0; i < itemList.size(); i++) {
                Element item = itemList.get(i);
                Element descElement = item.getFirstChildElement("description");
                Element imgElement = descElement.getFirstChildElement("img");
                // Crashes with NullPointerException
                String imgSrc = imgElement.getAttributeValue("src");
            }
        }
        catch (Exception error) {
            error.printStackTrace();
            System.exit(1);
        }
    }
}

Elliotte Rusty Harold · Answer 1 · 2016-12-03T14:21:15.457

0

There is no img element in the item, so getFirstChildElement("img") returns null. To avoid the crash, try

  if (imgElement != null) {
    String imgSrc = imgElement.getAttributeValue("src");
  }

What the item contains is this:

<description>&lt;img
    src="http://imgs.xkcd.com/comics/us_state_names.png"
    title="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes."
    alt="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes." /&gt;
</description>

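That's not an img element. It's plain text: the angle brackets are escaped as &lt; and &gt;, so the parser treats the whole thing as character data rather than markup.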
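
If you want to confirm this yourself, here is a minimal sketch. It assumes a made-up one-item feed with a placeholder example.com URL, and it uses the JDK's built-in DOM parser instead of XOM so that it runs without any extra jars. The class and method names are hypothetical. In XOM the corresponding checks would be descElement.getChildElements().size() and descElement.getValue().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EscapedMarkupDemo {
    // Hypothetical one-item feed; the URL is a placeholder, not real feed data.
    static final String FEED =
        "<rss version=\"2.0\"><channel><item><description>"
        + "&lt;img src=\"https://example.com/comic.png\" /&gt;"
        + "</description></item></channel></rss>";

    // Parses the XML string and returns its first <description> element.
    static Element firstDescription(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Element) doc.getElementsByTagName("description").item(0);
    }

    // Counts the child nodes that are elements (as opposed to text).
    static int countChildElements(Element parent) {
        int count = 0;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Element desc = firstDescription(FEED);
        // prints: child elements: 0
        System.out.println("child elements: " + countChildElements(desc));
        // prints: text content: <img src="https://example.com/comic.png" />
        System.out.println("text content: " + desc.getTextContent());
    }
}
```

The description has zero child elements, and its text content is the unescaped markup. That is why asking for a child element named img gives you null.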

edited Dec 03 '16 at 14:21

answered Nov 30 '16 at 23:11

This doesn't solve the problem of not being able to parse `img src=` – Stevoisiak Feb 17 '17 at 15:52

score 0 · Answer 2 · answered Feb 17 '17 at 16:20

I managed to come up with a somewhat hacky solution using regex and pattern matching.

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Compile the pattern once, outside the loop. Group 1 captures the URL between the quotes.
Pattern pattern = Pattern.compile("src=\"([^\"]*)\"");

// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
    Element item = itemList.get(i);
    // getValue() returns the unescaped text content of <description>
    String descString = item.getFirstChildElement("description").getValue();

    // Parse image URL (hacky)
    String imgSrc = "";
    Matcher matcher = pattern.matcher(descString);
    if (matcher.find()) {
        imgSrc = matcher.group(1);
    }
}
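
A possible alternative to the regex, when the escaped markup happens to be well-formed XML, is to run the description text through an XML parser a second time and read src as an ordinary attribute. The sketch below is only that: it uses the JDK's built-in DOM parser rather than XOM so it runs standalone, the class and method names are hypothetical, and the example.com URL is a placeholder. With XOM I would expect new Builder().build(descString, null).getRootElement().getAttributeValue("src") to do the same job, but I have not run that against this feed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ImgSrcFromDescription {
    // Parses the unescaped description text as its own small XML document.
    // Returns the root element's src attribute, or "" if the text is not
    // well-formed XML or its root element is not <img>.
    static String imgSrc(String descriptionText) {
        try {
            Document fragment = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(descriptionText.trim())));
            Element root = fragment.getDocumentElement();
            return "img".equals(root.getTagName()) ? root.getAttribute("src") : "";
        } catch (Exception e) {
            // Not parseable as XML (e.g. HTML-style markup or plain text).
            return "";
        }
    }

    public static void main(String[] args) {
        // Placeholder input standing in for the value of <description>.
        String desc = "<img src=\"https://example.com/comic.png\" alt=\"demo\" />";
        System.out.println(imgSrc(desc)); // prints: https://example.com/comic.png
    }
}
```

This fails (returning an empty string here) for descriptions that are not well-formed XML, such as HTML with an unclosed <img> tag, so the regex may still be the more forgiving option.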
