
I am parsing a dirty HTML page with XmlSlurper, and I get the following error:

ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        ...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".

I print the HTML right before feeding it to the parser. If I open that output and go to the line mentioned in the error, 1157, there is no 'src' there (although there are hundreds of such strings elsewhere in the file). So I guess some additional content is inserted (maybe <script> or something like that) that shifts the line numbers.

Is there a good way to find the exact offending line or piece of HTML?

Persimmonium
  • The error mentions "scr", you're saying you can't find "src". Is that a typo, or are you searching the document for the wrong thing? – Spencer Kormos Jan 05 '12 at 17:16
  • I was using TagSoup too till I found NekoHTML. I can't remember the exact reason but TagSoup just wasn't working out. You can see an example of how to use NekoHTML here - http://stackoverflow.com/questions/9260461/gpath-to-find-if-a-table-header-contains-a-matching-string. – Gaurav Feb 13 '12 at 14:25

2 Answers


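You could add an attribute (named _srmLineNum in the code below) to each element, holding the line number at which the parser encountered it. You can then read that attribute from the parsed result.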

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Attributes2Impl;
import javax.xml.parsers.ParserConfigurationException;

class MySlurper extends XmlSlurper {    
    public static final String LINE_NUM_ATTR = "_srmLineNum"
    Locator locator

    public MySlurper() throws ParserConfigurationException, SAXException {
        super();
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
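        // Copy the original attributes, then append the line number the parser currently reports.
        Attributes2Impl newAttrs = new Attributes2Impl(attrs);
        // The extra attribute is unqualified (empty namespace URI) and plain text (CDATA).
        newAttrs.addAttribute("", LINE_NUM_ATTR, LINE_NUM_ATTR, "CDATA", "" + locator.getLineNumber());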
        super.startElement(uri, localName, qName, newAttrs);
    }
}

def text = '''
<root>
  <a>one!</a>
  <a>two!</a>
</root>'''

def root = new MySlurper().parseText(text)

root.a.each { println it.@_srmLineNum }

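The above adds the line-number attribute to every element that parses successfully. For the failure itself, you could try setting your own error handler, which can read the line number from the locator or from the SAXParseException it receives.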
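As a rough sketch of that idea, the snippet below catches the SAXParseException and prints the reported line taken from the very same string that was handed to the parser, so the line numbers cannot drift between what you inspect and what was parsed. It is written in plain Java against the JDK's SAX API rather than XmlSlurper. The names ParseErrorLocator and locate are made up for illustration, and the sample input (a script that writes "<scr"+"ipt") is only one guess at what might trigger this message, not a diagnosis of the page in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; not part of XmlSlurper or any library.
class ParseErrorLocator {

    // Returns null if xml is well-formed. Otherwise returns the position the
    // parser reported plus the text of that line from the string that was parsed.
    static String locate(String xml) {
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            int line = e.getLineNumber(); // 1-based; -1 when unknown
            String[] lines = xml.split("\r\n|\r|\n", -1);
            String text = (line >= 1 && line <= lines.length) ? lines[line - 1] : "<unknown>";
            return "line " + line + ", column " + e.getColumnNumber() + ": " + text;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what we are looking for.
            throw new IllegalStateException("could not run the parser", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample: a script that splits a tag name across two string literals.
        String html = String.join("\n",
                "<html>",
                "<body>",
                "<script>document.write(\"<scr\"+\"ipt src='x.js'></scr\"+\"ipt>\");</script>",
                "</body>",
                "</html>");
        System.out.println(locate(html));
    }
}
```

With XmlSlurper the same approach should carry over: wrap parseText in a try/catch for SAXParseException and index into the string you passed in, rather than into a separately saved copy of the page.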

preetham

Which SAX parser are you using? HTML is not strict XML, so using XmlSlurper with the default parser will probably keep producing errors like this one.

A cursory Google search for "Groovy html slurper" led me to "HTML Scraping With Groovy", which points to a SAX parser called TagSoup.

Give that a whirl and see if it parses the dirty page.

Spencer Kormos
  • Thanks, I already tried TagSoup and got nowhere. My code was working fine with XmlSlurper and the default parser until a few days ago, when something changed in the page I ingest. I fix the offending parts in code myself before using XmlSlurper; the issue is that I cannot find the offending part now... – Persimmonium Jan 05 '12 at 23:07
  • I'm accepting this although it is not an answer to my question. I gave TagSoup another go, and this time it worked fine. – Persimmonium Jan 09 '12 at 08:35