I'm parsing XML documents from PubMed Central, and sometimes I find paragraphs with nested tables, like the example below. Is there a way in R to get the paragraph text and exclude the table?
library(XML)
doc <- xmlParse("<sec><p>Text</p>
<p><em>More</em> text<table>
<tr><td>SKIP</td><td>this</td></tr>
</table></p>
</sec>")
xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text" "More textSKIPthis"
I'd like to return paragraphs without the nested table rows.
[1] "Text" "More text"
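For reference, here is a minimal sketch of two ways this could be done with the XML package, assuming the tables can sit at any depth inside a paragraph. The variable names `keep_text` and `paras` are only illustrative. The first approach leaves the document untouched and collects only the text nodes that have no `table` ancestor; the second removes the table nodes from the parsed document in place and then calls `xmlValue` as before.

```r
library(XML)

doc <- xmlParse("<sec><p>Text</p>
<p><em>More</em> text<table>
<tr><td>SKIP</td><td>this</td></tr>
</table></p>
</sec>")

# Option 1 (non-destructive): for each paragraph, keep only the text
# nodes that are not inside a <table>, then paste them back together.
keep_text <- sapply(getNodeSet(doc, "//sec/p"), function(p) {
  parts <- xpathSApply(p, ".//text()[not(ancestor::table)]", xmlValue)
  paste(unlist(parts), collapse = "")
})

# Option 2 (modifies doc): unlink every <table> nested in a paragraph,
# after which xmlValue no longer sees the table cells.
removeNodes(getNodeSet(doc, "//sec/p//table"))
paras <- xpathSApply(doc, "//sec/p", xmlValue)
paras
# [1] "Text"      "More text"
```

Option 2 is the simpler one if the tables are not needed afterwards, since it returns one string per paragraph without extra pasting. Option 1 may be preferable when the same parsed document is reused to extract the tables later.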
, so I added one to the original question. All these options are really helpful and in my case, I'll probably remove the table nodes so I can get back a vector with 2 paragraphs. Thanks again.
– Chris S. Sep 16 '14 at 21:55