Get XML paragraphs without nested tables

Question

I'm parsing XML docs from PubMed Central and sometimes I find paragraphs with nested tables like the example below. Is there a way in R to get the text and exclude the table?

library(XML)
doc <- xmlParse("<sec><p>Text</p>
  <p><em>More</em> text<table>
   <tr><td>SKIP</td><td>this</td></tr>
  </table></p>
 </sec>")

xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text"              "More textSKIPthis"

I'd like to return paragraphs without the nested table rows.

[1] "Text"      "More text"

jdharrison · Accepted Answer · 2014-09-16T21:36:21.660

3

You can remove the nodes you don't want. In this example I remove the nodes matched by the XPath expression //sec/p/table:

library(XML)
doc <- xmlParse("<sec><p>Text</p>
  <p>More text<table>
   <tr><td>SKIP</td><td>this</td></tr>
                </table></p>
                </sec>")


xpathSApply(doc, "//sec/p/table", removeNodes)
xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text"      "More text"
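Note that removeNodes changes doc in place, so the tables are gone from doc afterwards. One way around that, sketched here on the assumption that xmlClone from the XML package is used to copy the whole parsed document, is to remove the tables from a copy. The names work and cleaned are only illustrative:

```r
library(XML)

doc <- xmlParse("<sec><p>Text</p>
  <p>More text<table>
   <tr><td>SKIP</td><td>this</td></tr>
  </table></p>
 </sec>")

# Work on a copy so the original document keeps its tables.
work <- xmlClone(doc)

# removeNodes also accepts a list of nodes; invisible() hides its return value.
invisible(removeNodes(getNodeSet(work, "//sec/p/table")))

# One string per paragraph, taken from the copy.
cleaned <- xpathSApply(work, "//sec/p", xmlValue)
cleaned

# The original document should still contain its table.
length(getNodeSet(doc, "//sec/p/table"))
```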

If you want to keep your doc intact, you could instead select only the child nodes of p that are not table elements:

library(XML)
doc <- xmlParse("<sec><p>Text</p>
  <p>More text<table>
   <tr><td>SKIP</td><td>this</td></tr>
                </table></p>
                </sec>")
xpathSApply(doc, "//sec/p/node()[not(self::table)]", xmlValue)
[1] "Text"      "More text"

Or simply select the text nodes directly under each p:

xpathSApply(doc, "//sec/p/text()", xmlValue)
[1] "Text"      "More text"

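Which is best will depend on the complexity of your real-world case: the last two expressions return one element per matched node, not one per paragraph, and text() only sees text directly under p.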
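For paragraphs that contain inline tags, as in the question's `<em>` example, one option that leaves doc untouched is to paste together every text node under each p that has no table ancestor. This is a minimal sketch rather than a definitive solution; p_text is a hypothetical helper name, not part of the XML package:

```r
library(XML)

# The document from the question, with an inline <em> inside the second <p>.
doc <- xmlParse("<sec><p>Text</p>
  <p><em>More</em> text<table>
   <tr><td>SKIP</td><td>this</td></tr>
  </table></p>
 </sec>")

# Hypothetical helper: collect all text nodes below one <p>, at any depth,
# skipping those inside a <table>, and join them into a single string.
p_text <- function(p) {
  parts <- xpathSApply(p, ".//text()[not(ancestor::table)]", xmlValue)
  paste(unlist(parts), collapse = "")
}

# One string per paragraph; doc itself is not modified.
paras <- sapply(getNodeSet(doc, "//sec/p"), p_text)
paras
# [1] "Text"      "More text"
```

Because the filter is on ancestors, text nested in tags such as em is kept while anything below a table is dropped, however deep it sits.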

edited Sep 16 '14 at 21:36

answered Sep 16 '14 at 20:31


That's nice. I wasn't yet aware of `removeNodes` – Rich Scriven Sep 16 '14 at 20:34
Yes, it's pretty useful in certain cases. – jdharrison Sep 16 '14 at 20:35
Thanks, that is useful - I'll just need to get //table before getting //p in these rare cases. – Chris S. Sep 16 '14 at 21:17
If you don't want to delete nodes, you can select nodes within `p` that are not `table` nodes. – jdharrison Sep 16 '14 at 21:23
There are usually lots of tags within the `<p>`, so I added one to the original question. All these options are really helpful and in my case, I'll probably remove the table nodes so I can get back a vector with 2 paragraphs. Thanks again. – Chris S. Sep 16 '14 at 21:55
