I'm parsing XML docs from PubMed Central and sometimes I find paragraphs with nested tables like the example below. Is there a way in R to get the text and exclude the table?

library(XML)
doc <- xmlParse("<sec><p>Text</p>
  <p><em>More</em> text<table>
   <tr><td>SKIP</td><td>this</td></tr>
  </table></p>
 </sec>")

xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text"              "More textSKIPthis"

I'd like to return paragraphs without the nested table rows.

[1] "Text"      "More text"
Chris S.

1 Answer

You can remove the nodes you don't want. In this example, I remove the nodes matched by the XPath expression //sec/p/table. Note that removeNodes modifies doc in place.

library(XML)
doc <- xmlParse("<sec><p>Text</p>
  <p>More text<table>
   <tr><td>SKIP</td><td>this</td></tr>
                </table></p>
                </sec>")


xpathSApply(doc, "//sec/p/table", removeNodes)
xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text"      "More text"

If you want to keep your doc intact, you could instead select only the child nodes that are not tables:

library(XML)
doc <- xmlParse("<sec><p>Text</p>
  <p>More text<table>
   <tr><td>SKIP</td><td>this</td></tr>
                </table></p>
                </sec>")
xpathSApply(doc, "//sec/p/node()[not(self::table)]", xmlValue)
[1] "Text"      "More text"

Or, more simply, select only the text nodes directly under each paragraph:

xpathSApply(doc, "//sec/p/text()", xmlValue)
[1] "Text"      "More text"

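Which is best will depend on the complexity of your real-world case. Both non-destructive queries return one value per matching child node rather than one per paragraph, so they only give this result when each paragraph's text is a single text node. With the <em>More</em> markup from the question, the node() query returns "Text", "More", " text", and the text() query returns "Text", " text".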
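For paragraphs that contain inline markup such as <em>, one possible non-destructive variant is to collect, per paragraph, every descendant text node that has no table ancestor and paste the pieces together. This is only a sketch against the question's example document; p_text is a hypothetical helper name, not part of the XML package.

```r
library(XML)

# The question's example, including the inline <em> element
doc <- xmlParse("<sec><p>Text</p>
  <p><em>More</em> text<table>
   <tr><td>SKIP</td><td>this</td></tr>
  </table></p>
 </sec>")

# Hypothetical helper: text of a single <p> node, skipping any <table> subtree
p_text <- function(p) {
  # all descendant text nodes of p that are not inside a <table>
  parts <- xpathSApply(p, ".//text()[not(ancestor::table)]", xmlValue)
  # unlist() covers the case where nothing matches and a list is returned
  paste(unlist(parts), collapse = "")
}

# One string per paragraph; doc itself is left unchanged
paras <- vapply(getNodeSet(doc, "//sec/p"), p_text, character(1))
paras
```

On the example this should give one string per paragraph ("Text" and "More text") while leaving the table in doc. Text inside tables nested deeper in a paragraph is skipped as well, because the test is on any table ancestor rather than a direct parent.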

jdharrison