I have the following two minimal XML files:
history1.xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
<page>
<title>AccessibleComputing</title>
</page>
<page>
<title>History</title>
</page>
</mediawiki>
history2.xml
<mediawiki>
<page>
<title>AccessibleComputing</title>
</page>
<page>
<title>History</title>
</page>
</mediawiki>
Note that the only difference between the two files is the set of attributes on the root "mediawiki" element. I'm trying to extract all page titles with R. With the first file, I run
library("XML")
doc <- xmlParse("history1.xml", useInternalNodes = TRUE)
titles <- xpathSApply(doc, "//page/title", xmlValue)
titles
and get an empty list as output:
list()
If I use the second XML file instead:
library("XML")
doc <- xmlParse("history2.xml", useInternalNodes = TRUE)
titles <- xpathSApply(doc, "//page/title", xmlValue)
titles
I get what I want, namely:
[1] "AccessibleComputing" "History"
The problem is that I am downloading these files from Wikipedia, and deleting the attributes by hand every time is not practical. So my questions are:
1) Why does the second file work while the first does not?
2) Is there a way to make the query work on the first file as it is?
3) If not, can I automate removing the attributes in R?
Any help is much appreciated!
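
One direction that may be worth trying, sketched under the assumption that the cause is the default namespace declared by the xmlns attribute in history1.xml: the elements in that file then belong to that namespace, while an unprefixed XPath step such as //page only matches elements in no namespace. The prefix "mw" below is an arbitrary name chosen for illustration, and the XML is inlined as a string (a trimmed copy of history1.xml) so that the snippet runs on its own.

```r
library(XML)

# Trimmed inline copy of history1.xml, so the sketch is self-contained
xml_text <- '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8" xml:lang="en">
<page>
<title>AccessibleComputing</title>
</page>
<page>
<title>History</title>
</page>
</mediawiki>'

doc <- xmlParse(xml_text, asText = TRUE)

# "mw" is a made-up prefix; it only has to map to the namespace URI
# that the root element declares as its default namespace
ns <- c(mw = "http://www.mediawiki.org/xml/export-0.8/")

# Every step of the path is qualified with the prefix
titles <- xpathSApply(doc, "//mw:page/mw:title", xmlValue, namespaces = ns)
print(titles)
```

If the namespace URI is not known in advance, a namespace-agnostic query such as //*[local-name()='title'] might also work, at the cost of being less specific about which elements it matches.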