I'm trying to read and process a ~5.8GB .xml
file from the Wikipedia dumps using R. I don't have much RAM, so I would like to process it in chunks. (Currently, calling xml2::read_xml
on the whole file locks up my computer completely.)
The file contains one <page>
element for each Wikipedia page, like this:
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>631144794</id>
    <parentid>381202555</parentid>
    <timestamp>2014-10-26T04:50:23Z</timestamp>
    <contributor>
      <username>Paine Ellsworth</username>
      <id>9092818</id>
    </contributor>
    <comment>add [[WP:RCAT|rcat]]s</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
    <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
  </revision>
</page>
A sample of the file can be found here
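Parsing an individual <page> element is not the problem. If I save a single page as its own file (say page.xml, just as an example), something like this pulls out the fields I care about:

library(xml2)

# One <page> element saved on its own (e.g. the snippet above as "page.xml")
page <- read_xml("page.xml")

data.frame(
  id    = xml_text(xml_find_first(page, "/page/id")),
  title = xml_text(xml_find_first(page, "/page/title")),
  text  = xml_text(xml_find_first(page, "/page/revision/text")),
  stringsAsFactors = FALSE
)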
From my perspective, I would think it's possible to read it in chunks, something like one page element at a time, and save each processed page
element as a line in a .csv
file.
I would like to end up with a data.frame with the following columns:
id, title and text.
How can I read this .xml
file in chunks?
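Something like the sketch below is what I have in mind, but I'm not sure it's the right approach. (It assumes each <page> and </page> tag sits on its own line, and the file names enwiki-pages.xml and pages.csv are just placeholders.)

library(xml2)

# Sketch: stream the dump line by line, buffer one <page>...</page> block
# at a time, parse it, and append one row per page to a CSV.
con <- file("enwiki-pages.xml", open = "r")
out <- "pages.csv"
buffer <- character(0)
inside_page <- FALSE
first_page <- TRUE

while (length(line <- readLines(con, n = 1L, warn = FALSE)) > 0) {
  if (grepl("<page>", line, fixed = TRUE)) inside_page <- TRUE
  if (inside_page) buffer <- c(buffer, line)
  if (grepl("</page>", line, fixed = TRUE)) {
    page <- read_xml(paste(buffer, collapse = "\n"))
    row <- data.frame(
      id    = xml_text(xml_find_first(page, "/page/id")),
      title = xml_text(xml_find_first(page, "/page/title")),
      text  = xml_text(xml_find_first(page, "/page/revision/text")),
      stringsAsFactors = FALSE
    )
    # Write the header only once, then append one row per page
    write.table(row, out, sep = ",", append = !first_page,
                col.names = first_page, row.names = FALSE)
    first_page <- FALSE
    inside_page <- FALSE
    buffer <- character(0)
  }
}
close(con)

Reading one line at a time is probably very slow for a 5.8GB file, so maybe an event/SAX-style parser (something like XML::xmlEventParse?) is the proper tool here, which is really what I'm asking.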