I'm trying to read and process a ~5.8GB .xml
file from the Wikipedia dumps using R. I don't have much RAM, so I would like to process it in chunks. (Currently, calling xml2::read_xml
on the whole file locks up my computer completely.)
The file contains one <page>
element for each Wikipedia page, like this:
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>631144794</id>
    <parentid>381202555</parentid>
    <timestamp>2014-10-26T04:50:23Z</timestamp>
    <contributor>
      <username>Paine Ellsworth</username>
      <id>9092818</id>
    </contributor>
    <comment>add [[WP:RCAT|rcat]]s</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
    <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
  </revision>
</page>
A sample of the file can be found here
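Parsing an individual <page> element is not the problem. If I save a single page as its own file (say page.xml, just as an example), something like this pulls out the fields I care about:

library(xml2)

# One <page> element saved on its own (e.g. the snippet above as "page.xml")
page <- read_xml("page.xml")

data.frame(
  id    = xml_text(xml_find_first(page, "/page/id")),
  title = xml_text(xml_find_first(page, "/page/title")),
  text  = xml_text(xml_find_first(page, "/page/revision/text")),
  stringsAsFactors = FALSE
)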
From my perspective, I would think it's possible to read it in chunks, something like one page element at a time, and save each processed page
element as a line in a .csv
file.
I would like to end up with a data.frame with the following columns:
id, title and text.
How can I read this .xml
file in chunks?
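Something like the sketch below is what I have in mind, but I'm not sure it's the right approach. (It assumes each <page> and </page> tag sits on its own line, and the file names enwiki-pages.xml and pages.csv are just placeholders.)

library(xml2)

# Sketch: stream the dump line by line, buffer one <page>...</page> block
# at a time, parse it, and append one row per page to a CSV.
con <- file("enwiki-pages.xml", open = "r")
out <- "pages.csv"
buffer <- character(0)
inside_page <- FALSE
first_page <- TRUE

while (length(line <- readLines(con, n = 1L, warn = FALSE)) > 0) {
  if (grepl("<page>", line, fixed = TRUE)) inside_page <- TRUE
  if (inside_page) buffer <- c(buffer, line)
  if (grepl("</page>", line, fixed = TRUE)) {
    page <- read_xml(paste(buffer, collapse = "\n"))
    row <- data.frame(
      id    = xml_text(xml_find_first(page, "/page/id")),
      title = xml_text(xml_find_first(page, "/page/title")),
      text  = xml_text(xml_find_first(page, "/page/revision/text")),
      stringsAsFactors = FALSE
    )
    # Write the header only once, then append one row per page
    write.table(row, out, sep = ",", append = !first_page,
                col.names = first_page, row.names = FALSE)
    first_page <- FALSE
    inside_page <- FALSE
    buffer <- character(0)
  }
}
close(con)

Reading one line at a time is probably very slow for a 5.8GB file, so maybe an event/SAX-style parser (something like XML::xmlEventParse?) is the proper tool here, which is really what I'm asking.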