I have performed text mining on several XML files that I am now preparing for publication. The files contain text within segments (see the basic example below). Due to copyright restrictions, the published files must not contain the full text, but someone who has the original texts should still be able to 'reconstruct' the files. To make sure that basic text mining (= counting segment lengths in words) is still possible, the number of words per segment must not change. I am therefore looking for a way to replace every word in each segment, except the first and the last one, with dummy / placeholder text.

Basic example:

Input:

<text>
<div>
<seg xml:id="A">Lorem ipsum dolor sit amet</seg>
<seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg>
</div>
</text>

Output:

<text>
<div>
<seg xml:id="A">Lorem blank blank blank amet</seg>
<seg xml:id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>
macright

1 Answer

You can use rapply() to recursively replace values in a nested list.

Assume data.xml contains your input:

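library(tidyverse)
library(xml2)

read_xml("data.xml") %>%
  as_list() %>%
  # Apply the function to every text leaf of the nested list
  rapply(how = "replace", function(x) {
    # Split the segment text into words
    tokens <-
      x %>%
      str_split(" ") %>%
      simplify()

    n_tokens <- length(tokens)

    # Segments with fewer than three words have nothing to replace;
    # without this guard, rep() would fail on a negative count
    if (n_tokens < 3) {
      return(x)
    }

    # Keep the first and last word, blank out everything in between
    c(
      tokens[[1]],
      rep("blank", n_tokens - 2),
      tokens[[n_tokens]]
    ) %>%
      paste0(collapse = " ")
  }) %>%
  as_xml_document() %>%
  write_xml("data2.xml")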

Output file data2.xml (note that the xml:id attribute appears as id in this output):

<?xml version="1.0" encoding="UTF-8"?>
<text>
  <div>
    <seg id="A">Lorem blank blank blank amet</seg>
    <seg id="B">sed blank blank blank blank invidunt</seg>
  </div>
</text>
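
If the attribute names need to stay exactly as they are in the input, one possible alternative is to edit the text of each segment in place instead of converting the document to a list and back. The following is only a minimal sketch under some assumptions: the names blank_middle and blank_segments are hypothetical helpers made up for this illustration, every seg element is assumed to contain plain text only (no nested elements), and words are assumed to be separated by single spaces. The XPath uses local-name() so that it should also match seg elements in files that declare a default namespace.

```r
# Hypothetical helper: keep the first and last word of a segment and
# replace every word in between with a placeholder.
blank_middle <- function(text, placeholder = "blank") {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  n <- length(tokens)
  # With fewer than three words there is nothing between first and last
  if (n < 3) {
    return(text)
  }
  paste(c(tokens[1], rep(placeholder, n - 2), tokens[n]), collapse = " ")
}

# Hypothetical wrapper: rewrite the text of every <seg> in place and
# save the result. Requires the xml2 package to be installed.
blank_segments <- function(infile, outfile) {
  doc <- xml2::read_xml(infile)
  # Match <seg> regardless of any default namespace
  segs <- xml2::xml_find_all(doc, "//*[local-name() = 'seg']")
  new_text <- vapply(xml2::xml_text(segs), blank_middle,
                     character(1), USE.NAMES = FALSE)
  # Only the text is modified; attributes are not touched
  xml2::xml_set_text(segs, new_text)
  xml2::write_xml(doc, outfile)
}
```

With the input from the question saved as data.xml, calling blank_segments("data.xml", "data2.xml") would write the blanked copy. Because the word count per segment is unchanged, length statistics computed on the blanked file should match those of the original.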
danlooo
  • Thank you! I was already trying to replace text starting from a given position – your code is just so much cleaner and I'm still struggling with the pipe operator in my own work – but I couldn't figure out a way where I keep a segment's last word as well so that it is easier to reconstruct the original file. Can you think of any solution? – macright Feb 16 '22 at 12:34
  • @macright I revised my answer. The pipe operator (also `|>`) is just a 'then operator', meaning: take this, then add 1 to it, then replace some digits, then write the final result to a file. – danlooo Feb 16 '22 at 12:39