How to encode text correctly when importing word documents into R?

Question

I am trying to import the content of multiple Word documents into the same object in R. I am following Julia Silge and David Robinson's guide (see here: https://www.tidytextmining.com/usenet.html).

I am unable to figure out how to encode the "text" column correctly while importing.

Here is the code I am using:

# Define a function to read all files from a folder into a data frame

  read_folder <- function(infolder) {
    tibble(file = dir(infolder, full.names = TRUE)) %>%
      mutate(text = map(file, read_lines)) %>%
      transmute(id = basename(file), text) %>%
      unnest(text)
  }

# Use unnest() and map() to apply read_folder to each subfolder

  raw_text <- tibble(folder = dir(training_folder, full.names = TRUE)) %>%
    unnest(map(folder, read_folder)) %>%
    transmute(newsgroup = basename(folder), id, text)

Here is an example of the resulting text column:

 <f7><e5><95><e3><a9>O<af><a5><fa> PK

Will I have to change the encoding after importing the data?

I think their example is reading a text file. Can you save your Word doc as raw text? Or if you want to use the Word docs as-is, check out the `officer` package for a way to import the underlying text. https://davidgohel.github.io/officer/ - Jon Spring, Feb 22 '19 at 23:37
Thanks - I was hesitant at first to convert to txt because I have over 2000 files. I eventually batch converted all the files using the terminal and imported them. - Anavir, Feb 23 '19 at 02:45
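
For anyone who would rather skip the conversion step, here is a minimal sketch of the `officer` route suggested in the comment above. The `PK` in the sample output is consistent with a .docx file being a ZIP archive, which `read_lines()` reads as raw bytes rather than as text, so re-encoding after import is unlikely to help. The helper names `read_docx_text` and `read_docx_folder` are made up for illustration, and the sketch assumes every file in the folder is a valid .docx (not the older .doc format) and that only paragraph text is wanted.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(officer)

# Hypothetical helper: return the paragraph text of one .docx file
# as a character vector, one element per paragraph.
read_docx_text <- function(path) {
  doc <- read_docx(path)
  docx_summary(doc) %>%
    filter(content_type == "paragraph") %>%
    pull(text)
}

# Hypothetical helper: same shape as read_folder() in the question,
# but restricted to .docx files and read through officer.
read_docx_folder <- function(infolder) {
  tibble(file = dir(infolder, pattern = "\\.docx$", full.names = TRUE)) %>%
    mutate(text = map(file, read_docx_text)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}
```

Swapping `read_docx_folder` in for `read_folder` in the question's second snippet should give one row per paragraph per document. Text inside tables is reported under a different `content_type` and is dropped by the filter here.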
