I need some help with loading text-file data into R for analysis with packages like koRpus.

The problem I am facing is getting R to recognize a folder full of Word files (about 4,000) as data on which koRpus can then perform analyses such as Coleman-Liau indexing. If at all possible, I would prefer to make this work with Word files. The key difficulty is getting R to read the Word files in bulk (that is, all at the same time) so that koRpus can do its thing with them.

My attempts to make this work have all been in vain, but I know that packages like koRpus would be limited in usefulness if there were no way to get the package to do its work on a large collection of files all at once.

I hope this problem will make sense to someone, and that there is a tenable solution to it.

Thanks, Gordon


1 Answer

Looks like the readtext package should be able to help you out.

library(readtext)

Just specify the folder in the readtext() call. Like so:

# Read every supported file in the folder into a data frame
doc_df <- readtext("doc_files/")

I am not familiar with the koRpus package, but the text column of the resulting data frame should contain what you need for any further functions you want to use.

doc_df$text
#> [1] "Test1: a little bit of text" "Test2: no further text"     
#> [3] "Test3: lorem ipsum bla bla" 
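To illustrate working with the text column, here is a rough base R sketch of the Coleman-Liau index mentioned in the question. It is not koRpus's implementation: the helper name `coleman_liau_rough` is hypothetical, the word and sentence counting is deliberately naive (runs of letters count as words, runs of `.`, `!` or `?` count as sentence ends), and the `texts` vector merely stands in for `doc_df$text`. The coefficients are those of the standard published formula.

```r
# Hypothetical helper, not part of koRpus or readtext.
# Coleman-Liau index: 0.0588 * L - 0.296 * S - 15.8, where
#   L = letters per 100 words, S = sentences per 100 words.
coleman_liau_rough <- function(text) {
  # Count alphabetic characters only
  letters_n <- nchar(gsub("[^[:alpha:]]", "", text))
  # Naive word count: each run of letters is one word
  words_n <- lengths(regmatches(text, gregexpr("[[:alpha:]]+", text)))
  # Naive sentence count: runs of ., ! or ?; assume at least one sentence
  sentences_n <- pmax(1, lengths(regmatches(text, gregexpr("[.!?]+", text))))

  L <- letters_n / words_n * 100
  S <- sentences_n / words_n * 100
  0.0588 * L - 0.296 * S - 15.8
}

# Stand-in for doc_df$text, using the sample strings shown above
texts <- c("Test1: a little bit of text",
           "Test2: no further text",
           "Test3: lorem ipsum bla bla")

# One score per document; the function is vectorized over its input
scores <- coleman_liau_rough(texts)
```

With real data, `coleman_liau_rough(doc_df$text)` would return one value per file. For serious work, koRpus's own `coleman.liau()`, which operates on text prepared with `tokenize()`, is likely the better choice, since it handles tokenization far more carefully than this sketch.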

In response to your comments:

It looks like your folder contains several kinds of files and you are trying to filter them so that only docx files are processed. readtext() seems to support that kind of filtering, but the documentation says that this behavior depends on the operating system. My suggestion is to filter the files in the folder with R's dir() function before calling readtext():

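# pattern is a regular expression, so "docx" matches any file name containing "docx"
a <- dir("doc_files/", pattern = "docx", full.names = TRUE)
# Pass the vector of file paths to readtext() instead of the folder
doc_df <- readtext(a)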
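If the simple pattern still lets unwanted files through, a stricter filter may help. The sketch below is only one way to do it, using base R: the helper name `list_docx` and the demo file names are hypothetical. It anchors the pattern to the file extension and also drops the `~$` lock files that Word creates while a document is open, since those end in .docx but are not valid documents. Whether such files are behind any particular error is an assumption that would need checking.

```r
# Hypothetical helper: list only genuine .docx files in a folder.
list_docx <- function(folder) {
  # "\\.docx$" matches names that end in ".docx" (case-insensitive),
  # rather than any name that merely contains "docx"
  files <- list.files(folder, pattern = "\\.docx$",
                      full.names = TRUE, ignore.case = TRUE)
  # Drop Word's temporary lock files, whose names start with "~$"
  files[!startsWith(basename(files), "~$")]
}

# Small demo in a temporary folder with made-up file names
demo_dir <- file.path(tempdir(), "docx_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("case1.docx", "~$case1.docx",
                                  "notes_docx.txt", "case2.DOCX")))

docx_files <- list_docx(demo_dir)
basename(docx_files)

# With the readtext package installed, the result can be passed on:
# doc_df <- readtext::readtext(docx_files)
```

Only `case1.docx` and `case2.DOCX` survive the filter here; the lock file and the text file whose name happens to contain "docx" are excluded.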
Till
  • Thank you! I think this will be helpful, but I wonder if I can pose a followup: the following code library(readtext) folder <- readtext("/Users/Gordon/Desktop/WPSCASES/") produces an error message: Error: '/var/folders/q5/5npnr5mj5px7lnzxy8m2dg000000gn/T//RtmpgeqOWq/readtext-20aafecc5a624f2b6a5c05746b49a72d/word/document.xml' does not exist. In addition: Warning message: In utils::unzip(file, exdir = path) : error 1 in extracting from zip file The following also produces this message: library(readtext) folder <- readtext("/Users/Gordon/Desktop/WPSCASES/*.docx") – Gordon Ballingrud Nov 02 '20 at 19:21
  • the formatting is pretty messy there. Never used this forum before. I apologize for the impenetrable block-text format. – Gordon Ballingrud Nov 02 '20 at 19:23
  • Now, this code: texts <- readtext(paste0("/Users/Gordon/Desktop/WPSCASES/", "/word/*.docx")) Produces an error message: Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist. – Gordon Ballingrud Nov 02 '20 at 19:57
  • I added a part about filtering file types to my response. Please remember to upvote or mark it as correct if this helps you. – Till Nov 02 '20 at 20:12
  • I think it may have worked! I used this code: a <- dir("/Users/Gordon/Desktop/WPSCASES/", pattern="docx", full.names=TRUE) doc <- readtext(a) And now I have an object a, which is described in RStudio as "Large character (4183 elements, 737.1 kb)", and doc, which is 4183 obs. of 2 variables. This is promising! Soon I'll start trying to perform my analyses. Thank you!!! – Gordon Ballingrud Nov 03 '20 at 04:05