
I am trying to read my documents into R. Everything loads, but I receive 36 warning messages such as:

"18: In readLines(y, encoding = x$Encoding) : incomplete final line found on 'C:/text_data/2006DefenseWhitePaper.docx'"

Additionally, when I inspect my corpus it looks like this:

$`1998DefenseWhitePaper.docx`
PK
l"%3÷Þ3VƃÑÚšl  µw%ë=–“^i7+Ù×ä-d&á”0ÞAÉ6€l4¼½L60#µÃ’ÍS
Oœ£œƒXø

For some reason the documents appear to be encoded.

Is this a formatting issue, or are the online sources I get the documents from encrypted?

tg6784
  • For one of those you can set `warn = FALSE` on the call to `readLines()`. Without the text we have no idea what else is going on. Any answer is more speculation than anything else. You could try using `stringi::stri_read_lines()` to see if that helps with encoding detection. – hrbrmstr Apr 23 '16 at 03:00
  • I am just trying to read in simple text from a Word document – tg6784 Apr 23 '16 at 03:15
  • If the Word document does not contain sensitive data, can you provide a link to it? Generally, if you only need the text, you should export it from Word as plain text. If there are tables in there that you need, my [`docxtractr`](https://cran.rstudio.com/web/packages/docxtractr/) package can really help. If you need the layout for some reason, exporting to PDF and then reading it in with other R packages might be your solution. `readLines()` and `.docx` files aren't, sadly, going to work together. – hrbrmstr Apr 23 '16 at 12:20

1 Answer


You are encountering a problem similar to the one described in the question "read an MSWord file into R".

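The reason for the warning is the same as the one described in the answer given there by @neilfws. A .docx file is a ZIP archive of XML files rather than plain text, so `readLines()` returns raw archive bytes; the leading "PK" in your output is the ZIP signature. The files are compressed, not encrypted.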
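One way to check that a file is a ZIP container rather than plain text is to look at its first four bytes. The following is a minimal base R sketch; the function name `looks_like_zip` and the two throwaway files are hypothetical stand-ins, not the OP's data.

```r
# Return TRUE when the file starts with the ZIP local-file-header signature
# (bytes 50 4B 03 04, i.e. "PK" followed by 0x03 0x04). A non-empty ZIP
# archive, which includes every .docx file, begins this way.
looks_like_zip <- function(path) {
  sig <- readBin(path, what = "raw", n = 4L)
  identical(sig, as.raw(c(0x50, 0x4B, 0x03, 0x04)))
}

# Hypothetical demo files: one that mimics the start of a .docx, one plain text.
zip_like <- tempfile(fileext = ".docx")
writeBin(as.raw(c(0x50, 0x4B, 0x03, 0x04, 0x14, 0x00)), zip_like)

plain <- tempfile(fileext = ".txt")
writeLines("Defense white paper text", plain)

print(looks_like_zip(zip_like))
print(looks_like_zip(plain))
```

Running `looks_like_zip()` over the files in the corpus directory would show which ones need a .docx reader instead of `readLines()`.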

Solution: The qdap package has a function, `read.transcript()`, that can read .docx files and may be handy for this task.
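If a dedicated reader does not fit the files, the underlying idea can be sketched by hand: unzip the archive and pull the paragraph text out of `word/document.xml`. This is only a rough illustration, assuming the xml2 package is installed; the function names `paragraphs_from_xml` and `read_docx_text` are hypothetical, and the sketch ignores tables, headers, and footnotes.

```r
library(xml2)

# Return the text of every paragraph (<w:p>) in a WordprocessingML file.
# The "w" prefix is resolved from the namespaces declared in the document.
paragraphs_from_xml <- function(xml_file) {
  doc <- read_xml(xml_file)
  paras <- xml_find_all(doc, "//w:p")
  xml_text(paras)
}

# Extract word/document.xml from a .docx into a temporary directory and
# return its paragraphs as a character vector.
read_docx_text <- function(docx_path) {
  out_dir <- tempfile("docx_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE))
  unzip(docx_path, files = "word/document.xml", exdir = out_dir)
  paragraphs_from_xml(file.path(out_dir, "word", "document.xml"))
}

# Hypothetical demo: a tiny hand-written document.xml with two paragraphs.
sample_xml <- tempfile(fileext = ".xml")
writeLines(paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body>',
  '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>',
  '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>',
  '</w:body></w:document>'
), sample_xml)

sample_paragraphs <- paragraphs_from_xml(sample_xml)
print(sample_paragraphs)
```

On a real file, something like `read_docx_text("C:/text_data/2006DefenseWhitePaper.docx")` would return one string per paragraph, which can then be passed to the corpus constructor in place of the `readLines()` output.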

Kunal Puri
  • That is a pretty task-specific function. If the OP's file is not in the format that Tyler describes - https://github.com/trinker/qdap/wiki/Reading-.docx-%5BMS-Word%5D-Transcripts-into-R - it won't work. – hrbrmstr Apr 23 '16 at 12:20
  • @hrbrmstr Agreed. I had forgotten about your docxtractr package. Nice. I've also been working on a package called textreadr that brings several document-reading packages together under one roof and may be of use: https://github.com/trinker/textreadr The `read_doc` function is a bit OS dependent, but I haven't found a good portable solution. `read_docx` may also be useful here, though it uses an approach similar to docxtractr's. – Tyler Rinker Apr 24 '16 at 03:43