When I use strsplit on plain text, it does what I want: a single character string is split into several strings (returned as a list holding a character vector). For example:

txt = "The fox is Brown.\nThe Fox has a tail."
strsplit(txt, "\n")
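
To make the shape of that result explicit, here is a minimal base-R sketch. The variable names parts and lines are only illustrative, and fixed = TRUE is my addition so that "\n" is matched literally rather than as a regular expression.

```r
txt <- "The fox is Brown.\nThe Fox has a tail."

# strsplit() returns a list with one element per input string
parts <- strsplit(txt, "\n", fixed = TRUE)
str(parts)

# unlist() flattens that list into a plain character vector
lines <- unlist(parts)
length(lines)
```

This is why the transformer further down wraps strsplit in unlist: without it the content would be a list rather than a character vector.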

For the actual problem I'm using the text-mining package tm (v0.7-1) in R 3.4.0 on Windows 7.

When I create my corpus and apply the same split through tm's content_transformer function, the corpus itself is broken up into two documents, instead of the content of the single document becoming a character vector.

library(tm)  # version 0.7-1
txt = "The fox is Brown.\nThe Fox has a tail."
docs = Corpus(VectorSource(txt))
to_newline = content_transformer(function (x) unlist(strsplit(x, "\n")))
docs = tm_map(docs, to_newline)
str(docs)

The output from str(docs) in the code above looks like:

List of 2 
 $ 1:List of 2 
  ..$ content: chr "The fox is Brown." 
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-25 15:11:55"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 $ 2:List of 2
  ..$ content: chr "The Fox has a tail."
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-25 15:11:55"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "2"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 - attr(*, "class")= chr [1:2] "SimpleCorpus" "Corpus"

I want it to look like the following, where $ content is a character vector:

List of 1 
 $ 1:List of 2 
  ..$ content: chr [1:2] "The fox is Brown." "The Fox has a tail." 
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-25 15:11:55"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
Aaron

1 Answer

This took a lot of trial and error. I was actually reading my corpus in with DirSource, and all I needed to change was the call that builds the corpus, so that it explicitly uses VCorpus: VCorpus(DirSource(directory_name), ...).

To demonstrate, create a text file containing:

The fox is Brown.
The Fox has a tail.

Save it as test.txt in a folder named test inside your working directory.

Then run:

docs = VCorpus(DirSource("./test"))
str(docs)

Notice that the content is now a character vector, with one element per line of the file.
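
The same change can be tried on the VectorSource example from the question. This is only a sketch under my own assumptions: as far as I can tell, Corpus() in tm 0.7-x returns a SimpleCorpus for a VectorSource (the str output in the question shows that class), and tm_map on a SimpleCorpus appears to apply the function to the whole content vector at once, so every string the function returns becomes its own document. Asking for a VCorpus explicitly should keep the transformation inside each document. fixed = TRUE is my addition.

```r
library(tm)

txt <- "The fox is Brown.\nThe Fox has a tail."

# Request a VCorpus explicitly so each document is stored as its own
# PlainTextDocument instead of as one entry of a shared character vector
docs <- VCorpus(VectorSource(txt))

# Same transformer as in the question, with a literal (non-regex) match
to_newline <- content_transformer(
  function(x) unlist(strsplit(x, "\n", fixed = TRUE))
)
docs <- tm_map(docs, to_newline)

length(docs)        # number of documents in the corpus
content(docs[[1]])  # lines held by the first document
str(docs)
```

If this behaves as I expect, the corpus still holds a single document and its content holds the two lines, which matches the structure asked for in the question.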

Aaron