
I am trying to load many email files so that R can learn to classify them as spam or ham. I first created a corpus, but when I tried to build a document-term matrix from it, I received an error. How can I fix it?

setwd("C:/ham_spam/")

library(tm)
library(stringr)

email_corpus <- Corpus(VectorSource(NA))

folders <- c("easy_ham/", "spam_2/")

for(n in 1:2){
  folder <- folders[n]
  for(i in 1:length(list.files(folder))){
    email <- list.files(folder)[i]
    tmp <- readLines(str_c(folder, email))
    tmp <- str_c(tmp, collapse = "")
    tmp_corpus <- Corpus(VectorSource(tmp))
    email_corpus <- c(email_corpus, tmp_corpus)
  }
}

dtm_email <- DocumentTermMatrix(email_corpus)

Here is the error I received:

Error in UseMethod("TermDocumentMatrix", x) : no applicable method for 'TermDocumentMatrix' applied to an object of class "list"

Below is an example element of email_corpus; email_corpus is a list of data frames.

$meta
$language
[1] "en"

attr(,"class")
[1] "CorpusMeta"

$dmeta
data frame with 0 columns and 1 row

$content
[1] "From Steve_Burt@cursor-system.com  Thu Aug 22 12:46:39 2002Return-Path: <Steve_Burt@cursor-system.com>Delivered-To: zzzz@localhost.netnoteinc.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id BE12E43C34\tfor... <truncated>
Ploppy
Lin Ye

2 Answers


Combining two corpora created by Corpus() with c() drops the corpus class and returns a plain list, which is why DocumentTermMatrix() finds no applicable method.

VCorpus objects, on the other hand, keep their VCorpus class when combined with c().

Replace every Corpus() call in your code with VCorpus() and the problem should be solved.
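
To illustrate the point, here is a minimal sketch. The two one-line documents are made-up stand-ins for real emails, and the variable names ham_doc, spam_doc and combined are only for illustration:

library(tm)

# Two tiny placeholder documents (hypothetical text, not real emails)
ham_doc  <- VCorpus(VectorSource("meeting moved to noon tomorrow"))
spam_doc <- VCorpus(VectorSource("claim your free prize today"))

# c() on VCorpus objects should return a VCorpus, not a plain list
combined <- c(ham_doc, spam_doc)
class(combined)

# so DocumentTermMatrix() has a method to dispatch to
dtm <- DocumentTermMatrix(combined)
dim(dtm)

Note that if you keep the Corpus(VectorSource(NA)) placeholder from the question and only rename it to VCorpus, the combined corpus will still contain that extra NA document as its first element.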

AshOfFire

You can try this approach:

Set the working directory to the folder that contains both your ham and spam folders:

setwd('/path/to/dir/that/contains/folders/')

folders <- c("easy_ham/", "spam_2/")

You can then list all files (in this case '.txt' files) under your working directory (the default path in list.files() is '.'):

emails <- list.files(pattern = "\\.txt$", # regular expression; assumes all emails are .txt files
                     recursive = TRUE)    # also list files in subdirectories

library(stringr)
library(tm)

You can then use lapply() to read the files:

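email_txt <- lapply(emails, function(x) {
  tmp <- readLines(x)                  # one element per line of the file
  # join lines with a space so the last word of one line
  # is not glued to the first word of the next
  tmp <- str_c(tmp, collapse = " ")
  return(tmp)
})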

Create a corpus from the read text:

email_corpus <- VCorpus(VectorSource(email_txt))

And finally create a DocumentTermMatrix from that corpus:

dtm_email <- DocumentTermMatrix(email_corpus)
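
If you want to check the whole pipeline before pointing it at the real data, the steps above can be run on toy files. This is only a sketch: the temporary directory, the file names a.txt and b.txt, and their contents are invented for the example, and it uses base paste() instead of str_c() to stay self-contained:

library(tm)

# Hypothetical toy data in a temporary directory
root <- tempfile("mail_")
dir.create(file.path(root, "easy_ham"), recursive = TRUE)
dir.create(file.path(root, "spam_2"))
writeLines(c("Subject: lunch", "see you at noon"),
           file.path(root, "easy_ham", "a.txt"))
writeLines(c("Subject: prize", "claim your prize"),
           file.path(root, "spam_2", "b.txt"))

# List the files in both folders, with full paths so no setwd() is needed
emails <- list.files(root, pattern = "\\.txt$",
                     recursive = TRUE, full.names = TRUE)

# Read each file into a single string
email_txt <- vapply(emails,
                    function(f) paste(readLines(f), collapse = " "),
                    character(1), USE.NAMES = FALSE)

# One corpus built in one step, so no c() is involved
email_corpus <- VCorpus(VectorSource(email_txt))
dtm_email <- DocumentTermMatrix(email_corpus)
dim(dtm_email)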
clemens