
I am following the tutorials of Machine Learning for Hackers (https://github.com/johnmyleswhite/ML_for_Hackers) and I am using Sublime Text as a text editor. To run my code, I use SublimeREPL R.

I am using this code, taken directly from the book:

setwd("/path/to/folder")
# Load the text mining package
library(tm)
library(ggplot2)

# Loading all necessary paths
spam.path <- "data/spam/"
spam2.path <- "data/spam_2/"
easyham.path <- "data/easy_ham/"
easyham.path2 <- "data/easy_ham_2/"
hardham.path <- "data/hard_ham/"
hardham2.path <- "data/hard_ham_2/"

# Get the content of each email
get.msg <- function(path) {
    con     <- file(path, open = "rt", encoding = "latin1")
    text    <- readLines(con)
    msg     <- text[seq(which(text == "")[1] + 1, length(text),1)]
    close(con)

    return(paste(msg, collapse = "\n"))
}

# Create a vector where each element is an email
spam.docs   <- dir(spam.path)
spam.docs   <- spam.docs[which(spam.docs != "cmds")]
all.spam    <- sapply(spam.docs, function(p) get.msg(paste(spam.path, p, sep = "")))

# Log the spam
head(all.spam)

This code works fine in RStudio (with the data provided here: https://github.com/johnmyleswhite/ML_for_Hackers/tree/master/03-Classification), but when I run it in Sublime Text, I get the following error message:

> all.spam <- sapply(spam.docs,
+                    function(p) get.msg(file.path(spam.path, p)))
Error in seq.default(which(text == "")[1] + 1, length(text), 1) : 
  'from' cannot be NA, NaN or infinite
In addition: Warning messages:
1: In readLines(con) :
  invalid input found on input connection 'data/spam/00006.5ab5620d3d7c6c0db76234556a16f6c1'
2: In readLines(con) :
  invalid input found on input connection 'data/spam/00009.027bf6e0b0c4ab34db3ce0ea4bf2edab'
3: In readLines(con) :
  invalid input found on input connection 'data/spam/00031.a78bb452b3a7376202b5e62a81530449'
4: In readLines(con) :
  incomplete final line found on 'data/spam/00031.a78bb452b3a7376202b5e62a81530449'
5: In readLines(con) :
  invalid input found on input connection 'data/spam/00035.7ce3307b56dd90453027a6630179282e'
6: In readLines(con) :
  incomplete final line found on 'data/spam/00035.7ce3307b56dd90453027a6630179282e'
> 

I get the same results when I take the code from John Myles White's repo.

How can I fix this?

Thanks

Spearfisher
Try using the full paths of your spam/ham and see if you get the same `seq` error. – AGS Jun 04 '14 at 10:09

1 Answer


I think the problem is the `encoding = "latin1"` argument in the `file()` call inside `get.msg`. You can simply remove it; I tested this in my environment and it ran fine.

spam.docs <- paste(spam.path, spam.docs, sep = "")

all.spam <- sapply(spam.docs, get.msg)
Warning message:
In readLines(con) :
  incomplete final line found on 'XXXXXXXXXXXXXXXXX/ML_for_Hackers-master/03-Classification/data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0'

There are still some warnings, but it produces the expected results.
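For reference, here is a minimal sketch of how `get.msg` might look with the `encoding` argument removed, plus a guard for the case that triggers the `'from' cannot be NA` error. That error appears when `which(text == "")[1]` is `NA`, meaning no blank line separating header and body was found in what `readLines` returned. Returning `NA` for such files is my own choice, not something from the book, and the temporary files below are made-up test data. One plausible reason RStudio and SublimeREPL behave differently is that the two R sessions start with different locales; that is only a guess, and comparing `Sys.getlocale()` in both would confirm or rule it out.

```r
# Sketch only: get.msg without the connection-level encoding, and with a
# guard for files that have no blank line between header and body.
get.msg <- function(path) {
    con <- file(path, open = "rt")
    on.exit(close(con))

    # warn = FALSE silences the "incomplete final line" warning
    text <- readLines(con, warn = FALSE)

    first.blank <- which(text == "")[1]

    # No separator found, or nothing after it: return NA (an assumption of
    # this sketch) instead of letting seq() fail
    if (is.na(first.blank) || first.blank >= length(text)) {
        return(NA_character_)
    }

    msg <- text[seq(first.blank + 1, length(text), 1)]
    paste(msg, collapse = "\n")
}

# Self-contained check using two temporary files with hypothetical content
ok.file <- tempfile()
writeLines(c("Subject: test", "", "Hello", "World"), ok.file)

bad.file <- tempfile()
writeLines("Subject: no body separator", bad.file)

ok.result  <- get.msg(ok.file)   # body lines joined with "\n"
bad.result <- get.msg(bad.file)  # NA, because there is no blank line
```

With this version, `sapply(spam.docs, get.msg)` should no longer stop at the first problematic file; any `NA` entries in the result point to the emails worth inspecting by hand.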

Thanks.