txt <- readLines("this.txt")

library(tm)

corpus <- Corpus(VectorSource(txt))

corpus <- tm_map(corpus, removePunctuation)

tdm <- TermDocumentMatrix(corpus)

m <- as.matrix(tdm)

d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))
Sotos
Asma Souzii

3 Answers


I think you're asking how to remove words like 'the' and 'this' using the tm library? If so, try this:

corpus <- tm_map(corpus, removeWords, stopwords("english"))

To remove specific words:

corpus <- tm_map(corpus, removeWords, c("hello","is","it","me","you're","looking","for?"))
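If the term matrix has already been built, as in the question, another option might be to drop the unwanted terms from the frequency table afterwards. Here is a minimal sketch in base R. The small matrix `m` is a made-up stand-in for `as.matrix(tdm)`, and `stop_words` is a hypothetical hand-written list; neither comes from the original data.

```r
# Toy stand-in for as.matrix(tdm): rows are terms, columns are documents
m <- matrix(c(5, 3, 2, 1,
              4, 0, 1, 5),
            nrow = 4,
            dimnames = list(c("the", "war", "in", "peace"), c("doc1", "doc2")))

# Same frequency table as in the question
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

# Hypothetical stop-word list; with tm loaded, stopwords("english") could be used instead
stop_words <- c("the", "in", "and")

# Keep only the rows whose term is not a stop word
d_clean <- d[!(rownames(d) %in% stop_words), , drop = FALSE]
d_clean
```

The `drop = FALSE` keeps `d_clean` a data frame even though it has a single column.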

Edit: I created an example using War and Peace, which works. Try converting your terms to lower case before creating a document-term matrix. Like so:

library(tm)

# load
txt <- readLines("this.txt")
corpus <- Corpus(VectorSource(txt))

# clean
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# wrap base functions such as tolower in content_transformer() so the result stays a tm document
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# create dtm and get terms
dtm <- DocumentTermMatrix(corpus)
dtm$dimnames$Terms

Change the code to fit your text file and the output should look similar to this:

dtm$dimnames$Terms
 [1] "almost"          "anonymous"       "anyone"          "anywhere"        "author"          "away"           
 [7] "aylmer"          "book"            "chapter"         "contents"        "copy"            "cost"           
[13] "date"            "david"           "ebook"           "english"         "give"            "gutenberg"      
[19] "iii"             "included"        "january"         "language"        "last"            "leo"            
[25] "license"         "louise"          "march"           "maude"           "may"             "one"            
[31] "online"          "peace"           "posting"         "project"         "restrictions"    "reuse"          
[37] "start"           "terms"           "title"           "tolstoy"         "tolstoytolstoi"  "translators"    
[43] "updated"         "use"             "vii"             "volunteer"       "war"             "whatsoever"     
[49] "widger"          "wwwgutenbergorg"
Oliver Frost
  • I already tried this, but it doesn't work... I need to delete words like "the", "in", "and", etc. from d – Asma Souzii Apr 25 '16 at 08:38
  • I understand what you need, but be more specific about your data: What words are remaining? What language is your text in? Are the remaining words in upper case or lower case? If you have words like `The` and not `the` then you can try converting them to lower case. See my edit above. – Oliver Frost Apr 25 '16 at 09:40

Do you know what regular expressions are? You can read up on the R function gsub. Here's a little example of how it works:

> let <- c("A", "B", "A", "C") # My vector of letters
> let
[1] "A" "B" "A" "C"
> # I want to delete "A", so I will replace this letter with nothing ("")
> l <- gsub("A", "", let) # replace "A" with "" in vector let
> l
[1] ""  "B" ""  "C"

All you have to do now is delete empty elements if there are any.
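One possible way to drop those empty elements is plain logical subsetting. This is only a sketch that continues the `let` example above:

```r
let <- c("A", "B", "A", "C")
l <- gsub("A", "", let)

# Keep only the elements that are not the empty string
l <- l[l != ""]
l
```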

And if you have just a single string, gsub works as well:

> let <- " a b c d g h a a a"
> let
[1] " a b c d g h a a a"
> l <- gsub("a", "", let)
> l
[1] "  b c d g h   "
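Note that a bare pattern like "a" also matches inside longer words. For whole words such as "the" or "in", a word-boundary pattern may be closer to what is wanted. A small sketch, where the sample sentence `txt` is made up for illustration:

```r
txt <- "the cat sat in the theatre"

# \\b marks a word boundary, so the "the" inside "theatre" is left alone
out <- gsub("\\b(the|in)\\b", "", txt, perl = TRUE)

# Collapse the leftover runs of spaces and trim both ends
out <- trimws(gsub(" +", " ", out))
out
```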
neringab
  • Thanks... but what if I have a matrix? Should I convert the matrix to a list? How can I do that? – Asma Souzii Apr 25 '16 at 08:50
  • gsub works on matrix elements too. No need to convert the matrix to a list. But for you, in my opinion, the better solution is to use the examples written by Kipras or Oliver. I know only a little about the 'tm' package, so I can't help you understand it better. – neringab Apr 25 '16 at 09:57
  • OK, thank you so much – Asma Souzii Apr 25 '16 at 11:39

It is hard to tell what your data looks like, but you can try gsub, which is a simple find-and-replace function.

gsub("The", "", "HelloThe")

Which gives you

"Hello"
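Keep in mind that the pattern is case-sensitive by default, so "The" and "the" are treated as different strings. A short sketch of the difference, using a made-up vector `x`:

```r
x <- c("The end", "the start", "THE middle")

# Case-sensitive by default: only the first element changes
strict <- gsub("The ", "", x)

# ignore.case = TRUE removes every capitalisation of the word
loose <- gsub("the ", "", x, ignore.case = TRUE)

strict
loose
```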
Kipras Kančys