txt <- readLines("this.txt")
library(tm)
corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

Sotos

Asma Souzii
3 Answers
4
I think you're asking how to remove words like 'the' and 'this' using the tm
library? If so, try this:
corpus <- tm_map(corpus, removeWords, stopwords("english"))
To remove specific words:
corpus <- tm_map(corpus, removeWords, c("hello","is","it","me","you're","looking","for?"))
Edit: I created an example using War and Peace, which works. Try converting your terms to lower case before creating a document-term matrix. Like so:
library(tm)
# load
txt <- readLines("this.txt")
corpus <- Corpus(VectorSource(txt))
# clean
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# wrap base functions in content_transformer() so the corpus structure is kept
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# create dtm and get terms
dtm <- DocumentTermMatrix(corpus)
dtm$dimnames$Terms
Change the code to fit your text file and the output should be similar to this:
dtm$dimnames$Terms
[1] "almost" "anonymous" "anyone" "anywhere" "author" "away"
[7] "aylmer" "book" "chapter" "contents" "copy" "cost"
[13] "date" "david" "ebook" "english" "give" "gutenberg"
[19] "iii" "included" "january" "language" "last" "leo"
[25] "license" "louise" "march" "maude" "may" "one"
[31] "online" "peace" "posting" "project" "restrictions" "reuse"
[37] "start" "terms" "title" "tolstoy" "tolstoytolstoi" "translators"
[43] "updated" "use" "vii" "volunteer" "war" "whatsoever"
[49] "widger" "wwwgutenbergorg"
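If the frequency table `d` from the question has already been built, another option is to filter its rows directly instead of cleaning the corpus again. The following is a minimal sketch in base R, assuming `d` keeps the terms as row names (as in the question's code). The toy frequencies and the short `stop_words` vector are made up for illustration; `tm::stopwords("english")` could be substituted for the hand-written list.

```r
# Toy frequency table shaped like `d` from the question: terms are the row names.
freq <- c(the = 12, war = 7, and = 6, peace = 5, "in" = 4, tolstoy = 2)
d <- data.frame(freq = sort(freq, decreasing = TRUE))

# Hypothetical stop-word list, used here only for illustration.
stop_words <- c("the", "and", "in")

# Keep the rows whose term is not a stop word.
# drop = FALSE stops R from collapsing the one-column data frame to a vector.
d_clean <- d[!(tolower(rownames(d)) %in% stop_words), , drop = FALSE]
print(d_clean)
```

Lowercasing the row names before matching means `The` and `the` are treated the same way.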

Oliver Frost
- I already tried this, but it doesn't work. I need to delete words like "the", "in", "and", etc. from d. – Asma Souzii Apr 25 '16 at 08:38
- I understand what you need, but be more specific about your data: What words are remaining? What language is your text in? Are the remaining words in upper case or lower case? If you have words like `The` and not `the` then you can try converting them to lower case. See my edit above. – Oliver Frost Apr 25 '16 at 09:40
1
Do you know what regular expressions are? You can read about the R function gsub. Here's a small example of how it works:
> let <- c("A", "B", "A", "C") # My vector of letters
> let
[1] "A" "B" "A" "C"
> # I want to delete "A", so I replace it with nothing ("")
> l <- gsub("A", "", let) # replace "A" with "" in the vector let
> l
[1] "" "B" "" "C"
All you have to do now is delete empty elements if there are any.
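Dropping the empty elements could look like the sketch below. It uses base R's `nzchar()`, which is just one way to do it; comparing against `""` works as well.

```r
let <- c("A", "B", "A", "C")
l <- gsub("A", "", let)   # "A" becomes the empty string

# nzchar() is TRUE for non-empty strings, so this keeps only those elements.
l_clean <- l[nzchar(l)]
print(l_clean)
```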
And if you have a single character string instead of a vector, gsub works on that too:
> let <- " a b c d g h a a a"
> let
[1] " a b c d g h a a a"
> l <- gsub("a", "", let)
> l
[1] " b c d g h "
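The same idea carries over to a character matrix without converting it to a list first. Here is a small sketch with made-up data; assigning into `m_clean[]` is a defensive choice that guarantees the result keeps the matrix dimensions.

```r
# Toy 2 x 2 character matrix (filled column by column).
m <- matrix(c("the cat", "a dog", "the end", "bird"), nrow = 2)

m_clean <- m
# fixed = TRUE treats "the " as a literal string rather than a regular expression.
m_clean[] <- gsub("the ", "", m, fixed = TRUE)
print(m_clean)
```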

neringab
- Thanks, but what if I have a matrix? Should I convert the matrix to a list? How can I do that? – Asma Souzii Apr 25 '16 at 08:50
- gsub works on matrix elements too, so there is no need to convert the matrix to a list. In my opinion, though, the better solution for you is to use the examples written by Kipras or Oliver. I only know a little about the 'tm' package, so I can't help you understand it further. – neringab Apr 25 '16 at 09:57
0
It is hard to tell what your data looks like, but you can try gsub, which is a simple find-and-replace function.
gsub("The", "", "HelloThe")
Which gives you
"Hello"

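One caveat with a plain pattern such as "The" is that it also matches inside longer words. If only whole words should be removed, a word-boundary pattern is one option. The sketch below uses made-up input and assumes case-insensitive matching is wanted.

```r
txt <- c("The theory of the thing", "HelloThe")

# \\b marks a word boundary, so "the" inside "theory" or "HelloThe" is left alone.
out <- gsub("\\bthe\\b", "", txt, ignore.case = TRUE)

# Collapse the whitespace left behind by the removed words and trim the ends.
out <- trimws(gsub("\\s+", " ", out))
print(out)
```

With this pattern the second element is returned unchanged, because "The" is not a separate word there.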
Kipras Kančys
- I have a matrix d and I need to remove some words like "the", "and", etc. The line corpus <- tm_map(txt, removeWords, stopwords("english")) doesn't work :'( – Asma Souzii Apr 25 '16 at 08:46