R: remove specific words from a text, such as "the" and "this"

Question

txt <- readLines("this.txt")
library(tm)
corpus <- Corpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE))

Oliver Frost · Accepted Answer · 2016-04-25T10:40:07.527

I think you're asking how to remove words like 'the' and 'this' using the tm library? If so, try this:

corpus <- tm_map(corpus, removeWords, stopwords("english"))

To remove specific words:

corpus <- tm_map(corpus, removeWords, c("hello", "is", "it", "me", "you're", "looking", "for"))
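
One thing to keep in mind is that removeWords is case-sensitive. Here is a minimal sketch, assuming the tm package is installed; the sentence and the word list are made up for illustration and are not from your file:

```r
library(tm)

# Made-up sample text (an assumption, not the asker's data)
x <- "The cat and the dog"

# removeWords also accepts a plain character vector.
# Only the lower-case "the" matches here; the capitalised "The" stays.
only_lower <- removeWords(x, c("the", "and"))

# Lower-casing first lets the same word list catch both spellings.
both <- removeWords(tolower(x), c("the", "and"))

print(only_lower)
print(both)
```

The removed words leave extra spaces behind; tm's stripWhitespace can collapse them if that matters.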

Edit: I created an example using War and Peace, which works. Try converting your terms to lower case before creating a document-term matrix. Like so:

library(tm)

# load
txt <- readLines("this.txt")
corpus <- Corpus(VectorSource(txt))

# clean
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# wrap base functions such as tolower in content_transformer() so the corpus stays valid
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# create dtm and get terms
dtm <- DocumentTermMatrix(corpus)
dtm$dimnames$Terms

Change the code to fit your text file and the output should look similar to this:

dtm$dimnames$Terms
 [1] "almost"          "anonymous"       "anyone"          "anywhere"        "author"          "away"           
 [7] "aylmer"          "book"            "chapter"         "contents"        "copy"            "cost"           
[13] "date"            "david"           "ebook"           "english"         "give"            "gutenberg"      
[19] "iii"             "included"        "january"         "language"        "last"            "leo"            
[25] "license"         "louise"          "march"           "maude"           "may"             "one"            
[31] "online"          "peace"           "posting"         "project"         "restrictions"    "reuse"          
[37] "start"           "terms"           "title"           "tolstoy"         "tolstoytolstoi"  "translators"    
[43] "updated"         "use"             "vii"             "volunteer"       "war"             "whatsoever"     
[49] "widger"          "wwwgutenbergorg"
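
If the stop words have already ended up in a frequency table like the d in the question, another option is to filter them out by name after the fact. This is only a sketch, assuming the tm package is installed; the two sample sentences are made up:

```r
library(tm)

# Made-up sample documents (an assumption, not the asker's file)
txt <- c("The cat and the dog", "This is the end")

corpus <- VCorpus(VectorSource(txt))
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

# Term frequencies, as in the question
freq <- sort(rowSums(m), decreasing = TRUE)

# Drop every term that appears in tm's English stop word list
freq <- freq[!names(freq) %in% stopwords("english")]

d <- data.frame(freq = freq)
print(d)
```

To remove your own words instead, swap stopwords("english") for a character vector such as c("the", "this").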

I already tried this, but it doesn't work... I need to delete words like "the", "in", "and", etc. from d – Asma Souzii, Apr 25 '16 at 08:38
I understand what you need, but be more specific about your data: What words are remaining? What language is your text in? Are the remaining words in upper case or lower case? If you have words like `The` and not `the`, then you can try converting them to lower case. See my edit above. – Oliver Frost, Apr 25 '16 at 09:40

score 1 · Answer 2 · answered Apr 24 '16 at 14:43

Do you know what regular expressions are? You can read about the R function gsub. Here's a small example of how it works:

> let <- c("A", "B", "A", "C") # My vector of letters
> let
[1] "A" "B" "A" "C"
> # I want to delete "A", so I replace this letter with nothing ("")
> l <- gsub("A", "", let) # replace "A" with "" in the vector let
> l
[1] ""  "B" ""  "C"

All you have to do now is delete empty elements if there are any.
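
For instance, the empty strings can be dropped with nzchar. This is a small base R sketch that reuses the vector from the example above:

```r
let <- c("A", "B", "A", "C")

# Replace "A" with the empty string
l <- gsub("A", "", let)

# Keep only the elements that still contain at least one character
l <- l[nzchar(l)]

print(l)
```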

And if you have a single character string, gsub works as well:

> let <- " a b c d g h a a a"
> let
[1] " a b c d g h a a a"
> l <- gsub("a", "", let)
> l
[1] "  b c d g h   "
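
Note that a plain pattern matches anywhere, including inside longer words. To remove whole words only, one option is to put word boundaries (\\b) around them. This is a base R sketch with a made-up sentence:

```r
# Made-up sample sentence (an assumption, not the asker's data)
x <- "the cat saw this and then left the room"

# \\b marks a word boundary, so "then" is left alone
no_words <- gsub("\\b(the|this)\\b", "", x, perl = TRUE)

# Collapse the leftover runs of spaces and trim both ends
tidy <- trimws(gsub("\\s+", " ", no_words))

print(tidy)
```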

neringab

Thanks... but what if I have a matrix? Should I convert the matrix to a list? How can I do that? – Asma Souzii Apr 25 '16 at 08:50
gsub works on matrix elements too, so there is no need to convert the matrix to a list. But in my opinion, the better solution for you is to use the examples written by Kipras or Oliver. I only know a little about the 'tm' package, so I can't help you with it any further. – neringab Apr 25 '16 at 09:57
OK, thank you so much – Asma Souzii Apr 25 '16 at 11:39

score 0 · Answer 3 · answered Apr 24 '16 at 14:40

It is hard to tell what your data looks like, but you can try gsub, which is a simple find-and-replace function.

gsub("The", "", "HelloThe")

Which gives you

"Hello"

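Keep in mind that gsub is case-sensitive by default, so the pattern has to match the capitalisation in your text. A short base R sketch using the same made-up string:

```r
# The lower-case pattern does not match "The", so nothing is removed
unchanged <- gsub("the", "", "HelloThe")

# ignore.case = TRUE matches regardless of capitalisation
removed <- gsub("the", "", "HelloThe", ignore.case = TRUE)

print(unchanged)
print(removed)
```
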

Kipras Kančys

I'm sorry, but R makes my life miserable :'( :p – Asma Souzii Apr 25 '16 at 08:43
I have a matrix d and I need to remove some words like "the", "and", etc. The line corpus <- tm_map(txt, removeWords, stopwords("english")) doesn't work :'( – Asma Souzii Apr 25 '16 at 08:46
