Defining own stopwords by their beginning

Asked Apr 30 '18 at 15:28

Active May 29 '18 at 01:02

Viewed 26 times

I'm looking for code that lets me remove custom stopwords from my text corpus by specifying only how they begin.

Example: my corpus consists of newspaper articles that also contain "https..." internet links, which I do not need for my topic modeling.

I now want to delete all "words" that begin with "https...".

Is there any way I can do this?

I am using the tm package for text transformations and have so far also used some custom stopwords.

CODE

nzz <- SimpleCorpus(DirSource("private"), control = list(language="de"))

nzz <- tm_map(nzz, removePunctuation)
nzz <- tm_map(nzz, removeNumbers)
nzz <- tm_map(nzz, stripWhitespace)
myStopwords <- c("beispiel","bemerkbar","docs","par",
                 "ipar","neue","zuercher","zeitung","http")

nzz <- tm_map(nzz, removeWords, c(stopwords("german"), myStopwords))

edited Jun 20 '20 at 09:12

Community

asked Apr 30 '18 at 15:28

Alessio Levis

No need for heavy machinery. Have you tried filtering with regular expressions before you feed the data into further transformations? – sophros Apr 30 '18 at 15:30
Can you provide some sample code? – sɐunıɔןɐqɐp Apr 30 '18 at 15:40
I added some code – Alessio Levis May 01 '18 at 06:40
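
The regular-expression filtering suggested in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions: `remove_by_prefix` is a hypothetical helper name (not a tm function), the sample text is made up, and the prefixes are assumed to be plain letters, so they are not escaped for the regex.

```r
# Hypothetical helper (not part of tm): drop every token that starts with
# one of the given prefixes. Prefixes are assumed to be plain letters,
# so they are not escaped before being put into the regex.
remove_by_prefix <- function(x, prefixes) {
  # \\b anchors the match at the start of a word;
  # \\S* consumes the rest of the token up to the next whitespace
  pattern <- paste0("\\b(?:", paste(prefixes, collapse = "|"), ")\\S*")
  gsub(pattern, "", x, perl = TRUE, ignore.case = TRUE)
}

# Made-up sample text for illustration
txt <- "Mehr dazu unter https://www.nzz.ch/beispiel im Artikel"
cleaned <- remove_by_prefix(txt, "https")

# Possible use in the tm pipeline from the question (requires the tm package):
# nzz <- tm_map(nzz, content_transformer(remove_by_prefix), prefixes = "https")
```

Running this step before removePunctuation is probably the safer order, because the links are then still intact tokens; stripWhitespace afterwards should clean up the gaps that the removed tokens leave behind.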

0 Answers