
I want to remove all stopwords from my text using R. The list of stopwords I want to remove is at http://www.ranks.nl/stopwords, under the section titled "Long Stopword List" (a very long version of the list). I'm using the tm package. Can anyone help me, please? Thanks!

iGada
  • Does this answer your question? [delete stop words in R](https://stackoverflow.com/questions/44133509/delete-stop-words-in-r) – rajah9 Sep 12 '20 at 14:54
  • Please also take a look at the R documentation: https://www.rdocumentation.org/packages/qdap/versions/0.2.5/topics/stopwords . – rajah9 Sep 12 '20 at 14:54
  • The default English stopwords can be easily removed using `tm_map(text, removeWords, stopwords("en"))`. My problem is how to use the full list of stopwords from the specified link. – iGada Sep 12 '20 at 15:01
  • Is your question on how to retrieve the lists of stop words or how to use `tm_map(text, removeWords, "any vector goes here" )`? – Dave2e Sep 12 '20 at 15:16
  • Sure! My question is how to directly access those lists. Is that possible? – iGada Sep 12 '20 at 15:18
  • Not really. I want to consider the most exhaustive list, however. – iGada Sep 12 '20 at 15:27
  • I already decided and indicated. Have you checked the link? In my case, words listed under the "Long Stopword List" are my focus. – iGada Sep 12 '20 at 15:31

1 Answer


You can copy that list (after you select it in your browser) and then paste it into this expression in R:

LONGSWS <- " <paste into this position> "

Place the cursor of your editor or IDE console between the two quotes before pasting. Then do this:

sw.vec <- scan(text=LONGSWS, what="")
#Read 474 items
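
As a small illustration of that step, here is a minimal sketch in which a made-up five-word string (`demo_text`, a hypothetical stand-in for the pasted list) is split into a character vector. The `quote = ""` argument is an optional precaution, added here on the assumption that the pasted list may contain apostrophes, so that they are treated as ordinary characters:

```r
# Hypothetical stand-in for the pasted stopword list (not the real list)
demo_text <- "a about above
aren't after"

# what = "" asks scan() for character data; quote = "" disables quote handling;
# quiet = TRUE suppresses the "Read N items" message
demo_vec <- scan(text = demo_text, what = "", quote = "", quiet = TRUE)

length(demo_vec)
demo_vec
```

Because the default separator is any whitespace, it does not matter whether the words arrive separated by spaces, tabs, or line breaks.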

The scan function infers the type of input from an example value passed to its what argument; for character data an empty string "" is sufficient. Then you should be able to apply the code you offered in your comment:

 tm_map(text, removeWords, sw.vec)

You have not supplied an example text object. Passing a plain character vector does not work:

 tm_map("test of my text", removeWords, sw.vec )
#Error in UseMethod("tm_map", x) : 
#  no applicable method for 'tm_map' applied to an object of class "character"

So we have to assume you have an object of a suitable class (a corpus) to pass as the first argument to tm_map. Using the example from the ?tm_map help page:

> library(tm)
> data("crude")   # example corpus shipped with tm
> res <- tm_map(crude, removeWords, sw.vec )
> str(res)
List of 20
 $ 127:List of 2
  ..$ content: chr "Diamond Shamrock Corp said \neffective today   cut  contract prices  crude oil \n1.50 dlrs  barrel.\n    The re"| __truncated__
  ..$ meta   :List of 15
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "1987-02-26 17:00:56"
  .. ..$ description  : chr ""
  .. ..$ heading      : chr "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
  .. ..$ id           : chr "127"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr "Reuters-21578 XML"
  .. ..$ topics       : chr "YES"
  .. ..$ lewissplit   : chr "TRAIN"
  .. ..$ cgisplit     : chr "TRAINING-SET"
   # ----------------snipped remainder of long output.
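
If you only have a plain character vector, a minimal sketch of getting it into a form tm_map accepts is to wrap it in a corpus first. The names `my_text` and `demo_sw` below are hypothetical stand-ins for your own text and for sw.vec, and the extra stripWhitespace step is my addition, not something the question asked for:

```r
library(tm)

# Hypothetical stand-ins: a one-document text and a tiny stopword vector
my_text <- "test of my text"
demo_sw <- c("of", "my")

# tm_map() has no method for character vectors, so build a corpus first
corp <- VCorpus(VectorSource(my_text))

# Remove the stopwords, then collapse the runs of spaces they leave behind
corp <- tm_map(corp, removeWords, demo_sw)
corp <- tm_map(corp, stripWhitespace)

# Inspect the cleaned text of the first (and only) document
content(corp[[1]])
```

Note that removeWords matches case-sensitively, so it is common to lowercase the corpus beforehand, for example with tm_map(corp, content_transformer(tolower)).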
IRTFM