I am new to R and text mining. I made a word cloud from a Twitter feed related to some term. The problem is that the word cloud shows http:... or htt... How do I deal with this? I tried using the metacharacter *, but I doubt I'm applying it correctly:

tw.text = removeWords(tw.text,c(stopwords("en"),"rt","http\\*"))

Could somebody familiar with text mining please help me with this?

  • You could just apply `gsub` to your original data. Please post a small piece of your data, the text to remove, and the desired output. – SabDeM Jul 29 '15 at 14:31
  • I was trying to fetch some tweets: `head(tweets,10) [1] "@amitkumarpatil2 @bdutt yes.\nhttp://t.co/6v2n4EHeoc" @mihirssharma http://t.co/WHnaJmUNNG" [7] "RT @QLDMackay: Cheap power or clean energy? Modi's $275 billion Indian dilemma http://t.co/YEaaHodO6p ... https://t.co/zfV2XRKwfl" ` So they include URLs to news pages, etc. – Amanpreet Singh Jul 29 '15 at 23:11
  • I tried gsub() too, but it only removes "http:"; the rest of the URL (//xyz.com) still remains. – Amanpreet Singh Jul 29 '15 at 23:25

2 Answers

If you are looking to remove URLs from your string, you may use:

gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)

Where x would be:

x <- c("some text http://idontwantthis.com", 
         "same problem again http://pleaseremoveme.com")

It would be easier to give you a specific answer if you posted a sample of your data, but the following example produces clean text with no URLs:

> clean_x <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
> clean_x
[1] "some text "          "same problem again "
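The pattern above ends its match at the last dot followed by letters, so on shortened links like those quoted in the question's comments it may leave the path after `t.co` behind, and its greedy `.*` can swallow text that sits between two URLs. Below is a minimal sketch of an alternative, assuming a URL runs until the next whitespace character. The names `tweets`, `url_pattern` and `clean_tweets` are illustrative, and the strings are adapted from the tweets quoted in the comments.

```r
# Sample strings adapted from the tweets quoted in the question's comments
tweets <- c("@amitkumarpatil2 @bdutt yes.\nhttp://t.co/6v2n4EHeoc",
            "Cheap power or clean energy? http://t.co/YEaaHodO6p ... https://t.co/zfV2XRKwfl")

# Assumption: a URL ends at the first whitespace character after the scheme
url_pattern <- "(f|ht)tps?://[^[:space:]]+"

# Remove every URL; text before, between and after the URLs is kept
clean_tweets <- gsub(url_pattern, "", tweets)
print(clean_tweets)
```

Because the character class stops at whitespace, the `...` between the two links in the second string survives, while both links are removed in full.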

As a side point, it may be worth looking at existing methods for cleaning text before mining. For example, the clean function discussed here would let you do this automatically. Along similar lines, there are functions that strip tweet markup (#, @), punctuation and other undesirable entries from your text.
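As a rough illustration of that kind of cleaning, here is a minimal base R sketch. `clean_tweet` is a hypothetical helper written for this example, not a function from any package, and the sample string is made up.

```r
# Hypothetical helper (not from any package): strips URLs, @mentions,
# #hashtags and punctuation, then normalises case and whitespace.
clean_tweet <- function(x) {
  x <- gsub("(f|ht)tps?://[^[:space:]]+", " ", x)  # URLs, assumed to end at whitespace
  x <- gsub("[@#][[:alnum:]_]+", " ", x)           # @mentions and #hashtags
  x <- gsub("[[:punct:]]+", " ", x)                # remaining punctuation
  x <- gsub("[[:space:]]+", " ", x)                # collapse runs of whitespace
  trimws(tolower(x))
}

# Made-up sample tweet for illustration
sample_tweet <- "RT @QLDMackay: Cheap power or clean energy? #India http://t.co/YEaaHodO6p"
result <- clean_tweet(sample_tweet)
print(result)
# [1] "rt cheap power or clean energy"
```

The order matters: URLs are removed first, because stripping punctuation earlier would break them into fragments such as `http` and `t co` that would then show up in the word cloud.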

Konrad

Apply the code below to your corpus to replace a string pattern with a space. The pattern can match URLs or any other terms you want to keep out of the word cloud. For example, to remove terms starting with http (including https):

# replace with space
toSpace = content_transformer( function(x, pattern) gsub(pattern," ",x) )

tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "http[^[:space:]]*")

Or pass a pattern such as the one below to remove URLs:

tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
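For completeness, here is a minimal end-to-end sketch of this approach, assuming the tm package is installed. The sample tweets are made up, and the URL pattern is a swapped-in variant that assumes a URL ends at the next whitespace character, so that the path after the domain is removed as well.

```r
library(tm)  # assumes the tm package is installed

# Made-up sample tweets standing in for the real data
tweets <- c("some text http://t.co/6v2n4EHeoc more text",
            "RT same problem https://t.co/zfV2XRKwfl")
tweet_corpus <- VCorpus(VectorSource(tweets))

# Wrap gsub so that tm_map can apply it to each document's content
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Assumption: a URL ends at the first whitespace character
url_pattern <- "(f|ht)tps?://[^[:space:]]+"
tweet_corpus_clean <- tm_map(tweet_corpus, toSpace, url_pattern)

# Collapse the extra spaces left behind by the replacement
tweet_corpus_clean <- tm_map(tweet_corpus_clean, stripWhitespace)

# Pull the cleaned text back out as a character vector for inspection
cleaned <- unname(sapply(tweet_corpus_clean, as.character))
print(cleaned)
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together, and `stripWhitespace` then tidies up the doubled spaces.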

Community