
I am mining Twitter data, and one problem I run into while cleaning the text is that I cannot remove or separate conjoined words, which usually come from hashtags. After removing special characters and symbols such as '#', I am left with strings that make no sense. For instance:

1) Meaningless words: I have words such as 'spillwayjfleck' and 'bowhunterva', which make no sense and need to be removed from my corpus. Is there a function in R that can do this?

2) Conjoined words: I need a method to split joined words such as 'flashfloodwarn' into 'flash', 'flood', 'warn' in my corpus.
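
To make the two requirements concrete, here is a minimal base R sketch of one possible approach, not a definitive solution: a dictionary-driven split using dynamic programming. The word list `dict` and the function name `segment_word` are hypothetical and introduced only for illustration; a real run would need a full English word list in place of the toy one. A word that can be fully split into dictionary entries is treated as conjoined, and a word with no such split is treated as meaningless and dropped.

```r
# Hypothetical toy dictionary; in practice load a full English word list.
dict <- c("flash", "flood", "warn", "water", "dam", "spill", "way", "spillway")

# Split `word` into dictionary words using dynamic programming.
# Returns a character vector of pieces, or NULL if no complete split exists.
segment_word <- function(word, dict) {
  n <- nchar(word)
  if (n == 0) return(NULL)
  # best[[i + 1]] holds one split of the first i characters (NULL if none)
  best <- vector("list", n + 1)
  best[[1]] <- character(0)
  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      prefix <- best[[start]]
      piece <- substr(word, start, end)
      if (!is.null(prefix) && piece %in% dict) {
        best[[end + 1]] <- c(prefix, piece)
        break
      }
    }
  }
  best[[n + 1]]
}

words <- c("flashfloodwarn", "spillwayjfleck", "bowhunterva")
splits <- lapply(words, segment_word, dict = dict)
names(splits) <- words

# Keep only the words that could be split; the rest count as meaningless.
keep <- !vapply(splits, is.null, logical(1))
cleaned <- unlist(splits[keep], use.names = FALSE)
print(cleaned)
```

With this toy dictionary, 'flashfloodwarn' comes back as three pieces, while 'spillwayjfleck' and 'bowhunterva' return NULL and are filtered out. The result depends entirely on the dictionary: very short entries such as single letters would make almost any string splittable, so the word list may need pruning.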

Any help would be appreciated.

Sisir
  • You can remove words with tm_map(corpus_train, removeWords, c("thewordsyouwannaremov")) – Nico Coallier Apr 06 '17 at 02:35
  • See http://stackoverflow.com/questions/30983495/how-to-split-a-text-into-two-meaningful-words-in-r – Nico Coallier Apr 06 '17 at 02:50
  • @NicoCoallier that's very cumbersome. My text file is huge and my corpus has a million words that don't make sense. I can't manually enter each and every word. Is there a solution that compares the words in my Corpus to words from the dictionary and eliminates all the meaningless words? – Sisir Apr 06 '17 at 02:51
  • @NicoCoallier thanks. I'll look it up. – Sisir Apr 06 '17 at 02:51
  • @NicoCoallier that post gives me a direction. But, again, I would have to plug in words and check every possible split. Isn't there an alternative? – Sisir Apr 06 '17 at 02:55
  • Can I see what your data looks like? – Nico Coallier Apr 06 '17 at 02:56
  • You can apply a function to a dataframe or a list – Nico Coallier Apr 06 '17 at 03:00
  • http://stackoverflow.com/questions/26715380/extract-english-words-from-a-text-in-r – Nico Coallier Apr 06 '17 at 03:02
  • If you give me an example of your data I can make you a function to do what you want – Nico Coallier Apr 06 '17 at 03:03
  • @NicoCoallier I am mining disaster-event tweets. My data is based on tweets on the Oroville Dam Spillway. For instance: RT PaulRogersSJMN: How fast are #OrovilleDam operators trying to drain the lake? 748,000 gallons per second. At about 50 mph. That's an im… #LahoreBlast #LEMONADE #OrovilleDam #VuraOnAMilli #Quantico #RAWVegas #stopprofiling #TheBachelor #ukpunday #VGPusuanSaAraneta Unbelievable how political the feed on #OrovilleDam is. – Sisir Apr 06 '17 at 03:09
  • But how is it organised? Can you show me a head() or inspect()? – Nico Coallier Apr 06 '17 at 03:19
  • @NicoCoallier here you go: > head(orv_tweets,4) [1] "Understand where those #Flood waters came from: #OrovilleDam #OrovilleSpillway\nSashaPezenik" [2] "This is how those drastic water levels are possible. #OrovilleDam bfwebster zerohedge" [3] "#OrovilleDam overview mincing words HealthRanger LouDobbs KellyannePolls BreitbartNews starsandstripes" [4] "#OrovilleDam provides drinking water for about 25 million #Californians, and water for farmers, fish and wildlife CAWaterAlliance" – Sisir Apr 07 '17 at 21:47
  • Do you want to remove all hashtags? or only the ones that have conjoined words? – Fred Boehm Apr 10 '17 at 02:31
  • @FredBoehm I wish to remove all hashtags. But, I need to separate the conjoined words into separate words. – Sisir Apr 10 '17 at 03:07
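
Since the sample tweets above still carry their original capitalisation, one option is to split the hashtags before the text is lowercased. The following is a rough base R sketch under that assumption; `split_camel` is a hypothetical helper name, and the two tweets are copied from the `head()` output in the comments.

```r
tweets <- c(
  "This is how those drastic water levels are possible. #OrovilleDam bfwebster zerohedge",
  "Understand where those #Flood waters came from: #OrovilleDam #OrovilleSpillway"
)

# Pull out every hashtag: a '#' followed by letters, digits, or underscores.
hashtags <- unlist(regmatches(tweets, gregexpr("#\\w+", tweets, perl = TRUE)))

# Drop the leading '#' and insert a space at each lower-to-upper case boundary.
split_camel <- function(tag) {
  tag <- sub("^#", "", tag)
  gsub("([a-z])([A-Z])", "\\1 \\2", tag, perl = TRUE)
}

split_tags <- tolower(split_camel(hashtags))
print(split_tags)
```

This only helps for tags written in mixed case, such as #OrovilleDam. Tags that are all upper case or all lower case (for example #LEMONADE or #stopprofiling) pass through unsplit and would need a dictionary-based split instead.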

0 Answers