
I am trying to create a hashtag extraction function in R. The function should extract the hashtags from a post if there are any, and return a blank otherwise. My function looks like this:

hashtag_extract= function(text){
              match = str_extract_all(text,"#\\S+")
              if (match) { 
                 return match
                 }else{
               return ''}}
String="#letsdoit #Tonewbeginnign world is on a new#route

But my function is not working and shows me tons of errors. The first error is:

Error: unexpected symbol in:
      "  if (match) { 
     return match"

I want to call it as

hashtag_extract(String)

and the answer should come out like

#letsdoit  #Tonewbeginnign  #route

Eventually I will use sapply to apply this function to a whole column, which is why the if part is important. Please ignore my indentation, since it does not matter in R, but every suggestion will be helpful.

Manu Sharma

3 Answers

  1. Hashtag regexes aren't that simple
  2. I'm not sure you understand the commonly accepted "rules" for hashtags
  3. I do not believe str_extract_all() is returning what you think it is
  4. Just use stringi which stringr functions are built on top of
  5. Folks rly need to stop analyzing tweets

This should handle most, if not all, cases:

get_tags <- function(x) {
  # via http://stackoverflow.com/a/5768660/1457051
  twitter_hashtag_regex <- "(^|[^&\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7])(#|\uFF03)(?!\uFE0F|\u20E3)([\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*[\\p{L}\\p{M}][\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*)"
  # %>% comes from magrittr, which must be attached first (e.g. library(magrittr))
  stringi::stri_match_all_regex(x, twitter_hashtag_regex) %>% 
    purrr::map(~.[,4]) %>% 
    purrr::flatten_chr()

}

tests <- c("#teste_teste      //underscore accepted",
           "#teste-teste      //Hyphen not accepted",
           "#leof_gfg.sdfsd   //dot not accepted",
           "#f34234@45#6fgh6  // @ not accepted",
           "#leo#leo2#asd     //followed hastag without space ",
           "#6663             // only number accepted",
           "_#asd_            // hashtag can't start or finish with underscore",
           "-#sdfsdf-         // hashtag can't start or finish with hyphen",
           ".#sdfsdf.         // hashtag can't start or finish with dot",
           "#leo_leo__leo__leo____leo // decline followed underline")


get_tags(tests)
##  [1] "teste_teste"              "teste"                   
##  [3] "leof_gfg"                 "f34234"                  
##  [5] "leo"                      NA                        
##  [7] NA                         "sdfsdf"                  
##  [9] "sdfsdf"                   "leo_leo__leo__leo____leo"

your_string <- "#letsdoit #Tonewbeginnign world is on a new#route"

get_tags(your_string)
## [1] "letsdoit"       "Tonewbeginnign"

You'll need to tweak the function if you need each set of hashtags to be grouped with each input vector but you really didn't provide much detail on what you're really trying to accomplish.
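
One way the per-input grouping mentioned above might look is sketched below. It is only an illustration: the function name get_tags_grouped and the simplified pattern "#\\w+" are assumptions, and base R's regmatches()/gregexpr() are swapped in for stringi so the example has no package dependencies. Unlike the full regex in get_tags(), this simplified pattern also picks up a tag glued to a preceding word, such as new#route.

```r
# Sketch: return one character vector of tags per input element.
# ASSUMPTION: "#\\w+" is a simplified stand-in for the full hashtag regex.
get_tags_grouped <- function(x, pattern = "#\\w+") {
  # gregexpr() finds every match in each element; regmatches() extracts them
  tags <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # drop the leading "#" so the output resembles get_tags()
  lapply(tags, function(t) sub("^#", "", t))
}

posts <- c("#letsdoit #Tonewbeginnign world is on a new#route",
           "no tags here")
grouped <- get_tags_grouped(posts)
grouped
```

Elements without any hashtag come back as character(0), so the result always has the same length as the input and can be stored as a list column.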

hrbrmstr

@manu sharma I would say you do not need an if/else inside the function. Let the non-matching rows take the value NA, and change those to blanks after applying the function. I hope my code helps you:

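library(stringr)

aaa <- readLines("C:\\MY_FOLDER\\NOI\\file2sample.txt")

# Return the first hashtag match for each line, or NA when there is none
ttt <- function(x) {
  sapply(x, function(line) str_match(line, "#\\w+\\s+"))
}

y <- ttt(aaa)
y[is.na(y)] <- ''  # replace the non-matching rows with a blank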
Shalini Baranwal

Thanks everyone for all the help. I got it working somehow, though it is almost the same as Shalini's answer.

1. Replacing all NAs in message

message[is.na(message)]='abc'

2. Function for extracting the hashtags

hashtag_extrac= function(text){
match = str_extract_all(text,"#\\S+")
if (match!= "") { 
match
} else {
'' }}

3. Applying the function to the whole column

hashtags= sapply(message, hashtag_extrac)
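
For comparison, here is a minimal sketch of the function from the question with its two syntax problems addressed: in R, return is a function and needs parentheses, and if () expects a single TRUE or FALSE rather than the match object itself. As an assumption made only to keep the example dependency-free, base R's regmatches()/gregexpr() stand in for stringr::str_extract_all().

```r
# Sketch: the question's function with a working if/else.
# ASSUMPTION: base R replaces stringr::str_extract_all() here.
hashtag_extract <- function(text) {
  # all "#..." runs of non-space characters in a single string
  match <- regmatches(text, gregexpr("#\\S+", text))[[1]]
  if (length(match) > 0) {                 # test the number of matches
    return(paste(match, collapse = " "))   # return() needs parentheses
  } else {
    return("")
  }
}

String <- "#letsdoit #Tonewbeginnign world is on a new#route"
single <- hashtag_extract(String)

# Applied over a whole column (here just a character vector)
column <- sapply(c(String, "no hashtags"), hashtag_extract, USE.NAMES = FALSE)
```

Because every call returns exactly one string, sapply simplifies the result to a plain character vector with a blank for posts that contain no hashtag.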
Manu Sharma
  • Why is that if statement there? It doesn't do anything... If it's not blank, then do nothing. If it is blank, make it blank. I'm baffled why you don't use the much higher quality answer above. – cory Aug 08 '16 at 12:29
  • Thank you so much! But I would ask you to keep calm. Even in scripts we have our own cases and uses, which we sometimes can't explain in a question, and certainly there are better answers – Manu Sharma Aug 08 '16 at 13:45
  • So you accept @Shalini's answer - I understand it that way, or did I misread it? – Dilettant Aug 09 '16 at 05:32