
I am working on a data set with dimensions

dim(data)
[1] 419612      2

where the second column looks more or less like this:

> unique(data[1:50,"topics"])
[1] {"dom":2.0,"moda":3.0,"rodzina":1.55,"praca":1.42,"finanse":1.96,"edukacja":1.67,"sport":1.96,"muzyka":1.52,"kuchnia":1.8,"plotka":1.8,"zdrowie":1.12,"kibic":1.8,"uroda":2.32,"gra":2.94,"motoryzacja":1.33,"kultura":1.42,"film":3.14,"podróż":1.9,"technologia":1.31}
[2] {"rodzina":2.99,"kultura":4.46,"muzyka":4.5}                                                                                                                                                                                                                            
[3] {"dom":1.93,"rodzina":5.37,"zwierzęta":3.0,"praca":4.3,"finanse":2.11,"sport":2.1,"muzyka":2.99,"nieruchomość":2.8,"kuchnia":6.4,"plotka":2.1,"zdrowie":3.79,"gra":4.25,"motoryzacja":2.57,"kultura":3.13,"film":4.4,"podróż":3.21}                                     
[4] {"plotka":9.5,"uroda":10.06,"kultura":15.67,"muzyka":29.97}                                                                                                                                                                                                             
[5] {"dom":2.99,"rodzina":2.5,"edukacja":3.85,"sport":1.17,"muzyka":1.23,"nieruchomość":2.95,"kuchnia":1.42,"wnętrze":1.33,"kibic":1.17,"ogród":1.33,"motoryzacja":1.17,"film":1.17,"podróż":1.57}                                                                          
[6] {"kuchnia":4.38,"plotka":1.33,"rodzina":1.61,"film":1.33}                                                                                                                                                                                                               
37530 Levels: {"biznes":1.0} ... {"zwierzęta":9.96,"podróż":9.97}

For each row I'd like to choose the word from the topics column that has the highest grade after the : sign. I tried to use the mutate function from the dplyr package, but it looks like it did not work. Operations on characters were done with the stringi package, which is a faster version of stringr. My code and the result of this operation are below. Does anyone know why I get the same value in every row after this operation, and how to achieve the desired result without using a for loop?

> data2 <- data %>%
+   mutate( xx = topics %>%
+             stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>% 
+             unlist %>% 
+             data.frame( topic = .[seq(1,length(.), by=2)], 
+                         waga = .[seq(2,length(.), by=2)] )  %>% 
+             select( topic, waga) %>% arrange( desc( waga)) %>%
+             unique() %>%
+             .[1,1]
+             )
> table(data2$xx)[ which(table(data2$xx) > 1) ]
kuchnia 
 419612 

I've added an extra column nr that is a row number, and then I've stupidly used group_by on that column and summarise instead of mutate, and achieved what I desired... but I'm not proud of my code. Any other ideas?

daneBC1 <- data %>% 
  group_by( nr)  %>%
  summarise( bc1 = topics %>%
               stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>% 
               unlist %>% 
               data.frame( topic = .[seq(1,length(.), by=2)], 
                           waga = .[seq(2,length(.), by=2)] )  %>% 
               select( topic, waga) %>% arrange( desc( waga)) %>%
               unique() %>%
               .[1,1] )



daneBC1$bc1 %>% table

        dom    edukacja        film     finanse         gra       kibic     kuchnia     kultura 
     119802       79487       55569       38134       30425       21757       16371       12356 
       moda motoryzacja      muzyka      plotka      podróż       praca     rodzina       sport 
      11103        7264        6357        4855        3520        3005        2317        2183 
technologia       uroda     zdrowie 
       1441        1055         740 

Sample data

library(archivist)
data <- loadFromGithubRepo( "97f74c5a10f510cce39eafb0d9a1a9e8", 
user="MarcinKosinski", repo="Museum", value = TRUE )
Marcin
  • Why do you want to use regular expressions rather than reading this as JSON? Also, have you checked whether the problem is that the data is saved as a factor rather than character (why a factor here?)? – Tim Apr 19 '15 at 14:38
  • Btw, can you provide sample data? – Tim Apr 19 '15 at 14:39
  • @Tim I've updated my comment with sample data at the end. – Marcin Apr 19 '15 at 14:56
  • @Tim Now I see that applying the `fromJSON` function from the `rjson` package to every row might be more readable, but as you see, simple regex also works :) but the `mutate` function does not... – Marcin Apr 19 '15 at 15:01
  • Yes, but I would still stick to one of the JSON libraries because (a) they are designed for such data structures and so are probably less error prone, and (b) they are probably faster than using regex. If performance is an issue in your project I would check it. – Tim Apr 19 '15 at 15:10
  • Thanks for the suggestion. I'll remember that in the future. – Marcin Apr 19 '15 at 15:16
  • Whoa @MrFlick! Now I see this. Any idea how to perform this operation using dplyr's fast functions? – Marcin Apr 19 '15 at 16:05

1 Answer

Your mutate() call is not "vectorized". mutate() doesn't operate on one row at a time; it operates on entire columns as vectors. Your unlist and .[1,1] extraction take the values for all rows and collapse them down to one vector and then to one value, which mutate() recycles into every row.
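To make the collapse concrete, here is a minimal base R sketch. The two-element `topics` vector is a made-up stand-in for the real column, and the simplified number-only regex is an assumption for illustration, not the pattern from the question.

```r
# Toy stand-in for the `topics` column (hypothetical values)
topics <- c('{"dom":2.0,"film":3.14}', '{"kuchnia":6.4,"sport":2.1}')

# One list element of matches per row
matches <- regmatches(topics, gregexpr("[0-9.]+", topics))

# unlist() flattens every row into a single vector: row boundaries are lost
all_numbers <- unlist(matches)
length(all_numbers)  # 4, although there are only 2 rows

# Anything computed from the flattened vector is one value for the whole column
top_overall <- max(as.numeric(all_numbers))

# Working on one list element at a time keeps one result per row
top_per_row <- vapply(matches, function(x) max(as.numeric(x)), numeric(1))
```

Here `top_overall` has length 1, while `top_per_row` has one entry per input string, which is the shape mutate() needs for a new column.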

You can make a vectorized transformation function with

extr <- Vectorize(. %>%
         stri_extract_all_regex(pattern = "[a-zA-Z0-9óśćłźżęą\\.\\s]+") %>% 
         unlist %>% 
         data.frame( topic = .[seq(1,length(.), by=2)], 
                     waga = .[seq(2,length(.), by=2)] )  %>% 
         select( topic, waga) %>% arrange( desc( waga)) %>%
         unique() %>%
         .[1,1])

and then use it with

data %>% mutate( xx = extr(topics))

although I agree with others that since you have JSON data, it would be better to properly parse this data with a JSON parser rather than trying to re-invent the wheel with regular expressions.
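A rough sketch of the JSON route, assuming the jsonlite package is installed (the comments mention rjson; jsonlite is swapped in here). The helper name `top_topic` is hypothetical, and the input strings are copied from the question's sample output.

```r
library(jsonlite)

# Hypothetical helper: name of the highest-weighted topic in one JSON string
top_topic <- function(json) {
  weights <- unlist(fromJSON(json))   # named numeric vector, e.g. c(rodzina = 2.99, ...)
  names(weights)[which.max(weights)]  # weights are compared as numbers
}

topics <- c('{"rodzina":2.99,"kultura":4.46,"muzyka":4.5}',
            '{"plotka":9.5,"uroda":10.06,"kultura":15.67,"muzyka":29.97}',
            '{"kuchnia":4.38,"plotka":1.33,"rodzina":1.61,"film":1.33}')

# One result per element, so the output is safe to use as a new column
result <- vapply(topics, top_topic, character(1), USE.NAMES = FALSE)
result  # "muzyka" "muzyka" "kuchnia"
```

Inside a pipeline this could be used as `mutate(xx = vapply(as.character(topics), top_topic, character(1), USE.NAMES = FALSE))`; the `as.character()` matters because `topics` is stored as a factor. Note that this compares weights numerically, whereas the regex pipeline appears to sort `waga` without converting it to a number, so a row such as the second one above can come out differently.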

MrFlick