
I have a problem with a matrix computation. Could you please shed some light on it? Thank you very much in advance!

I have a data frame genderLocation and a matrix test. Their rows and columns correspond to each other by position.

genderLocation[,1:6]

               scanner_gender cmall_gender wechat_gender scanner_location cmall_location wechat_location
    156043              3            2             2        Guangzhou       Shenzhen        Shenzhen
    156044              2           NA            NA         Shenzhen           <NA>
    156045              2           NA             2         Shenzhen           <NA>        Hongkong
    156046              2           NA             2         Shenzhen           <NA>        Shenzhen

test

        [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]  0.8  0.7  0.6  0.6  0.7  0.7
    [2,]  0.8  1.0  1.0  0.6  0.7  0.7
    [3,]  0.8  1.0  0.6  0.6  0.7  0.7
    [4,]  0.8  1.0  0.6  0.6  0.7  0.7

Now I want to aggregate genderLocation row by row: group the values in each row and compute the average of the corresponding entries in the matrix test. For row 156043, for example, the result should be

            2         3 Guangzhou  Shenzhen 
         0.65      0.80      0.60      0.70 

I don't know how to do this with the apply family (for-loops are generally discouraged in R). It seems it should be something like

    > apply(test,1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation)

but I cannot understand the results. Limited to the first 2 rows, the output is easier to read:

    > apply(test[1:2,],1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation[1:2,])
                    c("2", NA)                c("3", "2") c("Guangzhou", "Shenzhen")          c("Shenzhen", "")          c("Shenzhen", NA) 
                          0.65                       0.80                       0.60                       0.70                       0.70 
                    c("2", NA)                c("3", "2") c("Guangzhou", "Shenzhen")          c("Shenzhen", "")          c("Shenzhen", NA) 
                           1.0                        0.8                        0.6                        0.7                        0.7 
                               [,1] [,2]
    c("2", NA)                 0.65  1.0
    c("3", "2")                0.80  0.8
    c("Guangzhou", "Shenzhen") 0.60  0.6
    c("Shenzhen", "")          0.70  0.7
    c("Shenzhen", NA)          0.70  0.7
##### FYI
    test=matrix(c(0.8,0.8,0.8,0.8, 0.7,1,1,1, 0.6,1,0.6,0.6, 0.6,0.6,0.6,0.6, 0.7,0.7,0.7,0.7, 0.7,0.7,0.7,0.7),nrow=4,ncol=6,byrow=F)
    genderLocation<- data.frame(scanner_gender=c(3,2,2,2),cmall_gender=c(2,NA,NA,NA),wechat_gender=c(2,NA,2,2),
                                 scanner_location=c("Guangzhou","Shenzhen","Shenzhen","Shenzhen"),
                                 cmall_location=c("Shenzhen",NA,NA,NA),
                                 wechat_location=c("Shenzhen","","Hongkong","Shenzhen"))
    genderLocation1<-cbind(genderLocation,test)  # combined, because some apply functions accept only one input
Bylon

1 Answer

The following works for your example data but I don't know how stable it is with all of your data. An issue may occur if some of your rows in df do not share a common value with other rows. However, if you want to keep your output as a list, this should work with no problems (that is, skip Reduce...). Keeping that in mind...

--Your data--

    library(purrr)   # map()
    library(dplyr)   # %>%, filter(), group_by(), summarise(), full_join()
    library(tidyr)   # spread()

    test <- matrix(c(0.8,0.8,0.8,0.8,0.7,1,1,1,0.6,1,0.6,0.6,0.6,0.6,0.6,0.6,rep(0.7,8)), nrow=4)

    df <- data.frame(scanner_gender=c(3,2,2,2),
                     cmall_gender=c(2,NA,NA,NA),
                     wechat_gender=c(2,NA,2,2),
                     scanner_location=c("Guangzhou","Shenzhen","Shenzhen","Shenzhen"),
                     cmall_location=c("Shenzhen",NA,NA,NA),
                     wechat_location=c("Shenzhen",NA,"Hongkong","Shenzhen"),
                     stringsAsFactors=FALSE)
    rownames(df) <- c(156043,156044,156045,156046)

--Operation--

I combine map from purrr with other tidyverse verbs to 1) create, for each row, a 2-column data frame with the df row entries in the first column (A) and the test row entries in the second column (B), 2) filter out the rows where A is NA, 3) summarise the mean of B by group, and 4) spread the result into a one-row data frame using A (the keys) as columns.

    L <- map(seq_len(nrow(df)), ~data.frame(A=unlist(df[.x,]), B=test[.x,]) %>%
                  filter(!is.na(A)) %>%       # drop cells with no label
                  group_by(A) %>%
                  summarise(B=mean(B)) %>%    # mean of test values per label
                  spread(A,B) )               # one row, one column per label

I then reduce this list to a single data frame using Reduce and full_join.

    newdf <- Reduce("full_join", L)

--Output--

        `2`   `3` Guangzhou Shenzhen Hongkong
    1  0.65   0.8       0.6     0.70       NA
    2  0.80    NA        NA     0.60       NA
    3  0.70    NA        NA     0.60      0.7
    4  0.70    NA        NA     0.65       NA
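
For comparison, here is a minimal base-R sketch that stays close to the apply/tapply attempt in the question. That attempt passes the whole genderLocation data frame to every call, and as.character() on a data frame returns one string per column, which is where labels such as c("2", NA) come from. The sketch instead pairs row i of test with row i of the labels. The names keys and per_row are only illustrative, and treating the empty string as missing is my assumption about what is wanted.

```r
# Example data as given in the question
test <- matrix(c(0.8,0.8,0.8,0.8, 0.7,1,1,1, 0.6,1,0.6,0.6,
                 0.6,0.6,0.6,0.6, rep(0.7, 8)), nrow = 4)
genderLocation <- data.frame(
  scanner_gender   = c(3, 2, 2, 2),
  cmall_gender     = c(2, NA, NA, NA),
  wechat_gender    = c(2, NA, 2, 2),
  scanner_location = c("Guangzhou", "Shenzhen", "Shenzhen", "Shenzhen"),
  cmall_location   = c("Shenzhen", NA, NA, NA),
  wechat_location  = c("Shenzhen", "", "Hongkong", "Shenzhen"),
  stringsAsFactors = FALSE
)
rownames(genderLocation) <- c(156043, 156044, 156045, 156046)

# One character label per cell, same shape as test
keys <- do.call(cbind, lapply(genderLocation, as.character))
keys[keys %in% ""] <- NA   # assumption: an empty string means "missing"

# For each row i, average test[i, ] within the labels of keys[i, ];
# tapply() drops cells whose label is NA
per_row <- lapply(seq_len(nrow(test)),
                  function(i) tapply(test[i, ], keys[i, ], mean))
names(per_row) <- rownames(genderLocation)

print(per_row[["156043"]])
```

The result is a list with one named vector per row, so rows that share no labels with any other row cause no trouble.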
CPak
  • Thank you, Chi Pak! It works for me. By the way, I tested this approach: it takes 30 minutes for 175,999 rows, about the same as using for-loops. – Bylon Jul 27 '17 at 10:42
  • You can close this question by accepting an answer (check mark to the left) or you can keep this question open if you are looking for other answers. You can also change your accepted answer in the future as well – CPak Jul 27 '17 at 11:38
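
Regarding the run time mentioned in the comments: most of the cost probably comes from building one small data frame per row. A possible alternative, sketched below in base R, is a single tapply() call over all cells at once, grouped by row index and label. I have not timed it on 175,999 rows, so treat it as something to try rather than a measured improvement. The names keys, row_id and result are illustrative, and treating the empty string as missing is my assumption.

```r
# Example data as given in the question
test <- matrix(c(0.8,0.8,0.8,0.8, 0.7,1,1,1, 0.6,1,0.6,0.6,
                 0.6,0.6,0.6,0.6, rep(0.7, 8)), nrow = 4)
genderLocation <- data.frame(
  scanner_gender   = c(3, 2, 2, 2),
  cmall_gender     = c(2, NA, NA, NA),
  wechat_gender    = c(2, NA, 2, 2),
  scanner_location = c("Guangzhou", "Shenzhen", "Shenzhen", "Shenzhen"),
  cmall_location   = c("Shenzhen", NA, NA, NA),
  wechat_location  = c("Shenzhen", "", "Hongkong", "Shenzhen"),
  stringsAsFactors = FALSE
)
rownames(genderLocation) <- c(156043, 156044, 156045, 156046)

# One character label per cell, same shape as test
keys <- do.call(cbind, lapply(genderLocation, as.character))
keys[keys %in% ""] <- NA   # assumption: an empty string means "missing"

# Flatten everything in the same (column-major) order
row_id <- as.vector(row(test))   # row number of every cell
key    <- as.vector(keys)        # label of every cell
val    <- as.vector(test)        # value of every cell

# Mean of val for every (row, label) pair; pairs that never occur become NA
result <- tapply(val, list(row = row_id, key = key), mean)
rownames(result) <- rownames(genderLocation)

print(result)
```

The result is a matrix with one row per original row and one column per distinct label. With many distinct locations that matrix can get wide, in which case a long-format aggregation may be preferable.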