How would I turn a multivalue string into a usable frequency table in R?

Question

I have a field in a data frame called plugins_Apache_module. It contains strings like:

c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
    "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
    "mod_ssl/2.2.9")

I need a frequency table on the modules, and also their versions.

What is the best way to do this in R? I'm rather new to R. I've seen strsplit and gsub, and some chatrooms also suggested I use the qdap package.

Ideally I would want the strings transformed into a data frame with a column for every mod: if the module is present, its version goes in that column. How would I accomplish such a transform?

What data frame format would you suggest if I want top-level frequencies, say mod_ssl (all versions), as well as relational options (for example, mod_perl is very often used with mod_ssl)?

I'm not too sure how to handle such variable-length data when pushing it into a data frame for processing. Any advice is welcome.

I consider the right answer to look like:

mod_perl   mod_python  mod_ssl  mod_auth_passthrough mod_bwlimited 
1.99_16    3.1.3       2.0.52                      
                       2.2.23   2.1                  1.4
                       2.2.9

So basically the first bit (the module name) becomes a column, and the version that follows becomes a row entry.

You need to tell us what you consider to be the correct answer. For instance, some of those three character elements appear to have several mods and others just one. — IRTFM, Oct 18 '13 at 22:33

score 1 · Answer 1 · answered Oct 18 '13 at 22:37

st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52", "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23", "mod_ssl/2.2.9")

 scan(text=st, what="", sep=",")
Read 7 items
[1] "mod_perl/1.99_16"         "mod_python/3.1.3"         "mod_ssl/2.0.52"          
[4] "mod_auth_passthrough/2.1" "mod_bwlimited/1.4"        "mod_ssl/2.2.23"          
[7] "mod_ssl/2.2.9"

strsplit( scan(text=st, what="", sep=","), "/")
Read 7 items
[[1]]
[1] "mod_perl" "1.99_16" 

[[2]]
[1] "mod_python" "3.1.3"     

[[3]]
[1] "mod_ssl" "2.0.52" 

[[4]]
[1] "mod_auth_passthrough" "2.1"                 

[[5]]
[1] "mod_bwlimited" "1.4"          

[[6]]
[1] "mod_ssl" "2.2.23" 

[[7]]
[1] "mod_ssl" "2.2.9"  

table( sapply(strsplit( scan(text=st, what="", sep=","), "/"), "[",1)  )
#----------------
Read 7 items
mod_auth_passthrough        mod_bwlimited             mod_perl           mod_python 
                   1                    1                    1                    1 
             mod_ssl 
                   3 

 table( scan(text=st, what="", sep=",") )
#-----------
Read 7 items

mod_auth_passthrough/2.1        mod_bwlimited/1.4         mod_perl/1.99_16 
                       1                        1                        1 
        mod_python/3.1.3           mod_ssl/2.0.52           mod_ssl/2.2.23 
                       1                        1                        1 
           mod_ssl/2.2.9 
                       1
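
The two tables above count module names and full "module/version" strings separately. A minimal sketch that keeps the name and the version side by side in one long-format data frame, so both counts come from the same object, might look like this. The names `items`, `parts`, `long`, `mod_counts` and `ver_counts` are illustrative and not from the original answer, and the sketch assumes every item contains exactly one slash. Passing `quiet = TRUE` to `scan` suppresses the "Read 7 items" message.

```r
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
        "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
        "mod_ssl/2.2.9")

# Split on commas; quiet = TRUE hides the "Read 7 items" message
items <- scan(text = st, what = "", sep = ",", quiet = TRUE)

# Split each "module/version" item on the slash
parts <- strsplit(items, "/", fixed = TRUE)

# Long format: one row per module occurrence
# (assumes each item has exactly one slash)
long <- data.frame(
  mod = sapply(parts, "[", 1),
  ver = sapply(parts, "[", 2),
  stringsAsFactors = FALSE
)

# Top-level counts per module, regardless of version
mod_counts <- table(long$mod)

# Counts per module and version
ver_counts <- table(long$mod, long$ver)
```

`mod_counts` answers the "all versions" question, while `ver_counts` is a module-by-version cross-tabulation.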

Ok, so sapply turns it into column items? If I wanted to make mod_python the column and save entries like 3.1.3 as the value in the column, could I also use sapply? I'm not fully understanding the final two parameters of that sapply. — EarlyPoster, Oct 19 '13 at 00:53
"[" is a function of two arguments, so this calls x[1] for each x in the first argument. — Frank, Oct 19 '13 at 01:07

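To make the comment about "[" concrete, here is a small sketch showing that passing "[" and an index to sapply is equivalent to writing the subsetting out in an anonymous function. The toy list `parts` and the result names are made up for illustration.

```r
# A toy list shaped like the strsplit() result above
parts <- list(c("mod_perl", "1.99_16"), c("mod_ssl", "2.0.52"))

# "[" is an ordinary function: these two calls do the same thing
direct   <- parts[[1]][1]
as_funct <- `[`(parts[[1]], 1)

# sapply passes the trailing 1 on to "[" as its second argument
first_a <- sapply(parts, "[", 1)
first_b <- sapply(parts, function(x) x[1])

# Changing the index to 2 picks out the versions instead of the names
versions <- sapply(parts, "[", 2)
```

Here `first_a` and `first_b` hold the module names, and `versions` holds the matching version strings.
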
Tyler Rinker · Answer 2 · 2014-03-03T04:58:33.037

You are asking for at least two different things, and adding the desired output helped greatly. I'm not sure that what you ask for is what you really want, but you asked and it seemed like a fun problem. Here's how I would approach this using qdap (this requires qdap version 1.1.0):

## load qdap
library(qdap)

## your data
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
    "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
    "mod_ssl/2.2.9")

## strsplit on commas and slashes
dat <- unlist(lapply(x, strsplit, ",|/"), recursive=FALSE)

## make just a list of mods per row
mods <- lapply(dat, "[", c(TRUE, FALSE))

## make a string of versions
ver <- unlist(lapply(dat, "[", c(FALSE, TRUE)))

## make a lookup key and split it into lists
key <- data.frame(mod = unlist(mods), ver, row = rep(seq_along(mods), 
   sapply(mods, length)))
key2 <- split(key[, 1:2], key$row)

## make it into freq. counts
freqs <- mtabulate(mods)

## copy the freq table to vers (so freqs is kept in case you want it) and replace 0 with NA
vers <- freqs
vers[vers==0] <- NA

## loop through and fill the ones in each row using an env. lookup (%l%)
for(i in seq_len(nrow(vers))) {
    ## columns present in row i (named `present` so the input vector x is not overwritten)
    present <- vers[i, !is.na(vers[i, ]), drop = FALSE]
    vers[i, !is.na(vers[i, ])] <- colnames(present) %l% key2[[i]]
}

## Don't print the NAs
print(vers, na.print = "")

##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                                     1.99_16      3.1.3  2.0.52
## 2                  2.1           1.4                      2.2.23
## 3                                                          2.2.9

## the frequency counts per mods 
freqs

##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                    0             0        1          1       1
## 2                    1             1        0          0       1
## 3                    0             0        0          0       1
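
For readers without qdap, a rough base R sketch of the same wide layout follows, with a co-occurrence count added for the "mod_perl is often used with mod_ssl" part of the question. This is an alternative under stated assumptions, not the method used above: the names `items`, `long`, `wide`, `present` and `cooc` are illustrative, every item is assumed to contain a slash, and if a module were listed twice in one string the later version would overwrite the earlier one.

```r
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
       "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
       "mod_ssl/2.2.9")

# One character vector of "module/version" items per input string
items <- strsplit(x, ",", fixed = TRUE)

# Long format: one line per (input row, module, version)
long <- data.frame(
  row  = rep(seq_along(items), sapply(items, length)),
  item = unlist(items),
  stringsAsFactors = FALSE
)
long$mod <- sub("/.*$", "", long$item)     # text before the first slash
long$ver <- sub("^[^/]*/", "", long$item)  # text after the first slash

# Wide format: one column per module, version string or NA in each cell
mods <- sort(unique(long$mod))
wide <- matrix(NA_character_, nrow = length(x), ncol = length(mods),
               dimnames = list(NULL, mods))
wide[cbind(long$row, match(long$mod, mods))] <- long$ver

# Co-occurrence: cooc[a, b] counts input strings containing both a and b;
# the diagonal holds the per-module totals
present <- !is.na(wide)
cooc <- crossprod(present * 1L)

# Print like the qdap result, hiding the NAs
print(as.data.frame(wide, stringsAsFactors = FALSE), na.print = "")
```

The long data frame is usually the easier shape for counting, and the wide matrix is the easier shape for reading; keeping both avoids having to choose.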
