22

I am trying to convert a factor column into multiple boolean columns as the image below shows. The data is from weather stations as retrieved using the fine weatherData package. The factor column I want to convert into multiple boolean columns contains 11 factors. Some of them are single "events", and some of them are a combination of "events".

Here is an image showing what I want to achieve: enter image description here This is R code which will produce the data frame with combined factors that I want to convert into several boolean columns:

df <- read.table(text="
date    Events
1/8/2013    Rain
1/9/2013    Fog
1/10/2013   ''
1/11/2013   Fog-Rain
1/12/2013   Snow
1/13/2013   Rain-Snow
1/14/2013   Rain-Thunderstorm
1/15/2013   Thunderstorm
1/16/2013   Fog-Rain-Thunderstorm
1/17/2013   Fog-Thunderstorm
1/18/2013   Fog-Rain-Thunderstorm-Snow",
                 header=T)
df$date <- as.character(as.Date(df$date, "%m/%d/%Y"))

Thanks in advance.

Jose R
  • 930
  • 1
  • 11
  • 22

7 Answers7

13

You could try:

 lst <- strsplit(as.character(df$Events),"-")
 lvl <- unique(unlist(lst))      
 res <- data.frame(date=df$date,
            do.call(rbind,lapply(lst, function(x) table(factor(x, levels=lvl)))), 
                                       stringsAsFactors=FALSE)

  res
 #         date Rain Fog Snow Thunderstorm
 #1  2013-01-08    1   0    0            0
 #2  2013-01-09    0   1    0            0
 #3  2013-01-10    0   0    0            0
 #4  2013-01-11    1   1    0            0
 #5  2013-01-12    0   0    1            0
 #6  2013-01-13    1   0    1            0
 #7  2013-01-14    1   0    0            1
 #8  2013-01-15    0   0    0            1
 #9  2013-01-16    1   1    0            1
 #10 2013-01-17    0   1    0            1
# 11 2013-01-18    1   1    1            1

Or possibly, this could be faster than the above (contributed by @alexis_laz)

  setNames(data.frame(df$date, do.call(rbind,lapply(lst, function(x) as.integer(lvl %in% x)) )), c("date", lvl))  

Or

 library(devtools)
 library(data.table)
 source_gist("11380733")
 library(reshape2) #In case it is needed 

 res1 <- dcast.data.table(cSplit(df, "Events", "-", "long"), date~Events)
 res2 <- merge(subset(df, select=1), res1, by="date", all=TRUE)
 res2 <- as.data.frame(res2)
 res2[,-1]  <- (!is.na(res2[,-1]))+0
 res2[,c(1,3,2,4,5)]
 #          date Rain Fog Snow Thunderstorm
  #1  2013-01-08    1   0    0            0
  #2  2013-01-09    0   1    0            0
  #3  2013-01-10    0   0    0            0
  #4  2013-01-11    1   1    0            0
  #5  2013-01-12    0   0    1            0
  #6  2013-01-13    1   0    1            0
  #7  2013-01-14    1   0    0            1
  #8  2013-01-15    0   0    0            1
  #9  2013-01-16    1   1    0            1
  #10 2013-01-17    0   1    0            1
  #11 2013-01-18    1   1    1            1

Or

 library(qdap)
 with(df, termco(Events, date, c("Rain", "Fog", "Snow", "Thunderstorm")))[[1]][,-2]
 #         date Rain Fog Snow Thunderstorm
 #1  2013-01-08    1   0    0            0
 #2  2013-01-09    0   1    0            0
 #3  2013-01-10    0   0    0            0
 #4  2013-01-11    1   1    0            0
 #5  2013-01-12    0   0    1            0
 #6  2013-01-13    1   0    1            0
 #7  2013-01-14    1   0    0            1
 #8  2013-01-15    0   0    0            1
 #9  2013-01-16    1   1    0            1
 #10 2013-01-17    0   1    0            1
 #11 2013-01-18    1   1    1            1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Second example needs reshape2 for dcast – Spacedman Aug 17 '14 at 10:21
  • @Spacedman. Thanks. I thought `dcast` alone from `data.table` would work as I loaded both `reshape2` and `data.table`. Seems like `dcast.data.table` is the one that works without using `reshape2` – akrun Aug 17 '14 at 10:40
  • @David Arenburg. I tried it without loading reshape2 and it is working for me. – akrun Aug 17 '14 at 10:52
  • @David Arenburg. Mine is `data.table_1.9.2` – akrun Aug 17 '14 at 10:53
  • @David Arenburg. I tried it again on a fresh console. After `library(data.table)`, checked sessionInfo() `loaded via a namespace (and not attached): [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4 stringr_0.6.2` – akrun Aug 17 '14 at 10:56
  • Yeah, it seems like it is loading it by default now, although in the help files it shows that you need to load it manually. Probably some new feature or something – David Arenburg Aug 17 '14 at 10:57
  • Using `dcast` alone gives me error though. `dcast(cSplit(df, "Events", "-", "long"), date~Events) Error: could not find function "dcast"` – akrun Aug 17 '14 at 10:58
  • Because it is not `dcast` is it is `dcast.data.table`. If `reshape2` is loaded, `dcast` (being a generic function) will understand that it needs to use `dcast.data.table` when receiving a `data.table` object. `melt` function, for example, won't work without `reshape2` – David Arenburg Aug 17 '14 at 11:00
  • I guess you could gain a bit of speed by `lapply(lst, function(x) as.integer(lvl %in% x))`? – alexis_laz Aug 17 '14 at 14:12
  • @alexis_laz. Thanks for the comment. It seems possible. – akrun Aug 17 '14 at 14:15
  • @alexis_laz's suggestion is [a lot faster](https://gist.github.com/mrdwab/507c5f31ffe97e19faa2) (but still not as fast as a raw `splitstackshape:::charMat`). I can't test "qdap" though.... – A5C1D2H2I1M1N2O1R2T1 Aug 17 '14 at 16:26
  • @Ananda Mahto. Thanks for the comment and testing. I guess for the `fun3`, it would be a bit more faster if I didn't convert it to `data.frame` – akrun Aug 17 '14 at 17:16
  • @DavidArenburg, `dcast.data.table` has never required `reshape2`. Not yet. `?dcast.data.table` has never, to my knowledge, suggested this as well. Which part exactly gave you that impression? `dcast` isn't a generic in `reshape2` (unless something has changed recently)... akrun +1. – Arun Aug 17 '14 at 22:45
  • @Arun, maybe I was confusing it with `melt`. But do you explain the error he gets then? Doesn't `dcast` becomes generic when loading `data.table`? – David Arenburg Aug 18 '14 at 00:04
  • `dcast` is a function in `reshape2` package. `melt` is a S3-generic in `reshape2` package for which `.data.frame` method exists in that package, and `.data.table` method in the `data.table` package. Hence the need to load `reshape2` before you can use it as `melt()` for the proper S3 method to be dispatched. Without loading `reshape2`, `dcast()` doesn't therefore exist, hence the error. Please have a look at `?dcast.data.table`. Also the [NAMESPACE](https://github.com/Rdatatable/data.table/blob/master/NAMESPACE) file might make things clearer. – Arun Aug 18 '14 at 01:36
  • @akrun, I don't think those are particularly expensive steps. I tried some benchmarks and didn't get anything to give much of a boost. – A5C1D2H2I1M1N2O1R2T1 Aug 18 '14 at 04:54
  • @Ananda Mahto. Thanks again for testing it. Looks like you won in the speed test along with mso's method :-) – akrun Aug 18 '14 at 05:03
  • @akrun, well, considering that "splitstackshape" is my only serious package, I want to also make sure its efficient. Nice use of `cSplit`, by the way, which should be making its way into whatever the next release of "splitstackshape" would be. – A5C1D2H2I1M1N2O1R2T1 Aug 18 '14 at 05:07
8

The easiest thing I can think of is concat.split.expanded from my "splitstackshape" package (devel version 1.3.0, from GitHub).

## Get the right version of the package
library(devtools)
install_github("splitstackshape", "mrdwab", ref = "devel")
packageVersion("splitstackshape")
# [1] ‘1.3.0’

## Split up the relevant column
concat.split.expanded(df, "Events", "-", type = "character", 
                      fill = 0, drop = TRUE)
#          date Events_Fog Events_Rain Events_Snow Events_Thunderstorm
# 1  2013-01-08          0           1           0                   0
# 2  2013-01-09          1           0           0                   0
# 3  2013-01-10          0           0           0                   0
# 4  2013-01-11          1           1           0                   0
# 5  2013-01-12          0           0           1                   0
# 6  2013-01-13          0           1           1                   0
# 7  2013-01-14          0           1           0                   1
# 8  2013-01-15          0           0           0                   1
# 9  2013-01-16          1           1           0                   1
# 10 2013-01-17          1           0           0                   1
# 11 2013-01-18          1           1           1                   1

Answering this question, I realize that I've somewhat foolishly hard-coded a "trim" feature in concat.split.expanded that could slow things down a lot. If you want a much faster approach, use charMat (the function called by concat.split.expanded) directly on the split up version of your "Events" column, like this:

splitstackshape:::charMat(
    strsplit(as.character(indf[, "Events"]), "-", fixed = TRUE), fill = 0)

For some benchmarks, check out this Gist.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
4

Can be done with base R using 'grep':

ddf = data.frame(df$date, df$Events, "Rain"=rep(0), "Fog"=rep(0), "Snow"=rep(0), "Thunderstorm"=rep(0)) 

for(i in 3:6)   ddf[grep(names(ddf)[i],ddf[,2]),i]=1

ddf
      df.date                  df.Events Rain Fog Snow Thunderstorm
1  2013-01-08                       Rain    1   0    0            0
2  2013-01-09                        Fog    0   1    0            0
3  2013-01-10                               0   0    0            0
4  2013-01-11                   Fog-Rain    1   1    0            0
5  2013-01-12                       Snow    0   0    1            0
6  2013-01-13                  Rain-Snow    1   0    1            0
7  2013-01-14          Rain-Thunderstorm    1   0    0            1
8  2013-01-15               Thunderstorm    0   0    0            1
9  2013-01-16      Fog-Rain-Thunderstorm    1   1    0            1
10 2013-01-17           Fog-Thunderstorm    0   1    0            1
11 2013-01-18 Fog-Rain-Thunderstorm-Snow    1   1    1            1
rnso
  • 23,686
  • 25
  • 112
  • 234
  • 2
    +1. I'm just updating [my benchmarks](https://gist.github.com/mrdwab/507c5f31ffe97e19faa2) and this is a speedy solution (the fastest so far) if you know in advance the values to expect. – A5C1D2H2I1M1N2O1R2T1 Aug 18 '14 at 04:30
  • I've built on this answer to dynamically identify the unique options so that you don't need to know the values in advance. – Rumpleteaser Oct 18 '20 at 11:01
3

Here's an approach with qdapTools:

library(qdapTools)

matrix2df(mtabulate(lapply(split(as.character(df$Events), df$date), 
    function(x) strsplit(x, "-")[[1]])), "Date")

##          Date Fog Rain Snow Thunderstorm
## 1  2013-01-08   0    1    0            0
## 2  2013-01-09   1    0    0            0
## 3  2013-01-10   0    0    0            0
## 4  2013-01-11   1    1    0            0
## 5  2013-01-12   0    0    1            0
## 6  2013-01-13   0    1    1            0
## 7  2013-01-14   0    1    0            1
## 8  2013-01-15   0    0    0            1
## 9  2013-01-16   1    1    0            1
## 10 2013-01-17   1    0    0            1
## 11 2013-01-18   1    1    1            1

Here is the same answer with magrittr as it makes the chain clearer:

split(as.character(df$Events), df$date) %>%
    lapply(function(x) strsplit(x, "-")[[1]]) %>%
    mtabulate() %>%
    matrix2df("Date")
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
2

Create a vector with factors

set.seed(1)
n <- c("Rain", "Fog", "Snow", "Thunderstorm")
v <- sapply(sample(0:3,100,T), function(i) paste0(sample(n,i), collapse = "-"))
v <- as.factor(v)

Function which returns matrix with desired output that shoulb be cbind'ed to the initial data.frame

mSplit <- function(vec) {
  if (!is.character(vec))
    vec <- as.character(vec)
  L <- strsplit(vec, "-")
  ids <- unlist(lapply(seq_along(L), function(i) rep(i, length(L[[i]])) ))
  U <- sort(unique(unlist(L)))
  M <- matrix(0, nrow = length(vec), 
              ncol = length(U), 
              dimnames = list(NULL, U))
  M[cbind(ids, match(unlist(L), U))] <- 1L
  M
}

Solution is based on the answer of Ananda Mahto to that SO question. It should be pretty fast.

res <- mSplit(v)
Community
  • 1
  • 1
DrDom
  • 4,033
  • 1
  • 21
  • 23
2

I think what you need in this case is a simple call for the function dummy. Let's call the target column. target_cat.

df_target_bin <- data.frame(dummy(target_cat, "<prefix>"))

This will create a new data frame with a column with 0s and 1s values for each value of target_cat.

To convert the columns into logical, and with logical I mean the values be TRUE and FALSE, then use the function as.logical.

df_target_logical <- apply(df_target_bin, as.logical)
Rshad Zhran
  • 496
  • 4
  • 17
2

Building on the answer by @rnso

The following will identify all the unique elements and then dynamically generate new columns with the relevant data in them.

options = unique(unlist(strsplit(df$Events, '-'), recursive=FALSE))
for(o in options){
  df$newcol = rep(0)
  df <- rename(df, !!o := newcol)
  df[grep(o, df$Events), o] = 1
}

Results:

         date                     Events Rain Fog Snow Thunderstorm
1  2013-01-08                       Rain    1   0    0            0
2  2013-01-09                        Fog    0   1    0            0
3  2013-01-10                               0   0    0            0
4  2013-01-11                   Fog-Rain    1   1    0            0
5  2013-01-12                       Snow    0   0    1            0
6  2013-01-13                  Rain-Snow    1   0    1            0
7  2013-01-14          Rain-Thunderstorm    1   0    0            1
8  2013-01-15               Thunderstorm    0   0    0            1
9  2013-01-16      Fog-Rain-Thunderstorm    1   1    0            1
10 2013-01-17           Fog-Thunderstorm    0   1    0            1
11 2013-01-18 Fog-Rain-Thunderstorm-Snow    1   1    1            1
Rumpleteaser
  • 4,142
  • 6
  • 39
  • 52