1

To tidy part of a dataset I need to separate one column into several. These data takes the form similar to this:

set.seed(2133)
df <- data.frame(a = paste(sample(1:9, 5, replace=T), sample(c("", "%2", "%3"), 5, replace=T),  sample(c("", "%3", "%4"), 5, replace=T), sep="")) 
df
      a
1     6
2 2%3%4
3   6%2
4   3%2
5 5%2%4

Tidyr's separate function wont do the job and the best idea I have is a series of ifelse statements, like this:

df$One <- ifelse(grepl("1", df$a) == T, 1, 0)
df$Two <- ifelse(grepl("2", df$a) == T, 1, 0)
          a One Two
1     6   0   0
2 2%3%4   0   1
3   6%2   0   1
4   3%2   0   1
5 5%2%4   0   1

What's the best way to go about such tidying. I'm sure many people working with Open Data Kit (ODK) for data collection will come across this.

Simon
  • 675
  • 1
  • 6
  • 15
  • Out of interest - did the untidy data come directly from the form? What was your process of getting data from ODK (which part - Aggregate?) into R in the first place? I'm asking as your problem potentially lies in that pathway, not in the data. – florianm Sep 10 '19 at 06:54
  • 1
    @florianm, yes, by using aggregate. Unfortunately we can’t make changes that far down the line. – Simon Sep 11 '19 at 07:37

1 Answers1

2

We can loop the patterns (i.e. 1, 2) with sapply, get a logical matrix, coerce to binary by wrapping with + and assign it as new columns in 'df'

df[c("One", "Two")] <- +(sapply(1:2, grepl, df$a))
df
#     a One Two
#1     6   0   0
#2 2%3%4   0   1
#3   6%2   0   1
#4   3%2   0   1
#5 5%2%4   0   1

If this is to get the binary output for each unique value in 'a' column, we can split the strings in 'a' by the delimiter (%), the output list can be converted to data.frame using stack, transform the 'values' column to factor class by specifying the levels and do the table

table(transform(stack(setNames(strsplit(as.character(df$a), "[%]"),
       1:nrow(df))), values = factor(values, levels= 1:6))[2:1])

Or we can use a convenient function mtabulate from qdapTools after splitting.

library(qdapTools)
mtabulate(strsplit(as.character(df$a), "[%]"))
akrun
  • 874,273
  • 37
  • 540
  • 662