1

OK, so I have a data frame in R like this

ID <- c(1, 2, 3)
c1 <- c( 1, 1, NA)
c2 <- c(NA, NA, 5)
c3 <- c(NA, NA, NA)
c4 <- c(2, NA, 5)
c5 <- c(5, 7, 3)

df <- data.frame(ID, c1, c2, c3, c4, c5)

So, this is what I'm looking for

1. Treat every row as a vector
2. Be able to remove all NAs in every row/vector
3. In a given row there can't be repeated values (expect for ID vs a number in other cell)
4. I'm looking to "cut" this row/vector.  I don't need 5 values just 2.

I'm doing this for a MAP@k metric, so the order of the numbers (the one on the left is more importante than the next one) son it's important to keep the order.

This is the output that I'm looking for

ID <- c(1, 2, 3)
c1 <- c(1, 1, 5)
c2 <- c(2, 7, 3)

df2 <- data.frame(ID, c1, c2)

Thank you for your help

Iair Kleiman
  • 221
  • 2
  • 8

3 Answers3

2

We loop through the rows of 'df' (using apply with MARGIN as 1), remove the NA elements (!is.na(x)) and get the unique values. Then, if the length of the elements are not the same, the output will be a list ('lst'). We use lengths to get the length of each list element, get theminof it, based on it we subset thelistelements andcbind` with the first column 'ID'.

 lst <- apply(df[-1], 1, function(x) unique(x[!is.na(x)]))
 dfN <- cbind(df[1], do.call(rbind,lapply(lst, function(x) x[seq(min(lengths(lst)))])))
 colnames(dfN)[-1] <- paste0("c", colnames(dfN)[-1])
 dfN
 #  ID c1 c2
 #1  1  1  2
 #2  2  1  7
 #3  3  5  3

NOTE: If the length of unique elements are the same in each row (after removing the NA), the output will be a matrix. Just transpose the output and cbind with the first column.


Or another option is data.table which should be very efficient.

library(data.table)
dM <- melt(setDT(df), id.var="ID", na.rm=TRUE)[, 
          .(value = unique(value), n = seq(uniqueN(value))), ID]
dcast(dM[dM[, n1 := min(tabulate(ID))][, .I[1:.N <=n1] , ID]$V1],
           ID~paste0("c", n), value.var="value")
#  ID c1 c2
#1:  1  1  2
#2:  2  1  7
#3:  3  5  3
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Ugly but should be efficient (chewed through 3M records in about 20secs and 300K in < 2 secs):

sel <- !is.na(df[-1])
tmp <- unique(data.frame(ID=df$ID[row(df[-1])[sel]], c=df[-1][sel]))
tmp$time <- ave(tmp$ID, tmp$ID, FUN=seq_along)

reshape(tmp[tmp$time <= 2,], idvar="ID", direction="wide", sep="")

#  ID c1 c2
#1  1  1  2
#2  2  1  7
#3  3  5  3
thelatemail
  • 91,185
  • 12
  • 128
  • 188
1

Based on akrun data.table idea, I translated the data.table code to dplyr/tidyr (is easier for me to read, that's all). Here is the code

library(dplyr)
library(tidyr)

df_tidy <- df %>%
gather(importance, val, c1:c5) %>% 
na.omit %>% 
arrange(ID, importance) %>%
group_by(ID) %>%
distinct(ID, val) %>%
mutate(place = seq_len(n())) %>%
filter(place <= 2) %>%
mutate(place = paste("c", place, sep="")) %>%
select(-importance) %>%
spread(place, val)

Thank you akrun and thelatemail !

Iair Kleiman
  • 221
  • 2
  • 8