Remove NAs in row, and move the cell on the right were the NA was located in R also unique values

Question

OK, so I have a data frame in R like this

ID <- c(1, 2, 3)
c1 <- c( 1, 1, NA)
c2 <- c(NA, NA, 5)
c3 <- c(NA, NA, NA)
c4 <- c(2, NA, 5)
c5 <- c(5, 7, 3)

df <- data.frame(ID, c1, c2, c3, c4, c5)

So, this is what I'm looking for

1. Treat every row as a vector
2. Be able to remove all NAs in every row/vector
3. In a given row there can't be repeated values (expect for ID vs a number in other cell)
4. I'm looking to "cut" this row/vector.  I don't need 5 values just 2.

I'm doing this for a MAP@k metric, so the order of the numbers (the one on the left is more importante than the next one) son it's important to keep the order.

This is the output that I'm looking for

ID <- c(1, 2, 3)
c1 <- c(1, 1, 5)
c2 <- c(2, 7, 3)

df2 <- data.frame(ID, c1, c2)

Thank you for your help

akrun · Answer 1 · 2016-05-17T05:13:14.670

We loop through the rows of 'df' (using apply with MARGIN as 1), remove the NA elements (!is.na(x)) and get the unique values. Then, if the length of the elements are not the same, the output will be a list ('lst'). We use lengths to get the length of each list element, get theminof it, based on it we subset thelistelements andcbind` with the first column 'ID'.

 lst <- apply(df[-1], 1, function(x) unique(x[!is.na(x)]))
 dfN <- cbind(df[1], do.call(rbind,lapply(lst, function(x) x[seq(min(lengths(lst)))])))
 colnames(dfN)[-1] <- paste0("c", colnames(dfN)[-1])
 dfN
 #  ID c1 c2
 #1  1  1  2
 #2  2  1  7
 #3  3  5  3

NOTE: If the length of unique elements are the same in each row (after removing the NA), the output will be a matrix. Just transpose the output and cbind with the first column.

Or another option is data.table which should be very efficient.

library(data.table)
dM <- melt(setDT(df), id.var="ID", na.rm=TRUE)[, 
          .(value = unique(value), n = seq(uniqueN(value))), ID]
dcast(dM[dM[, n1 := min(tabulate(ID))][, .I[1:.N <=n1] , ID]$V1],
           ID~paste0("c", n), value.var="value")
#  ID c1 c2
#1:  1  1  2
#2:  2  1  7
#3:  3  5  3

I think the long-then-wide paradigm is the way to go for this sort of problem. It scales nicely and can be converted to dplyr/data.table/base R/sql whatever relatively easily. — thelatemail, May 17 '16 at 05:32
I used the data.table idea and translate it into dplyr/tidyr. It works! Thank you so much! — Iair Kleiman, May 17 '16 at 11:45

thelatemail · Answer 2 · 2016-05-17T05:13:19.907

1

Ugly but should be efficient (chewed through 3M records in about 20secs and 300K in < 2 secs):

sel <- !is.na(df[-1])
tmp <- unique(data.frame(ID=df$ID[row(df[-1])[sel]], c=df[-1][sel]))
tmp$time <- ave(tmp$ID, tmp$ID, FUN=seq_along)

reshape(tmp[tmp$time <= 2,], idvar="ID", direction="wide", sep="")

#  ID c1 c2
#1  1  1  2
#2  2  1  7
#3  3  5  3

edited May 17 '16 at 05:13

answered May 17 '16 at 05:00

thelatemail

91,185
12
128
188

score 1 · Answer 3 · answered May 17 '16 at 11:53

Based on akrun data.table idea, I translated the data.table code to dplyr/tidyr (is easier for me to read, that's all). Here is the code

library(dplyr)
library(tidyr)

df_tidy <- df %>%
gather(importance, val, c1:c5) %>% 
na.omit %>% 
arrange(ID, importance) %>%
group_by(ID) %>%
distinct(ID, val) %>%
mutate(place = seq_len(n())) %>%
filter(place <= 2) %>%
mutate(place = paste("c", place, sep="")) %>%
select(-importance) %>%
spread(place, val)

Thank you akrun and thelatemail !

Remove NAs in row, and move the cell on the right were the NA was located in R also unique values

3 Answers3