Remove NAs from each variable (column) and combine cases

Question

I have a dataset that I am cleaning up and have certain rows (observations) which I would like to combine. The best way to explain what I am trying to do is with the following example:

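# Example data: the two banana rows each hold part of the information
df <- data.frame(fruits = c("banana", "banana", "pineapple", "kiwi"),
                 cost = c(1, NA, 2, 3),
                 weight = c(NA, 1, 2, 3),
                 stringsAsFactors = FALSE)
df

# Split into one data frame per variable, each keeping the key column
cost <- df[, 1:2]
weight <- df[, c(1, 3)]

cost
weight

# Drop the rows where the variable is NA
cost <- cost[complete.cases(cost), ]
weight <- weight[complete.cases(weight), ]

# One row per fruit, used as the merge key
key <- data.frame(fruits = unique(df[, 1]))
key

# Left-join each cleaned variable back onto the key
mydata <- merge(key, cost, by = "fruits", all.x = TRUE)
mydata <- merge(mydata, weight, by = "fruits", all.x = TRUE)

mydata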

In the example above I would like to keep the information from both variables (cost and weight) for bananas, but unfortunately it is spread across different records. I can accomplish this manually for one variable at a time, but my actual dataset has a few dozen variables. I would like to know how I can do the same task using dplyr or apply over a set of columns.

The question needs a little more data - can we assume that the cost is always the same for each item? Can we assume that cost and weight are always the same? does your raw data look like df or like cost and weight? — jeremycg, Oct 22 '15 at 21:37
It's a real hassle to see what you're doing here, since you insist on overwriting every single object you create. — Frank, Oct 23 '15 at 00:46

score 2 · Answer 1 · answered Oct 22 '15 at 21:36

Using data.table I would do something like

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)]), by = fruits]
#       fruits cost weight
# 1:    banana    1      1
# 2: pineapple    2      2
# 3:      kiwi    3      3

A cleaner but probably slower option would be

setDT(df)[, lapply(.SD, na.omit), by = fruits]
#       fruits cost weight
# 1:    banana    1      1
# 2: pineapple    2      2
# 3:      kiwi    3      3
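
Both variants give one row per fruit when every column has exactly one non-missing value within each group, as in the example data. If a group could have none or several, na.omit returns results of differing lengths per column. As a hedged base R sketch for that situation, the following keeps the first non-missing value per column and falls back to NA. The helper name first_non_na is hypothetical, and "keep the first value" is an assumption about what is wanted, not something stated in the question.

```r
# Example data from the question
df <- data.frame(
  fruits = c("banana", "banana", "pineapple", "kiwi"),
  cost   = c(1, NA, 2, 3),
  weight = c(NA, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse every non-key column within each fruit
res <- aggregate(df[-1], by = df["fruits"], FUN = first_non_na)
res
```

Because the helper is applied to every column except the key, the same call should carry over to a few dozen variables without listing them.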

David Arenburg


I am pretty sure this is a duplicate. – akrun Oct 23 '15 at 07:28

dickoa · Accepted Answer · 2015-10-23T01:09:19.103

We can also use the combo dplyr + tidyr:

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, -fruits) %>%
  group_by(fruits) %>%
  na.omit() %>%
  spread(key, value)
## Source: local data frame [3 x 3]

##      fruits  cost weight
##       (chr) (dbl)  (dbl)
## 1    banana     1      1
## 2      kiwi     3      3
## 3 pineapple     2      2
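
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A rough equivalent of the pipeline above, assuming a recent tidyr and the df from the question, might look like this:

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  fruits = c("banana", "banana", "pineapple", "kiwi"),
  cost   = c(1, NA, 2, 3),
  weight = c(NA, 1, 2, 3),
  stringsAsFactors = FALSE
)

res <- df %>%
  # Long format: one row per (fruit, variable) pair
  pivot_longer(-fruits, names_to = "key", values_to = "value") %>%
  # Drop the pairs whose value is missing
  drop_na(value) %>%
  # Back to wide format: one row per fruit
  pivot_wider(names_from = key, values_from = value)
res
```

Like the original pipeline, this sketch assumes at most one non-missing value per fruit and variable.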

EDIT

You might want to check @Frank's solution, which is shorter and uses dplyr only:

df %>%
  group_by(fruits) %>%
  summarise_each(funs(na.omit))
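
Note that summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later, is shown here. It takes the first non-missing value per column so that a group with no value still yields NA, which is an assumption that goes beyond the original one-liner.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  fruits = c("banana", "banana", "pineapple", "kiwi"),
  cost   = c(1, NA, 2, 3),
  weight = c(NA, 1, 2, 3),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(fruits) %>%
  # For every non-grouping column, keep the first non-NA value (NA if none)
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")
res
```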

edited Oct 23 '15 at 01:09

answered Oct 22 '15 at 22:09

dickoa


Or just `df %>% group_by(fruits) %>% summarise_each(funs(na.omit))`? – Frank Oct 23 '15 at 00:51

@Frank Thanks, I think your approach is way better. – dickoa Oct 23 '15 at 01:09
@Frank I would say your approach looks awfully familiar :) – David Arenburg Oct 23 '15 at 04:40
