
I have a dataset that I am cleaning up and have certain rows (observations) which I would like to combine. The best way to explain what I am trying to do is with the following example:

df <- data.frame(fruits = c("banana", "banana", "pineapple", "kiwi"), cost = c(1, NA, 2, 3), weight = c(NA, 1, 2, 3), stringsAsFactors = FALSE)
df

cost<-df[,1:2]
weight<-df[,c(1,3)]

cost
weight

cost<-cost[complete.cases(cost),]
weight<-weight[complete.cases(weight),]

key<-data.frame(fruits=unique(df[,1]))
key

mydata <- merge(key, cost, by = "fruits", all.x = TRUE)
mydata <- merge(mydata, weight, by = "fruits", all.x = TRUE)

mydata

In the example above, I would like to keep the information from both variables (cost and weight) for bananas, but it is spread across different records. I can combine it manually for one variable at a time, but my actual dataset has a few dozen variables. How can I accomplish the same task using dplyr, or by applying a function over a set of columns?

Frank
rjss
  • The question needs a little more data - can we assume that the cost is always the same for each item? Can we assume that cost and weight are always the same? does your raw data look like df or like cost and weight? – jeremycg Oct 22 '15 at 21:37
  • It's a real hassle to see what you're doing here, since you insist on overwriting every single object you create. – Frank Oct 23 '15 at 00:46

2 Answers

Using data.table, I would do something like

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)]), by = fruits]
#       fruits cost weight
# 1:    banana    1      1
# 2: pineapple    2      2
# 3:      kiwi    3      3

A cleaner but probably slower option would be

setDT(df)[, lapply(.SD, na.omit), by = fruits]
#       fruits cost weight
# 1:    banana    1      1
# 2: pineapple    2      2
# 3:      kiwi    3      3
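
Both calls above return one row per fruit only because each group happens to have exactly one non-missing value per column. As a hedged sketch (not part of the original answer), the variant below keeps only the first non-missing value, so each group always collapses to a single row. The helper name `first_non_na` is hypothetical, and `as.data.table()` is used instead of `setDT()` so that `df` is not modified by reference.

```r
library(data.table)

df <- data.frame(
  fruits = c("banana", "banana", "pineapple", "kiwi"),
  cost   = c(1, NA, 2, 3),
  weight = c(NA, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value of x,
# or NA if every value in x is missing.
first_non_na <- function(x) x[!is.na(x)][1L]

# Copy into a data.table (leaves df untouched), then collapse
# every non-grouping column to a single value per fruit.
dt <- as.data.table(df)
collapsed <- dt[, lapply(.SD, first_non_na), by = fruits]
collapsed
```

Whether taking the first non-missing value is the right rule depends on the data; if a fruit can legitimately have several different costs, a different summary (or an explicit check) would be needed.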
David Arenburg

We can also use the combo dplyr + tidyr:

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, -fruits) %>%
  group_by(fruits) %>%
  na.omit() %>%
  spread(key, value)
## Source: local data frame [3 x 3]

##      fruits  cost weight
##       (chr) (dbl)  (dbl)
## 1    banana     1      1
## 2      kiwi     3      3
## 3 pineapple     2      2
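
In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. A minimal sketch of the same reshape with the newer functions, assuming a recent tidyr is installed, might look like this (the object name `wide` is only for illustration):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  fruits = c("banana", "banana", "pineapple", "kiwi"),
  cost   = c(1, NA, 2, 3),
  weight = c(NA, 1, 2, 3),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # Long format: one row per (fruit, variable) pair, dropping missing values
  pivot_longer(-fruits, names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  # Back to wide format: one row per fruit, one column per variable
  pivot_wider(names_from = key, values_from = value)
wide
```

As with `spread()`, this assumes at most one non-missing value per fruit and variable; duplicates would need to be resolved before widening.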

EDIT

You might want to check @Frank's solution, which is shorter and uses dplyr only:

df %>%
  group_by(fruits) %>%
  summarise_each(funs(na.omit))
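
Note that `summarise_each()` and `funs()` are deprecated in current dplyr releases. A rough equivalent using `across()` (available from dplyr 1.0.0), sketched here under the assumption that keeping the first non-missing value per group is acceptable, would be:

```r
library(dplyr)

df <- data.frame(
  fruits = c("banana", "banana", "pineapple", "kiwi"),
  cost   = c(1, NA, 2, 3),
  weight = c(NA, 1, 2, 3),
  stringsAsFactors = FALSE
)

collapsed <- df %>%
  group_by(fruits) %>%
  # For every non-grouping column, keep the first non-missing value
  # (NA if the whole group is missing for that column).
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))
collapsed
```

Taking the first non-missing value guarantees one row per fruit, whereas `na.omit` alone returns as many values as there are non-missing entries in the group.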
dickoa