
I've posted a sample of the data I'm working with here.

"Parcel.." is the main indexing variable and there are good amount of duplicates. The duplicates are not consistent in all of the other columns. My goal is to aggregate the data set so that there is only one observation of each parcel. I've used the following code to attempt summing numerical vectors:

aggregate(Ap.sample$X.11 ~ Ap.sample$Parcel.., FUN = sum)

The problem is that it drops every column except the parcel and the one vector I reference.
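One way to keep more than one column in that call is to wrap the left-hand side of the formula in `cbind()`. A minimal sketch, using a hypothetical three-row data frame (the real `Ap.sample` is not shown in the question):

```r
# Hypothetical small data frame mimicking the structure described
Ap.sample <- data.frame(
  Parcel.. = c("A1", "A1", "B2"),
  X.11     = c(1, 2, 3),
  X.13     = c(10, 20, 30)
)

# cbind() on the left-hand side lets aggregate() sum several columns at once,
# so more than one vector survives the aggregation
agg <- aggregate(cbind(X.11, X.13) ~ Parcel.., data = Ap.sample, FUN = sum)
agg
#   Parcel.. X.11 X.13
# 1       A1    3   30
# 2       B2    3   30
```

This still applies a single function to every column listed, so it doesn't solve the sum-some/average-others requirement below on its own.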

My goal is to apply one rule (sum) to certain numeric vectors (X.11, X.13, X.15, num_units), a different rule (mean) to other numeric vectors (Acres, Ttl_sq_ft, Mtr.Size), and yet another rule (just pick one value) to the character variables (pretend there's another column "customer.name" with different values for the same parcel ID, e.g. "Steven condominiums" and "Stephen apartments"), and to simply drop the extra observations for all remaining variables.

I've tried plyr's numcolwise() function, but that also doesn't do what I need. My instinct would be to specify the columns I want to sum and the columns I want to average, like so:

DT<-as.data.table(Ap.sample)
sum_cols<-Ap.05[,c(10,12,14)]
mean_cols<-Ap.05[,c(17:19)]

and then use lapply() to apply the right function to each column.

df05<-DT[,lapply(.SD,sum), by=DT$Parcel..,.SDcols=sum_cols]
df05<-DT[,lapply(.SD,mean),by=DT$Parcel..,.SDcols=mean_cols]

but that spits out errors on the first go. I know there's a simpler workaround for this than trying to muscle through it.
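The data.table attempt above fails because `.SDcols` expects column names (or positions), not data frames, and `by=` takes the bare column name rather than `DT$Parcel..`. A sketch of the corrected idea, run once per rule and merged on the key, using a hypothetical sample with a subset of the described columns:

```r
library(data.table)

# Hypothetical sample with the column types described in the question
DT <- data.table(
  Parcel..      = c("A1", "A1", "B2"),
  X.11          = c(1, 2, 5),
  Acres         = c(2, 4, 6),
  customer.name = c("Steven condominiums", "Stephen apartments", "Lakeview")
)

# One pass per rule: .SDcols takes column *names*, by= the bare column name
sums  <- DT[, lapply(.SD, sum),  by = Parcel.., .SDcols = c("X.11")]
means <- DT[, lapply(.SD, mean), by = Parcel.., .SDcols = c("Acres")]
# "pick one" rule: keep the first customer.name seen for each parcel
picks <- DT[, .(customer.name = customer.name[1]), by = Parcel..]

# Join the per-rule results back together on the parcel ID
result <- Reduce(function(x, y) merge(x, y, by = "Parcel.."),
                 list(sums, means, picks))
```

For parcel "A1" this yields X.11 = 3 (summed), Acres = 3 (averaged), and customer.name = "Steven condominiums" (first in row order).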

slap-a-da-bias

1 Answer


You could do:

library(dplyr)
df %>% 
  # create a hypothetical "customer.name" column 
  mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
  # group data by "Parcel.."
  group_by(Parcel..) %>% 
  # apply sum() to the selected columns
  mutate_each(funs(sum(.)), one_of("X.11", "X.13", "X.15", "num_units")) %>%
  # likewise for mean()
  mutate_each(funs(mean(.)), one_of("Acres", "Ttl_sq_ft", "Mtr.Size")) %>%
  # keep the grouping key along with the desired columns
  select(Parcel.., X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
  # de-duplicate while keeping an arbitrary value (the first one in row order)
  distinct(Parcel..)
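The mutate-then-distinct approach above can also be collapsed into a single `summarise()` call, which states each rule per column explicitly. A minimal sketch with hypothetical data (only a subset of the question's columns, and the invented "customer.name"):

```r
library(dplyr)

# Hypothetical data standing in for the question's data frame
df <- data.frame(
  Parcel..      = c("A1", "A1", "B2"),
  X.11          = c(1, 2, 5),
  Acres         = c(2, 4, 6),
  customer.name = c("Steven condominiums", "Stephen apartments", "Lakeview"),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(Parcel..) %>%
  summarise(
    X.11          = sum(X.11),           # sum rule
    Acres         = mean(Acres),         # mean rule
    customer.name = first(customer.name) # "pick one" rule
  )
```

Columns not named in `summarise()` are dropped automatically, which matches the "just delete the extra observations" requirement.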
Steven Beaupré
    Was just about to comment that I liked your first answer. Follow what you had originally with `select(one_of(...` the mutated variables, `Parcel..`, and the string variables to drop the unwanted variables, then `%>% distinct(Parcel..)` to de-duplicate while keeping an arbitrary value (the first one in row order) of the string variable(s). – Sam Firke May 15 '15 at 00:29
  • Is there a method to do the same thing without dplyr? `> install.packages('dplyr_0.1.tar.gz',repos=NULL,type='source') Installing package into ‘C:/Users/u0432865/Documents/R/win-library/3.0’ (as ‘lib’ is unspecified) Warning: invalid package 'dplyr_0.1.tar.gz' Error: ERROR: no packages specified Warning in install.packages : running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l "C:\Users\u0432865\Documents\R\win-library\3.0" "dplyr_0.1.tar.gz"' had status 1 Warning in install.packages : installation of package ‘dplyr_0.1.tar.gz’ had non-zero exit status` – slap-a-da-bias May 15 '15 at 21:49
  • 1
    @slap-a-da-bias Try `install.packages("dplyr")` – Steven Beaupré May 16 '15 at 10:17
  • @Steven Beaupré Sorry for the delayed response. My stats project is sorta contingent on the library I'm at being open, and I didn't make it in to work on my project again until now. I can install dplyr, but when I run `library(dplyr)` it tells me there is no package with that name. My next idea would be to just create a new data frame with the vectors I need in it and to write out for loops for each column. Again, I don't know how I would code that, but is that a fix if I can't seem to get dplyr to work? – slap-a-da-bias May 18 '15 at 20:33
  • NVM. got a newer version of R which fixed the problem. Thanks for the help – slap-a-da-bias May 19 '15 at 18:31