3

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric.

I want to combine the rows of repeated names into one column and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply and have read a lot of the Q&As here and can't seem to find a solution

Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
                       "Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)

The data looks like this:

  Name data.Point.1 data.Point.2 data.Point.3
1 Jeff            3            6            2
2 Hank            4            9            2
3  Tom            3            2            8
4 Jeff            3            5            6
5 Hank            4            7            4
6 Jeff            3            4            3
7 Jeff            3            8            3
8 Bill            6            2            3
9 Mark            2            9            1

I'd like to get it to look like this (summing columns 3 and 4 and leaving column 1 alone. I'd like it to look like this:

  Name data.Point.1 data.Point.2 data.Point.3
1 Jeff            3           23           14
2 Hank            4           16            6
3  Tom            3            2            8
8 Bill            6            2            3
9 Mark            2            9            1

Any help would great. Thanks!

dimitris_ps
  • 5,849
  • 3
  • 29
  • 55
user3585829
  • 945
  • 11
  • 24

3 Answers3

3

Another solution which is a bit more straightforward is by using the library dplyr

library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>%  # group the columns you want to "leave alone"
  summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4

if you want to sum over all other columns except those you want to "leave alone" then replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum))

dimitris_ps
  • 5,849
  • 3
  • 29
  • 55
1

I'd do it this way using data.table:

setDT(data)[, c(data.Point.1 = data.Point.1[1L], 
                lapply(.SD, sum)), by=Name, 
                .SDcols = -"data.Point.1"]
#    Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff            3           23           14
# 2: Hank            3           16            6
# 3:  Tom            3            2            8
# 4: Bill            3            2            3
# 5: Mark            3            9            1

We group by Name, and for each group, get first element of data.Point.1, and for the rest of the columns, we compute sum by using base function lapply and looping it through the columns of .SD, which stands for Subset of Data. The columns in .SD is provided by .SDcols, to which we remove data.Point.1, so that all the other columns are provided to .SD.

Check the HTML vignettes for detailed info.

Arun
  • 116,683
  • 26
  • 284
  • 387
0

You could try

library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
    data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
#    Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff            3           23           14
#2: Hank            4           16            6
#3:  Tom            3            2            8
#4: Bill            6            2            3
#5: Mark            2            9            1

or using base R

data$Name <- factor(data$Name, levels=unique(data$Name))
res <- do.call(rbind,lapply(split(data, data$Name), function(x) {
                x[3:4] <- colSums(x[3:4])
                x[1,]} ))

Or using dplyr, you can use summarise_each to apply the function that needs to be applied on multiple columns, and cbind the output with the 'summarise' output for a single column

library(dplyr)
res1 <- data %>%
            group_by(Name) %>% 
            summarise(data.Point.1=data.Point.1[1L])

res2 <- data %>%
            group_by(Name) %>% 
                          summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
#   Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff            3           23           14
#2 Hank            4           16            6
#3  Tom            3            2            8
#4 Bill            6            2            3
#5 Mark            2            9            1

EDIT

The data created and the data showed initially differed in the original post. After the edit on OP's post (by @dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.

akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks, akrun. The second option worked great. The first option didn't seem to work out well? It gave me different values for Jeff than it should have been. Any ideas? Thanks – user3585829 Mar 18 '15 at 20:32
  • @user3585829 Did you mean the data.table solution is not working as expected? – akrun Mar 18 '15 at 20:33
  • @user3585829 I noted that the data you created `data <- data.frame(...)` and the input data showed had some difference in values. For example, 5th element for 2nd column is 5. – akrun Mar 18 '15 at 20:36
  • @user3585829 Anyway, I tried it on both datasets( the one showed, and other created), and the code gives the same expected output. I am using `data.table_1.9.5` i.e. the devel version – akrun Mar 18 '15 at 20:45