Combining Rows - Summing Certain Columns and Not Others in R

Question

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric.

I want to combine the rows of repeated names into one column and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply and have read a lot of the Q&As here and can't seem to find a solution

Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
                       "Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)

The data looks like this:

  Name data.Point.1 data.Point.2 data.Point.3
1 Jeff            3            6            2
2 Hank            4            9            2
3  Tom            3            2            8
4 Jeff            3            5            6
5 Hank            4            7            4
6 Jeff            3            4            3
7 Jeff            3            8            3
8 Bill            6            2            3
9 Mark            2            9            1

I'd like to get it to look like this (summing columns 3 and 4 and leaving column 1 alone. I'd like it to look like this:

  Name data.Point.1 data.Point.2 data.Point.3
1 Jeff            3           23           14
2 Hank            4           16            6
3  Tom            3            2            8
8 Bill            6            2            3
9 Mark            2            9            1

Any help would great. Thanks!

dimitris_ps · Answer 1 · 2015-03-18T20:57:11.863

3

Another solution which is a bit more straightforward is by using the library dplyr

library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>%  # group the columns you want to "leave alone"
  summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4

if you want to sum over all other columns except those you want to "leave alone" then replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum))

edited Mar 18 '15 at 20:57

answered Mar 18 '15 at 19:22

dimitris_ps

5,849
3
29
55

Thanks! I'll try this out. that was an example data set for the question. My true data set has about 50 columns. Is there a way to use the summarize function over a large list of columns? Every time I have used summarize within ddply I had to type out each column and re-name it, as you did. – user3585829 Mar 18 '15 at 19:58
I keep getting an error (I have loaded the plyr package): Error: could not find function "%>%" – user3585829 Mar 18 '15 at 20:37
its the `dplyr` package – dimitris_ps Mar 18 '15 at 20:39
@akur the data.frame `data` is different from what @user3585829 states it is. – dimitris_ps Mar 18 '15 at 20:40
@dimitris_ps yes, just noticed it now. Sorry, I didn't check that part before – akrun Mar 18 '15 at 20:41
Ah. dplyr package. Missed that. Thanks! – user3585829 Mar 18 '15 at 21:15

score 1 · Answer 2 · answered Mar 18 '15 at 21:09

I'd do it this way using data.table:

setDT(data)[, c(data.Point.1 = data.Point.1[1L], 
                lapply(.SD, sum)), by=Name, 
                .SDcols = -"data.Point.1"]
#    Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff            3           23           14
# 2: Hank            3           16            6
# 3:  Tom            3            2            8
# 4: Bill            3            2            3
# 5: Mark            3            9            1

We group by Name, and for each group, get first element of data.Point.1, and for the rest of the columns, we compute sum by using base function lapply and looping it through the columns of .SD, which stands for Subset of Data. The columns in .SD is provided by .SDcols, to which we remove data.Point.1, so that all the other columns are provided to .SD.

Check the HTML vignettes for detailed info.

akrun · Answer 3 · 2015-03-19T06:57:46.703

You could try

library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
    data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
#    Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff            3           23           14
#2: Hank            4           16            6
#3:  Tom            3            2            8
#4: Bill            6            2            3
#5: Mark            2            9            1

or using base R

data$Name <- factor(data$Name, levels=unique(data$Name))
res <- do.call(rbind,lapply(split(data, data$Name), function(x) {
                x[3:4] <- colSums(x[3:4])
                x[1,]} ))

Or using dplyr, you can use summarise_each to apply the function that needs to be applied on multiple columns, and cbind the output with the 'summarise' output for a single column

library(dplyr)
res1 <- data %>%
            group_by(Name) %>% 
            summarise(data.Point.1=data.Point.1[1L])

res2 <- data %>%
            group_by(Name) %>% 
                          summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
#   Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff            3           23           14
#2 Hank            4           16            6
#3  Tom            3            2            8
#4 Bill            6            2            3
#5 Mark            2            9            1

EDIT

The data created and the data showed initially differed in the original post. After the edit on OP's post (by @dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.

Thanks, akrun. The second option worked great. The first option didn't seem to work out well? It gave me different values for Jeff than it should have been. Any ideas? Thanks — user3585829, Mar 18 '15 at 20:32
@user3585829 Did you mean the data.table solution is not working as expected? — akrun, Mar 18 '15 at 20:33
@user3585829 I noted that the data you created `data <- data.frame(...)` and the input data showed had some difference in values. For example, 5th element for 2nd column is 5. — akrun, Mar 18 '15 at 20:36
@user3585829 Anyway, I tried it on both datasets( the one showed, and other created), and the code gives the same expected output. I am using `data.table_1.9.5` i.e. the devel version — akrun, Mar 18 '15 at 20:45

Combining Rows - Summing Certain Columns and Not Others in R

3 Answers3

EDIT