2

I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").

Columns 13:65 contain some information I would like to sum, and only retain the one triplicate with the higher rowSums value. So my data frame would be something like this:

  ID    Trip     Acet    Cell   Fibe   Mega   Tera
1   4      A        2       4      9      8      3  
2   4      B        9       3      7      5      5  
3   4      C        1       2      4      8      6 
4   12     A        4       6      7      2      3 
5   12     B        6       8      1      1      2 
6   12     C        5       5      7      3      3 

I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...

So what I want is to know to keep on rows 2 and 6 (which have the highest rowSums among triplicates per animal), but for the whole data frame. What I want as a result is

    ID    Trip    Acet    Cell   Fibe   Mega   Tera
1   4      B       9       3      7      5      5  
2   12     C       5       5      7      3      3

REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.

czarniutki
  • 41
  • 5
  • I think since your example contains both numeric and character values, it is not a matrix, but a data frame. – www Jun 28 '19 at 02:48
  • Yes, you are completely right, sorry about that! I've corrected the text as well. Thank you! – czarniutki Jun 28 '19 at 04:00

2 Answers2

1

We can create the row sums separately and use that to find the row with the maximum row sums by using ave. Then use the logical vector to subset the rows of dataset

nm1 <- startsWith(names(df1), "V")

OP updated the column names. In that case, either an index

nm1 <- 3:7

Or select the columns with setdiff

nm1 <- setdiff(names(df1), c("ID", "Trip"))

v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
#  ID Trip V1 V2 V3 V4 V5
#2  4    B  9  3  7  5  5
#6 12    C  5  5  7  3  3

data

df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L, 
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L, 
    5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L, 
    1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)), 
    class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Ok, sorry, it was really my bad, the columns actually have names, I just used the generic "V"s as an example. I've corrected that in the example. What could I use instead of `nm1 <- startsWith(names(df1), "V") # I would need to identify the columns by position, as their names are long and all different` – czarniutki Jun 28 '19 at 04:30
  • @czarniutki Then, you can use an index if you know the column indexes, `v1 <- rowSums(df1[3:7], na.rm = TRUE)` – akrun Jun 28 '19 at 04:50
  • 1
    I've managed!! Thank you very, very much for your contribution and patience! – czarniutki Jun 28 '19 at 04:52
  • Yes, it took me a while but I managed to figure that out - sorry, still learning, but very much appreciate your help! – czarniutki Jun 28 '19 at 04:53
  • @czarniutki Glad that you were able to figure it out. I was away – akrun Jun 28 '19 at 04:53
1

Here is one way.

library(tidyverse)

dat2 <- dat %>%
  mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
  group_by(ID) %>%
  filter(Sum == max(Sum)) %>%
  select(-Sum) %>%
  ungroup()
dat2
# # A tibble: 2 x 7
#      ID Trip     V1    V2    V3    V4    V5
#   <int> <fct> <int> <int> <int> <int> <int>
# 1     4 B         9     3     7     5     5
# 2    12 C         5     5     7     3     3

Here is another one. This method makes sure only one row is preserved even there are multiple rows with row sum equals to the maximum.

dat3 <- dat %>%
  mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
  arrange(ID, desc(Sum)) %>%
  group_by(ID) %>%
  slice(1) %>%
  select(-Sum) %>%
  ungroup()
dat3
# # A tibble: 2 x 7
#      ID Trip     V1    V2    V3    V4    V5
#   <int> <fct> <int> <int> <int> <int> <int>
# 1     4 B         9     3     7     5     5
# 2    12 C         5     5     7     3     3

DATA

dat <- read.table(text = "  ID    Trip     V1     V2    V3    V4    V5
1   4      A       2      4     9     8     3  
2   4      B       9      3     7     5     5  
3   4      C       1      2     4     8     6 
4   12     A       4      6     7     2     3 
5   12     B       6      8     1     1     2 
6   12     C       5      5     7     3     3 ",
                  header = TRUE)
www
  • 38,575
  • 12
  • 48
  • 84
  • Ok, sorry, it was really my bad, the columns actually have names, I just used the generic "V"s as an example. I've corrected that in the example. What could I use instead of `nm1 <- startsWith(names(df1), "V") # I would need to identify the columns by position, as their names are long and all different` – czarniutki Jun 28 '19 at 04:31
  • Ok, I've used `mutate(Sum = rowSums(select(dat, range(5:65))))` and it worked beautifully. Thank you very, very much!!! – czarniutki Jun 28 '19 at 04:52