0

I have two sets of data I am working with. The first dataset is a set of individual specimens I have taken measurements from, and the other is a set of reported mean measurements for a given population of a species in previous studies. The first dataset looks like this:

data.frame(Species = c('Species1', "Species1", 'Species1', 'Species2', 'Species3', 'Species3'),
           Specimen = c('A1', 'B2', 'C3', 'D4', 'E5', 'F6'),
           Measurement1 = c(100, 110, 120, 130, 140,150),
           Measurement2 = c(1, 2, 3, 4, 5, 6))

and the other looks like this:

data.frame(Species = c('Species1','Species1', 'Species2', 'Species3'),
                  N = c(10, 10, 11, 12),
                  Measurement1 = c(100, 100, 110, 120),
                  Measurement2 = c(1, 2, 3, 4))

What I am trying to do is find an efficient way to recalculate the average value for a given species given all of the observations for that species. In the case of the example given above, the results would look something like this:

data.frame(Species=c('Species1','Species2','Species3'),
                  N=c(23,12,14),
                  Measurement1=c(101.3043,111.67,123.5714),
                  Measurement2=c(1.565,3,4.214))

I know aggregate() will calculate the mean value for a given data frame, but I don't know of any easy way to recalculate the mean of several summed mean values, or how to do it if the number of entries varies. I know the mean can be recalculated by hand using the formula

(Xx*Nx)+(Xy*Ny)+(Xc*Nc) /(Nx+Ny+Nc)

but I don't know how to write it in R in such a way that it can be done with varying numbers of entries specified by a grouping factor.

user2352714
  • 314
  • 1
  • 15
  • There seems to be some issue in your data. (The brackets are closed at wrong places.). Can you show how you do the calculation for the data shared? How do you get `101.3043` as `Measurement1` in output , also how is `N` 23? – Ronak Shah Nov 14 '20 at 08:06
  • I fixed the coding. For calculation I have been doing it all by hand in Excel, which is why I would like to try and find a machine learning solution. I get 101.3043 because there are 23 observations (3 in the first frame, 20 in the second split among two groups of 10) and so I get ((10*100)+(10*100)+(100+110+120))/23. – user2352714 Nov 14 '20 at 08:18

1 Answers1

1

You can combine the two datasets and then take the weighted mean :

library(dplyr)

data1 %>%
  mutate(N = 1) %>%
  select(-Specimen) %>%
  bind_rows(data2) %>%
  group_by(Species) %>%
  summarise(across(starts_with('Measurement'), weighted.mean, N), 
            N = sum(N))

#  Species  Measurement1 Measurement2     N
#  <chr>           <dbl>        <dbl> <dbl>
#1 Species1         101.         1.57    23
#2 Species2         112.         3.08    12
#3 Species3         124.         4.21    14
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • How would I do this for multiple columns that do not start with 'Measurement'? "Measurement1" is merely the name of a dummy variable I used for the purposes of this question. – user2352714 Nov 14 '20 at 20:01
  • 1
    You can specify column names in `across` : `summarise(across(c(col1, col2), weighted.mean, N)` – Ronak Shah Nov 14 '20 at 23:34