0

I have large data set of stream chemistry for several streams for long periods of time (7-20 years worth of data). I want to obtain a monthly TOC value for every year for each site but there are times when there is only 1 TOC value for a given month while other months have 2 or more. When I use summarize (example below) I get NA's for instances with only 1 TOC value.

df_TOC <- df %>% group_by(Site,Year, Month) %>% summarise(AvgTOC=mean(TOC))

I would like some help on how to summarize the data with a condition statement something along the lines of:

If TOC has 2 or more values per month, then calculate the mean; if not then paste the single value for that month.

Site  Month Year  TOC
AAA   Jan   1975  5
AAA   Feb   1975  7
AAA   Feb   1975  8
AAA   March 1975  10
AAA   March 1975  12
AAA   April 1975  13
BBB   Jan   1975  6
BBB   Jan   1975  3
BBB   Feb   1975  6
BBB   Feb   1975  12
BBB   Feb   1975  17
BBB   March 1975  4
BRC
  • 13
  • 3
  • 4
    Well, the mean of a single value is the value itself so using `mean` on it will do no harm. Or am I missing something? – Martin Schmelzer Mar 29 '18 at 13:44
  • Do you actually have some NAs in your original data set that you don't realise are there as this would account for why `mean` is returning NA? – Relasta Mar 29 '18 at 13:47
  • Thank you for your quick replies. The issue is that there are instances where there is a TOC value in the original data set but returns an NA when I run the summarize script. That's the part I am confused on. So I though that maybe a conditional summarize would do the trick. – BRC Mar 29 '18 at 14:16
  • 1
    Your example is non reproducible then. Please add a reproducible example. – catastrophic-failure Mar 29 '18 at 14:18
  • Probably you have some `NA` values. Omit them by using mean's `na.rm` argument: `AvgTOC = mean(TOC, na.rm = TRUE)`. If you want to check them out, find them with `filter(your_data, is.na(TOC))`. – Gregor Thomas Mar 29 '18 at 14:55

2 Answers2

1

Your code does exactly what you want already. The mean value of a single value is the value itself.

myData = read.table(textConnection("Site  Month Year  TOC
AAA   Jan   1975  5
AAA   Feb   1975  7
AAA   Feb   1975  8
AAA   March 1975  10
AAA   March 1975  12
AAA   April 1975  13
BBB   Jan   1975  6
BBB   Jan   1975  3
BBB   Feb   1975  6
BBB   Feb   1975  12
BBB   Feb   1975  17
BBB   March 1975  4"), header = TRUE)

library(dplyr)
df_TOC = myData %>% group_by(Site,Year, Month) %>% summarise(AvgTOC=mean(TOC))
> df_TOC 
## A tibble: 7 x 4
## Groups:   Site, Year [?]
#    Site  Year  Month   AvgTOC
#  <fctr> <int> <fctr>    <dbl>
#1    AAA  1975  April 13.00000 #<<<
#2    AAA  1975    Feb  7.50000
#3    AAA  1975    Jan  5.00000 #<<< These ones come from an single sample each
#4    AAA  1975  March 11.00000
#5    BBB  1975    Feb 11.66667
#6    BBB  1975    Jan  4.50000
#7    BBB  1975  March  4.00000 #<<<
catastrophic-failure
  • 3,759
  • 1
  • 24
  • 43
0

I'm a big fan of the base R function aggregate and from the comments it seems you may have NA values that are leading to the NAs - as the mean of a single value should simply be that value. Try out:

aggregate(x = list(TOC_avg = myData$TOC), by = myData[,-4], mean, na.rm = T)

Where we use list() to name the resulting aggregated value, and column 4 in the data is the original TOC column.

Evan Friedland
  • 3,062
  • 1
  • 11
  • 25