0

Essentially, I want to create a bar graph that's a sum of all rainfall each year. Readings are recorded about every hour, but sometimes there are multiple readings per hour. Unfortunately, when there are multiple readings in a single hour, PCP01(variable for rainfall in a given hour) displays the sum of the rainfall for that particular hour at that particular minute interval. This results in there being multiple sums of rainfall depending on how late in the hour it is. This messes with the plotting, so I really only want the greatest PCP01 value of each hour. Sometimes there are two PCP01 values in the same hour that have the same value and in this instance, I would want one of them to be removed. The data frame is ~231000 rows. I attempted creating an algorithm to remove the invalid rows, but I couldn't get it working. I fairly new to R, so I feel like there might be a simpler way of accomplishing this that I'm unaware of.

working.data[181121:181136, c(2,3,4,5,6,7,9,10)]


        date.year date.month date.day date.hour date.minute PCP01 SLP SPD
181121      2015          5       21        17          54  0.02  NA   0
181122      2015          5       21        18          54  0.02  NA   3
181123      2015          5       21        19          14  0.03  NA   5
181124      2015          5       21        19          21  0.04  NA   6
181125      2015          5       21        19          35  0.05  NA   5
181126      2015          5       21        19          49  0.07  NA   3
181127      2015          5       21        19          54  0.09  NA   3
181128      2015          5       21        20           9  0.02  NA   3
181129      2015          5       21        20          25  0.04  NA   0
181130      2015          5       21        20          34  0.05  NA   0
181131      2015          5       21        20          44  0.06  NA   5
181132      2015          5       21        20          52    NA  NA   3
181133      2015          5       21        20          54  0.09  NA   0
181134      2015          5       21        21          15  0.04  NA   0
181135      2015          5       21        21          30  0.05  NA   0
181136      2015          5       21        21          42  0.05  NA   3

This is a small portion of the dataset from lines 181121-181136. In this instance I would only want rows 181121, 181127, 181133, and 181136. If anyone needs more data from the data frame, I can give a few more portions.

Tom42
  • 29
  • 3

2 Answers2

0

Here's an approach with dplyr, taking the last row of every hour's data.

working %>%
  group_by(date.year, date.month, date.day, date.hour) %>%
  slice(n()) %>%
  ungroup()

## A tibble: 5 x 9
#     row date.year date.month date.day date.hour date.minute PCP01 SLP     SPD
#   <dbl>     <dbl>      <dbl>    <dbl>     <dbl>       <dbl> <dbl> <lgl> <dbl>
#1 181121      2015          5       21        17          54  0.02 NA        0
#2 181122      2015          5       21        18          54  0.02 NA        3
#3 181127      2015          5       21        19          54  0.09 NA        3
#4 181133      2015          5       21        20          54  0.09 NA        0
#5 181136      2015          5       21        21          42  0.05 NA        3

Or, you could explicitly get each hour's max rain row with this:

working %>%
  group_by(date.year, date.month, date.day, date.hour) %>%
  arrange(-PCP01) %>%
  slice(1) %>%
  ungroup()
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
0

Using distinct should be performant:

library(dplyr)

working.data %>%
  arrange(desc(PCP01)) %>% 
  distinct(date.year, date.month, date.day, date.hour)
Cole
  • 11,130
  • 1
  • 9
  • 24