1

I'm looking at oxygen concentrations in relation to bottom trawling at different depths in inner Danish waters for the last 40 years. I have a data frame (Oxy) with four columns: ID, Date, Depth and Oxygen. The Oxygen has been measured throughout many years (Date), at many different locations (ID) and at many different Depths down the water column, spanning from 0-50 meters.

I would like to create a data frame where I have the Oxygen for the last 4 meters (Depth) (from the bottom and 4 meters up in the water column) for each station and the corresponding date. The measurements are not by every whole meter but at varying depths. The Depth where the Oxygen has been measured are not the same for each ID, so for one ID it has been sampled at 0.2, 0.4, 0.6 meters etc. and for another ID it has been sampled at 0.67, 1.3, 1.55 meters etc. The Depth for each ID also varies, so for one station the deepest measurement is at 30 meters and for another one it's 46 meters.

I have about 4 million rows, so this is just an output of my data:

ID         Date          Depth    Oxygen
------    ----------     -----    ------
957001    2002-01-14     1.20     12.10
967503    2002-01-28     2.00     11.60
957001    2002-01-22    25.00      7.80
965206    2002-01-28     5.40     11.70
953001    2002-01-31    23.60     10.30
941101    2002-01-22     8.67     12.00
940201    2002-01-17     5.00     11.70
965404    2002-01-30    38.80      9.40
952003    2002-01-08    23.40      6.30
957101    2002-01-15     6.00     11.60

I have been searching on google for an answer but can't seem to find the right one. I can extract the highest value or the top 5 highest values by using arrange(), group_by() and slice(). However, that wouldn't work for my data frame because the measurement intervals vary in depth and it needs to be similar for all ID's and Dates.

I imagine that it could be something like; take the highest value and then keep the values that are within -4 from that highest value.

So, I need to end up with all the deepest (last 4 meters for Depth) measurements for Oxygen dependent on ID and Date.

It would look something like this:

ID         Date          Depth     Oxygen
------    ----------     -----     ------
957001    2002-01-14     30.20      2.10
967503    2002-01-28     28.00      1.60
957001    2002-01-22     29.00      7.80
965206    2002-01-28     30.40      5.70
953001    2002-01-31     23.60     10.30
941101    2002-01-22     28.67      7.00
940201    2002-01-17     30.00      8.70
965404    2002-01-30     38.80      9.40
952003    2002-01-08     23.40      6.30
957101    2002-01-15     46.00      1.60
  • Erik, when you said "last 4 meters from the bottom", Do we know the bottom depth at each station?. Is it registered in each station sample? – Ric Nov 28 '22 at 14:34
  • Can you confirm your example output? e.g., the greatest depth for `967503` is `2.00` in your starting data but `28.00` in the output? – zephryl Nov 28 '22 at 14:39
  • @RicVillalba There are some stations that don't have samples at bottom depth, and I will sort them out. But yes, for each station sample there is an oxygen measurement registered at the bottom. – Erik Christensen Nov 28 '22 at 19:40
  • @zephryl It's was just to illustrate that I start with measurements at all depths and that I want to end with measurements at bottom depth. So it was an example of my data. – Erik Christensen Nov 28 '22 at 21:43

1 Answers1

0

Just as you said, filter to Depth greater than max() - 4 within each ID. Using dplyr:

library(dplyr)

oxy %>%
  group_by(ID) %>%
  filter(Depth >= max(Depth) - 4) %>%
  ungroup() 
# A tibble: 9 × 4
      ID Date       Depth Oxygen
   <dbl> <date>     <dbl>  <dbl>
1 967503 2002-01-28  2      11.6
2 957001 2002-01-22 25       7.8
3 965206 2002-01-28  5.4    11.7
4 953001 2002-01-31 23.6    10.3
5 941101 2002-01-22  8.67   12  
6 940201 2002-01-17  5      11.7
7 965404 2002-01-30 38.8     9.4
8 952003 2002-01-08 23.4     6.3
9 957101 2002-01-15  6      11.6
zephryl
  • 14,633
  • 3
  • 11
  • 30
  • Thank you very much. It worked out perfectly! I just added the Date to the code as well, because I need the oxygen at the last 4 meters for each station every day it was measured. `group_by(StedID, Date)` – Erik Christensen Nov 28 '22 at 21:49