0

I am working with a large time series of oceanographic data which needs a lot of manipulation. I have several days of data missing and would like to interpolate them. Specifically date/depth/temperature. Here is an example of my df:

> tibble(df)
# A tibble: 351,685 x 9
   date       time     depthR    SV  temp salinity conduct density calcSV
   <date>     <times>  <dbl> <dbl> <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
 1 2021-11-17 07:50:18   0.5 1524.  19.7     37.8    51.0    27    1524.
 2 2021-11-17 07:50:22   0.5 1524.  19.9     37.6    50.9    26.8  1524.
 3 2021-11-17 07:50:23   1.1 1524.  19.9     37.6    50.9    26.8  1524.
 4 2021-11-17 07:50:24   1.5 1524.  19.9     37.6    50.9    26.8  1524.
 5 2021-11-17 07:50:25   2   1524.  19.9     37.6    50.9    26.8  1524.

Each date contains over 1000 lines of data and so my idea was to find the max depth of each day to therefore interpolate reasonable max depth values for the missing days between. So far, I have found the max depth per date:

group <- df %>% group_by(date) %>% summarise(max =max(depthR, na.rm=TRUE))
> tibble(group)
# A tibble: 40 x 2
   date         max
   <date>     <dbl>
 1 2021-11-17 685. 
 2 2021-11-18 695. 
 3 2021-11-19 136. 
 4 2021-11-20 138. 
 5 2021-11-21 142. 
 6 2021-11-22  26  
 7 2021-11-23 136. 
 8 2021-11-24 297. 
 9 2021-11-25 613. 
10 2021-11-26  81.1
# ... with 30 more rows

And then I managed to interpolate the missing dates by:

> group <- seq(min(group$date), max(group$date), by = "1 day")
> group <- data.frame(date=group)
> tibble(group)
# A tibble: 69 x 1
   date      
   <date>    
 1 2021-11-17
 2 2021-11-18
 3 2021-11-19
 4 2021-11-20
 5 2021-11-21
 6 2021-11-22
 7 2021-11-23
 8 2021-11-24
 9 2021-11-25
10 2021-11-26
# ... with 59 more rows

As you can see, the previous query was overwritten. So I tried creating a new df for the interpolated dates and tried merging them together. I got the error:

> library(stringr)
> group$combined <-  str_c(group$date, '', dateinterp$date)
Error: Assigned data `str_c(group$date, "", dateinterp$date)` must be compatible with existing data.
x Existing data has 40 rows.
x Assigned data has 69 rows.
i Only vectors of size 1 are recycled.

How can I insert these two matrices of differing length into the dataframe in chronological order without overwriting original data or conflicting? Following that, I'm not sure how I would proceed to interpolate the depths and temperatures for each date. Perhaps starting with something like the following:

depth = seq(1, 200, length.out = 100))

Eventually the date variable will be exchanged for geo coords. Any advice greatly appreciated.

EDIT: As requested by @AndreaM, an example of my data:

> dput(head(df))
structure(list(date = structure(c(18948, 18948, 18948, 18948, 
18948, 18948), class = "Date"), time = structure(c(0.326597222222222, 
0.326643518518519, 0.326655092592593, 0.326666666666667, 0.326678240740741, 
0.326712962962963), format = "h:m:s", class = "times"), depth = c(0.5, 
0.5, 1.1, 1.5, 2, 2.5), SV = c(1524.024, 1524.026, 1524.025, 
1524.008, 1524.016, 1524.084), temp = c(19.697, 19.864, 19.852, 
19.854, 19.856, 19.847), salinity = c(37.823, 37.561, 37.557, 
37.568, 37.573, 37.704), conduct = c(51.012, 50.878, 50.86, 50.876, 
50.884, 51.032), density = c(27, 26.755, 26.758, 26.768, 26.773, 
26.877), calcSV = c(1523.811, 1523.978, 1523.949, 1523.975, 1523.993, 
1524.124)), row.names = 100838:100843, class = "data.frame")
  • 1
    Please share some of your data by pasting the output of `dput(head(df))` into your question. This will make it easier for people to help you. – Andrea M May 03 '22 at 09:39
  • You could merge the original data with the complete dates list. This would give you a new dataset with complete dates and missing entries in the appropriate places. Then an imputation model can be used to predict the missing values. If you share your dataset it will be much easier to help. – George Savva May 03 '22 at 09:50
  • To me it is not clear if you want to put days that do not exist in your dataset ( I mean,there is not the row at all), or you have the row (the `date` and the `time`) but the NAs in the other variables. The first raises the question on what you want to put in the `time`, the second is way easier to think. – s__ May 03 '22 at 10:08
  • @s__ What I would like to do is interpolate a whole new set of temperatures by depth for each missing day. (And eventually for each dependent variable). I managed to create new rows for each missing date in a separate df but from here I am not sure how to proceed. Obviously each date row will need to be repeated many times dependent on the depth and interval I choose. – hydroceanog May 03 '22 at 10:22

1 Answers1

0

one approach, adapt to your case as appropriate:

library(dplyr)
library(lubridate) ## facilitates date-time manipulations

## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') + sample(1:10, 6),
                          value = rnorm(12)) %>%
    arrange(date)

## create vector of -only!- missing dates:
missing_dates <- 
    setdiff(
        seq.Date(from = min(patchy_data$date),
                 to = max(patchy_data$date),
                 by = '1 day'
                 ),
        patchy_data$date
    ) %>% as.Date(origin = '1970-01-01')

## extend initial dataframe with rows per missing date:
full_data <-
    patchy_data %>%
        bind_rows(data.frame(date = missing_dates,
                             value = NA)
                  ) %>%
        arrange(date)

## group by month and impute missing data from monthwise statistic:
full_data %>%
    mutate(month = lubridate::month(date)) %>%
    group_by(month) %>%
    ## coalesce conveniently replaces ifelse-constructs to replace NAs
    mutate(imputed = coalesce(.$value, mean(.$value, na.rm = TRUE)))

edit One possibility to granulate generated data (missing dates) with additional parameters (e. g. measuring depths) is to use expand.grid as follows. Assuming object names from previous code:

## depths of daily measurements:
observation_depths <- c(0.5, 1.1, 1.5) ## example

## generate dataframe with missing dates x depths:
missing_dates_and_depths  <- 
    setNames(expand.grid(missing_dates, observation_depths),
             c('date','depthR')
             )


## stack both dataframes as above:
full_data <-
    patchy_data %>%
        bind_rows(missing_dates_and_depths) %>%
        arrange(date)

  • Thanks @I_O. This works for me. inserts the missing date rows into the full data frame. The next part, although I can't get it to run, will only give me a single mean value for each day right? I need about 1000 lines for each depth increment. So perhaps trying to populate a new data frame would work? – hydroceanog May 04 '22 at 07:47
  • See edit, plz. However, as you already suggested, at one point it might get easier to generate a "lead" dataframe with all combinations needed (e. g. full time series x depths x ...) and `join_left` the patchy tables to the lead. –  May 04 '22 at 08:37