For my master's thesis I have to evaluate different gap-filling methods on an existing dataset. To do that, I need to add artificial gaps of different lengths (1 h, 5 h, ...) so that I can fill them with the different methods. Is there an easy function to do this?

Here is an example of the data frame:

   structure(list(DateTime = structure(c(1420074000, 1420077600, 
1420081200, 1420084800, 1420088400, 1420092000, 1420095600, 1420099200, 
1420102800, 1420106400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    `Dd 1-1` = c(0.0186269166666667, 0.0242605625, 0.00373020138888889, 
    0.000966965277777778, 0.0119253611111111, 0.0495888958333333, 
    0.02014125, 0.0306862638888889, 0.0324395694444444, 0.0191942152777778
    ), `Dd 1-3` = c(0.0242500833333333, 0.0349086388888889, 0, 
    0.00135595138888889, 0.0221090138888889, 0.0600941527777778, 
    0.0462282986111111, 0.0171887638888889, 0.0481975347222222, 
    0.0226582152777778), `Dd 1-5` = c(0.0212732152777778, 0.0284445347222222, 
    0.00276098611111111, 0.0142581875, 0.0276248958333333, 0.0328644027777778, 
    0.0495009166666667, 0.0173377777777778, 0.0384788194444444, 
    0.017663875), luecken = c(0.0186269166666667, 0.0242605625, 
    0.00373020138888889, 0.000966965277777778, 0.0119253611111111, 
    0.0495888958333333, 0.02014125, 0.0306862638888889, 0.0324395694444444, 
    0.0191942152777778)), row.names = c(NA, 10L), class = c("tbl_df", 
"tbl", "data.frame"))

1 Answer

If I understood your problem correctly, one possible solution is this:

set.seed(4) # make it reproducible

del <- sort(sample(1:nrow(df), 4, replace=FALSE)) # get 4 random indices from the total number of rows and sort them

del2 <-  del[diff(del) !=1] # drop those values that have a difference of 1 (meaning "connected")

df[del2, c(2:5)] <- NA # set columns 2 to 5 to NA for the indices we calculated above

   DateTime             `Dd 1-1` `Dd 1-3` `Dd 1-5`   luecken
   <dttm>                  <dbl>    <dbl>    <dbl>     <dbl>
 1 2015-01-01 01:00:00  0.0186    0.0243    0.0213  0.0186  
 2 2015-01-01 02:00:00  0.0243    0.0349    0.0284  0.0243  
 3 2015-01-01 03:00:00 NA        NA        NA      NA       
 4 2015-01-01 04:00:00  0.000967  0.00136   0.0143  0.000967
 5 2015-01-01 05:00:00  0.0119    0.0221    0.0276  0.0119  
 6 2015-01-01 06:00:00  0.0496    0.0601    0.0329  0.0496  
 7 2015-01-01 07:00:00  0.0201    0.0462    0.0495  0.0201  
 8 2015-01-01 08:00:00  0.0307    0.0172    0.0173  0.0307  
 9 2015-01-01 09:00:00 NA        NA        NA      NA       
10 2015-01-01 10:00:00  0.0192    0.0227    0.0177  0.0192 

Just to be clear: the step that cleans up the connected gaps is not entirely correct. For example, if the random numbers were 1 to 4, it would drop 2, 3 and 4. On large data it should still be a sufficient solution, as long as you are not planning to drop many values compared to the whole dataset.
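One further caveat with the filter above: `diff(del)` returns one element fewer than `del`, so the logical index is recycled when it is used to subset `del`. A minimal sketch of an alternative, assuming you simply want to keep every sampled index that is at least two rows away from the previously kept one. The names `thin_indices` and `min_dist` are hypothetical and not from any package:

```r
# Hypothetical helper: walk through the sorted indices and keep an index
# only if it is at least `min_dist` rows after the last index that was kept.
thin_indices <- function(idx, min_dist = 2) {
  idx <- sort(unique(idx))
  keep <- integer(0)
  for (i in idx) {
    if (length(keep) == 0 || i - keep[length(keep)] >= min_dist) {
      keep <- c(keep, i)
    }
  }
  keep
}

res_a <- thin_indices(1:4)              # 1 3
res_b <- thin_indices(c(2, 3, 7, 8, 9)) # 2 7 9
```

With `min_dist = 2` there is always at least one untouched row between two one-hour gaps. Some sampled indices are discarded, so you may end up with fewer gaps than you sampled.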

Now on to creating larger gaps (I will use 3 h, as your example data has only 10 rows):

set.seed(4)

del <- sort(sample(1:nrow(df), 3, replace=FALSE))

del2 <- del[diff(del) > 3] # set the difference to more than the maximum gap size wanted

del3 <- c(del2, del2 + 1, del2 + 2) # build a vector with +1 and +2 to get the indices connecting to the ones you have

del4 <- del3[del3 <= nrow(df)] # make sure it is not out of bounds (max index should be 10, even if a gap starts at line 10)

df[del4, c(2:5)] <- NA

    DateTime            `Dd 1-1` `Dd 1-3` `Dd 1-5` luecken
   <dttm>                 <dbl>    <dbl>    <dbl>   <dbl>
 1 2015-01-01 01:00:00   0.0186   0.0243   0.0213  0.0186
 2 2015-01-01 02:00:00   0.0243   0.0349   0.0284  0.0243
 3 2015-01-01 03:00:00  NA       NA       NA      NA     
 4 2015-01-01 04:00:00  NA       NA       NA      NA     
 5 2015-01-01 05:00:00  NA       NA       NA      NA     
 6 2015-01-01 06:00:00   0.0496   0.0601   0.0329  0.0496
 7 2015-01-01 07:00:00   0.0201   0.0462   0.0495  0.0201
 8 2015-01-01 08:00:00   0.0307   0.0172   0.0173  0.0307
 9 2015-01-01 09:00:00  NA       NA       NA      NA     
10 2015-01-01 10:00:00  NA       NA       NA      NA     
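
The same idea can be wrapped in a function so that the gap length becomes a parameter. The following is only a sketch under stated assumptions: `add_gaps`, `gap_len`, `n_gaps` and `cols` are hypothetical names, the small `df` is made-up example data, and instead of filtering sampled indices afterwards it draws the start positions so that the gaps can neither overlap nor touch.

```r
# Hypothetical helper (not from any package): set `n_gaps` blocks of
# `gap_len` consecutive rows to NA in the columns `cols`, with at least
# one untouched row between any two blocks.
add_gaps <- function(df, gap_len, n_gaps, cols = 2:ncol(df)) {
  n <- nrow(df)
  # rows left over after reserving the gaps and one separator row between them
  free <- n - n_gaps * gap_len - (n_gaps - 1)
  if (free < 0) stop("not enough rows for the requested gaps")
  # draw positions in a compressed index space, then shift the j-th start
  # by (j - 1) * gap_len so that consecutive starts differ by > gap_len
  pos <- sort(sample.int(free + n_gaps, n_gaps))
  starts <- pos + (seq_len(n_gaps) - 1) * gap_len
  rows <- unlist(lapply(starts, function(s) s:(s + gap_len - 1)))
  df[rows, cols] <- NA
  df
}

# Made-up hourly example data with 10 rows
df <- data.frame(
  DateTime = seq(as.POSIXct("2015-01-01 01:00:00", tz = "UTC"),
                 by = "hour", length.out = 10),
  a = runif(10),
  b = runif(10)
)

set.seed(4)
gappy <- add_gaps(df, gap_len = 3, n_gaps = 2, cols = 2:3)
```

Because the positions are constructed rather than filtered, you always get exactly `n_gaps` gaps of exactly `gap_len` rows, and the function stops with an error if they cannot fit.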
DPH
  • But I need to add these gaps more specifically. For example: I have hourly data and need to add gaps with a length of one hour, which means that two gaps in a row are not allowed. – Benjamin Mabrouk Nov 10 '20 at 14:45
  • @BenjaminMabrouk Your data is hourly only (one entry per hour)? And all gaps have to be exactly one hour long, but none of the gaps should be connected? – DPH Nov 10 '20 at 14:49
  • I've added a dput of the data frame. For example, I have 3 entries for a one-hour measurement. The created gaps are not allowed to connect to each other. – Benjamin Mabrouk Nov 10 '20 at 14:54
  • I have altered my answer according to your data and what I understood. Please let me know if that is what you are looking for. Note that it is not the best solution, but it should work sufficiently well on large data, depending on how much you want to delete. – DPH Nov 10 '20 at 15:02
  • Seems like this works for one-hour gaps. Is it possible to adapt this to 5-hour gaps that are also not connected to each other? – Benjamin Mabrouk Nov 10 '20 at 15:22
  • @BenjaminMabrouk I included a way to create larger gaps and make sure they do not connect as well. Let me know if that works. – DPH Nov 10 '20 at 15:35
  • Thank you, that worked as well. The last gaps I have to create are of length 1 day and 10 days. Do I have to add these like in del3 as well? – Benjamin Mabrouk Nov 10 '20 at 16:06
  • @BenjaminMabrouk For gaps of "days" it would be better to reduce your data.frame to unique dates and perform the operation on those (TRUE or FALSE for dates to be cleared). Since 1 day means 24 h and 10 days mean 240 h, it would be a lot of typing work to keep it in hours. In the end, the day version can be joined back to the original data.frame via a new date column, and then you set the columns to NA depending on whether the date should be cleared or not (TRUE / FALSE). – DPH Nov 10 '20 at 16:15
  • @BenjaminMabrouk 10-day gaps are hard to interpolate, depending on your data. Only if the data is extremely periodic or if you have explanatory variables can you expect meaningful results from gap-filling algorithms. – DPH Nov 10 '20 at 16:17
  • That is true, but this is exactly what the people I work with want me to do, to see if the results are still "good" or not. Thank you, you helped me a lot. :D – Benjamin Mabrouk Nov 10 '20 at 16:20
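
The day-based approach described in the comments could look roughly like the sketch below. It is only an illustration under assumptions: the data frame `hourly`, its `value` column and the helper column `Date` are made up, and `%in%` is used in place of an explicit join to map the selected dates back to the hourly rows.

```r
# Made-up hourly example data covering 30 full days
hourly <- data.frame(
  DateTime = seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 30 * 24),
  value = rnorm(30 * 24)
)

# New date column, so the selection can be done per day instead of per hour
hourly$Date <- as.Date(hourly$DateTime, tz = "UTC")
days <- unique(hourly$Date)

set.seed(4)
gap_days <- 10
# pick a random start day such that the whole gap fits into the data
start <- sample(seq_len(length(days) - gap_days + 1), 1)
clear <- days[start:(start + gap_days - 1)]

# every hourly row whose date is in `clear` is set to NA
hourly[hourly$Date %in% clear, "value"] <- NA
```

This creates a single 10-day gap (240 hourly rows). For several day-long gaps you would pick several start days and make sure the resulting date ranges do not touch.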