
Some values in my data frame differ greatly from the rest of the group: rows are missing, so the data is not continuous, and that makes some of my diffVal entries unusually large.

> df
                  Date diffVal1 diffVal2
1  2017-05-31 04:01:00      718      483
2  2017-05-31 05:01:00      704      477
3  2017-05-31 06:01:00      741      478
4  2017-05-31 07:01:00      874      483
5  2017-05-31 08:01:00      907      495
6  2017-05-31 09:01:00      887      510
7  2017-05-31 10:01:00     2922      514
8  2017-05-31 13:01:00     1012      529
9  2017-05-31 14:01:00      979      539
10 2017-05-31 15:01:00      886      485
11 2017-05-31 16:01:00      818      471

You can see that rows are missing for the Dates at hours 11 and 12. I need to smooth the unusual value back to a normal range.

I'm trying to set the unusual value to NA, but the problem is knowing which values are unusual when working with a big data frame. For my example data frame I can set any value greater than 1200 to NA (which is not a good idea in general, because the threshold is arbitrary) and then interpolate the NA values with na.approx(); I also have to plot the result later.

library(zoo)  # for na.approx()
df$diffVal1 <- ifelse(df$diffVal1 > 1300, NA, df$diffVal1)
df$diffVal1 <- na.approx(df$diffVal1)
> df 
                  Date diffVal1 diffVal2
1  2017-05-31 04:01:00    718.0      483
2  2017-05-31 05:01:00    704.0      477
3  2017-05-31 06:01:00    741.0      478
4  2017-05-31 07:01:00    874.0      483
5  2017-05-31 08:01:00    907.0      495
6  2017-05-31 09:01:00    887.0      510
7  2017-05-31 10:01:00    949.5      514
8  2017-05-31 13:01:00   1012.0      529
9  2017-05-31 14:01:00    979.0      539
10 2017-05-31 15:01:00    886.0      485
11 2017-05-31 16:01:00    818.0      471
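One way to avoid a hard-coded cutoff like 1300 (this is my own suggestion, not part of the original question, and `smooth_outliers` is a hypothetical helper name) is to flag values that lie more than a few median absolute deviations from the median, then interpolate them exactly as above:

```r
library(zoo)  # for na.approx()

# Hypothetical helper: flag values more than `k` MADs from the median,
# then linearly interpolate over them. k = 3 is a common rule of thumb.
smooth_outliers <- function(x, k = 3) {
  center <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)          # scaled median absolute deviation
  x[abs(x - center) > k * spread] <- NA   # mark outliers as NA
  na.approx(x, na.rm = FALSE)             # interpolate the NAs
}

diffVal1 <- c(718, 704, 741, 874, 907, 887, 2922, 1012, 979, 886, 818)
smooth_outliers(diffVal1)
# the 2922 becomes 949.5, the mean of its neighbours, without any manual threshold
```

This only decides *which* values to replace; it does not fix the missing rows, which is the second part of the question.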

What should I do about this problem? And how can I add the missing rows by Date so that they can be interpolated as well?

Thank you so much for your help.

  • is the value at 10:01:00 cumulating the values at 11:01:00 and 12:01:00 ? – moodymudskipper Jun 09 '17 at 10:18
  • I would sort the dataset with time descending and compute cumulated values, then find the cumulated value at the missing times, then go from the cumulated data back to the original data; you'll have 974 for 10, 11 and 12 – moodymudskipper Jun 09 '17 at 10:27
  • @Moody_Mudskipper yes it is, because there is one value per row: I already computed the differences of the values into a new column and dropped the original value column (diffVal2 in the example is missing the same Dates as diffVal1 because of the merge). I want to know a function for adding these missing times to the df that works for a big data frame, not only for my example. Thank you for your help. – Mr.Takeo Jun 09 '17 at 15:51

1 Answer


Tell me if that works for you:

Data preparation:

df <- read.table(text="Date; diffVal1; diffVal2
1;  2017-05-31 04:01:00;      718;      483
2;  2017-05-31 05:01:00;      704;      477
3;  2017-05-31 06:01:00;      741;      478
4;  2017-05-31 07:01:00;      874;      483
5;  2017-05-31 08:01:00;      907;      495
6;  2017-05-31 09:01:00;      887;      510
7;  2017-05-31 10:01:00;     2922;      514
8;  2017-05-31 13:01:00;     1012;      529
9;  2017-05-31 14:01:00;      979;      539
10; 2017-05-31 15:01:00;      886;      485
11; 2017-05-31 16:01:00;      818;      471",sep=";",header=TRUE,stringsAsFactors=FALSE)

df$Date     <- as.POSIXct(df$Date)
df$diffVal1 <- as.numeric(df$diffVal1)
df$diffVal2 <- as.numeric(df$diffVal2)
all_dates <- data.frame(Date = seq(min(df$Date),max(df$Date),by=3600))

Work and result:

df2 <- df
df2 <- df2[order(df2$Date,decreasing=TRUE),]
df2$Val1_total <-  cumsum(df2$diffVal1)
df2 <- merge(df2,all_dates,all.y = TRUE)

df2$Val1_total[is.na(df2$Val1_total)] <- approx(x = df2$Date, y = df2$Val1_total, xout = df2$Date[is.na(df2$Val1_total)])$y
df2$diffVal1 <- c(-diff(df2$Val1_total),tail(df2$diffVal1,1))

# > df2
# Date diffVal1 diffVal2 Val1_total
# 1  2017-05-31 04:01:00      718      483      11448
# 2  2017-05-31 05:01:00      704      477      10730
# 3  2017-05-31 06:01:00      741      478      10026
# 4  2017-05-31 07:01:00      874      483       9285
# 5  2017-05-31 08:01:00      907      495       8411
# 6  2017-05-31 09:01:00      887      510       7504
# 7  2017-05-31 10:01:00      974      514       6617
# 8  2017-05-31 11:01:00      974       NA       5643
# 9  2017-05-31 12:01:00      974       NA       4669
# 10 2017-05-31 13:01:00     1012      529       3695
# 11 2017-05-31 14:01:00      979      539       2683
# 12 2017-05-31 15:01:00      886      485       1704
# 13 2017-05-31 16:01:00      818      471        818
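To address the comment about applying this to a big data frame, the steps above could be wrapped into one reusable function. This is a sketch under the same assumptions as the answer (hourly spacing, and the value recorded after a gap cumulates the missing hours); the name `fill_hourly_gaps` is mine, not an established API:

```r
# Sketch: cumulate the diffs in reverse time order, insert the missing
# hourly timestamps, interpolate the cumulated totals, then difference
# back to per-hour values.
fill_hourly_gaps <- function(df, date_col = "Date", val_col = "diffVal1") {
  df <- df[order(df[[date_col]], decreasing = TRUE), ]
  df$total <- cumsum(df[[val_col]])
  all_dates <- data.frame(seq(min(df[[date_col]]), max(df[[date_col]]), by = 3600))
  names(all_dates) <- date_col
  df <- merge(df, all_dates, all.y = TRUE)   # adds the missing rows, sorts ascending
  miss <- is.na(df$total)
  df$total[miss] <- approx(df[[date_col]], df$total,
                           xout = df[[date_col]][miss])$y
  df[[val_col]] <- c(-diff(df$total), tail(df$total, 1))
  df$total <- NULL
  df
}
```

Run on the question's df, this should reproduce the table above (974 for hours 10 through 12); other columns such as diffVal2 keep NA in the inserted rows, as in the output shown.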
  • Sorry for being gone, I was on days off. It works well enough for me, but I'll try a little more to find the best way to bin the NA values. If I can't find another way I'll use yours. Thank you so much. – Mr.Takeo Jun 12 '17 at 04:11
  • I tried it with my whole data (more than 1k rows) and it doesn't work at all when there are many separate runs of NA values. – Mr.Takeo Jun 20 '17 at 01:40
  • Did you make it work in the end? If not, you should add to your question a part of your data that doesn't work. Also, the state of df2 just after the merge is a situation others have encountered here; if you look up "interpolate NA" or something like that you should find it. – moodymudskipper Jun 20 '17 at 07:02