
I have a weather dataset from 01 Nov 2007 to 18 May 2008, and the data is date-dependent.

I want to predict the temperature from 07 May 2008 to 18 May 2008 (roughly 10-15 observations). My dataset has around 200 rows.

I will be using decision trees/random forests, SVMs, and neural networks to make my predictions.

I've never handled data like this, so I'm not sure how to sample it. Ignoring the bias factor, can I sample training data from 01 Nov 2007 to 18 May 2008 and test data from 07 May 2008 to 18 May 2008? Is there a better way to handle this? Or would it be better to first sort my data by date, split the ordered data 80:20 into training and test sets, and then just output the required dates?
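For context, here is a sketch of what a chronological (non-random) split would look like, using a toy data frame in place of the real dataset (the `toy` frame, its `MinTemp` values, and the `cutoff` date are illustrative assumptions, not the asker's actual data):

```r
# Toy date-indexed data frame covering the same period as the question.
dates <- seq(as.Date("2007-11-01"), as.Date("2008-05-18"), by = "day")
toy <- data.frame(Date = dates, MinTemp = rnorm(length(dates), mean = 10, sd = 5))

# Hold out the final window (07-18 May 2008) as the test set;
# everything strictly before the cutoff is available for training.
cutoff <- as.Date("2008-05-07")
train <- toy[toy$Date < cutoff, ]
test  <- toy[toy$Date >= cutoff, ]

nrow(test)  # 12 daily observations in the held-out window
```

This keeps the test window completely outside the training data, which is the key difference from a random 80:20 split.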



install.packages("rattle")
install.packages("RGtk2")
library("rattle")

seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")

# weather.csv stores dates as "YYYY-MM-DD", which as.Date() parses directly
# (note: in strptime codes, %M is minutes and %D is a full-date shorthand,
# so format = "%Y/%M/%D" would not parse these dates)
dataset$Date <- as.Date(dataset$Date, format = "%Y-%m-%d")

dataset <- dataset[order(dataset$Date), ]
dataset <- dataset[1:200, ]
str(dataset)
'data.frame':   200 obs. of  24 variables:
 $ Date         : Date, format: "2007-11-01" "2007-11-02" "2007-11-03" ...
 $ Location     : chr  "Canberra" "Canberra" "Canberra" "Canberra" ...
 $ MinTemp      : num  8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
 $ MaxTemp      : num  24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
 $ Rainfall     : num  0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
 $ Evaporation  : num  3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
 $ Sunshine     : num  6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
 $ WindGustDir  : chr  "NW" "ENE" "NW" "NW" ...
 $ WindGustSpeed: int  30 39 85 54 50 44 43 41 48 31 ...
 $ WindDir9am   : chr  "SW" "E" "N" "WNW" ...
 $ WindDir3pm   : chr  "NW" "W" "NNE" "W" ...
 $ WindSpeed9am : int  6 4 6 30 20 20 19 11 19 7 ...
 $ WindSpeed3pm : int  20 17 6 24 28 24 26 24 17 6 ...
 $ Humidity9am  : int  68 80 82 62 68 70 63 65 70 82 ...
 $ Humidity3pm  : int  29 36 69 56 49 57 47 57 48 32 ...
 $ Pressure9am  : num  1020 1012 1010 1006 1018 ...
 $ Pressure3pm  : num  1015 1008 1007 1007 1018 ...
 $ Cloud9am     : int  7 5 8 2 7 7 4 6 7 7 ...
 $ Cloud3pm     : int  7 3 7 7 7 5 6 7 7 1 ...
 $ Temp9am      : num  14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
 $ Temp3pm      : num  23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
 $ RainToday    : chr  "No" "Yes" "Yes" "Yes" ...
 $ RISK_MM      : num  3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 0 ...
 $ RainTomorrow : chr  "Yes" "Yes" "Yes" "Yes" ...
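Once the data is split chronologically, the modelling step is the same regardless of which learner is used. A minimal sketch of that workflow, again on toy data (here `lm` is only a stand-in for whichever model you choose, e.g. `randomForest()`, `e1071::svm()`, or `nnet()`; the toy frame and its columns are assumptions for illustration):

```r
set.seed(42)
# Toy stand-in for the weather data: one row per day with two temperatures.
dates <- seq(as.Date("2007-11-01"), as.Date("2008-05-18"), by = "day")
toy <- data.frame(Date    = dates,
                  MinTemp = rnorm(length(dates), mean = 8,  sd = 4),
                  MaxTemp = rnorm(length(dates), mean = 20, sd = 5))

# Chronological split: last 12 days (07-18 May 2008) held out as the test set.
cutoff <- as.Date("2008-05-07")
train <- toy[toy$Date < cutoff, ]
test  <- toy[toy$Date >= cutoff, ]

# Fit on the training window only; lm() is a placeholder for RF/SVM/NN.
fit  <- lm(MaxTemp ~ MinTemp, data = train)
pred <- predict(fit, newdata = test)

# Evaluate on the held-out window, e.g. with RMSE.
rmse <- sqrt(mean((pred - test$MaxTemp)^2))
```

The point is that `predict()` only ever sees dates the model was never trained on, so the RMSE reflects genuine out-of-sample performance for the target window.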


nullUser
  • From the pure implementation aspect, your test data should be independent of the training data. Meaning, your training data should not cover the period 07 May 2008 to 18 May 2008. From the design aspect, you need to consider whether there's a seasonality difference. If season is the more variable factor, perhaps getting training data from previous years in the same period would make more sense for your training model. – Adam Quek Jun 10 '22 at 07:18
  • "your test data should be independent of the training data" oh I see, I didn't think of it from that perspective. Thank you – nullUser Jun 11 '22 at 08:51
  • @AdamQuek This means that I will have to do non-random sampling, right? – nullUser Jun 11 '22 at 15:03
  • Your testing data-set has been defined (7 to 18 May 2008), so just make sure those rows aren't in the training set. You may also consider randomising your training set into [training and validation sets](https://towardsdatascience.com/training-vs-testing-vs-validation-sets-a44bed52a0e1). Deciding how you want to split training and validation, though, depends on the context of the data. If you don't care much about seasonal variation, for example, then simple 80-20/70-30/60-40 random splits are fine. – Adam Quek Jun 12 '22 at 04:14
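The suggestion in the last comment can be sketched as follows: keep 07-18 May 2008 as the fixed test window, then randomly split the remaining rows 80:20 into training and validation sets (the toy frame and the 80:20 ratio are illustrative assumptions):

```r
set.seed(42)
# Toy stand-in for all rows BEFORE the test window (01 Nov 2007 - 06 May 2008).
dates <- seq(as.Date("2007-11-01"), as.Date("2008-05-06"), by = "day")
pretest <- data.frame(Date = dates, MinTemp = rnorm(length(dates), mean = 8, sd = 4))

# Random 80:20 split of the pre-test rows into training and validation sets.
idx        <- sample(nrow(pretest), size = floor(0.8 * nrow(pretest)))
training   <- pretest[idx, ]
validation <- pretest[-idx, ]
```

Randomising within the pre-test period is fine here because the test window itself stays chronologically separated; the validation set is only used for model selection and tuning.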

0 Answers