I have some time series data called `dat`, and I am trying to split it into training and test sets on a rolling basis.

Say we have 100 days in total. I want to train the model on the first 20 days and test on the next 10 days (so using 30 days for train and test). Then move forward one day: train from day 2 until day 22 (training on 20 days) and test on the next 10 days (22 - 32). Then do the same but begin on day 3, train until day 23, and test on the next 10 observations until day 33. Keep going until the final model begins on day 70, trains until day 90, and tests on the last 10 observations.

I am trying to make it so that the number of days can change, i.e. the total can be 1000, 1250, 87 days, etc.

I have a function which trains a logistic model on some data, but its training window expands as the days increase, which is not exactly what I am after.

If I can create the different training and test splits, then using the `rollapply` function might give the results I am after.

EDIT: I am not sure whether it would be better, or more interesting, to train on the first 20 days and then test on just the next 1 day instead of 10 days.

Code:

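myfun <- function(model_len, dat, ...){
  # Convert the xts object to a plain data frame with known column names
  dat <- data.frame(dat)
  names(dat) <- c("y", "x1", "x2", "x3")

  # Fit on rows 1..model_len (an expanding window). Without a `family`
  # argument, glm() fits a Gaussian (linear) model.
  fit <- glm(y ~ x1 + x2 + x3, data = dat[1:model_len, ])
  # Predict the single observation that follows the training rows
  predict(fit, dat[model_len + 1, ])
}

sapply(1:50, myfun, dat = dat)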

Data:

dat <- structure(c(0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 
1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 
0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 
1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 
0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 
0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 
1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 
1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 
1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 
0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 
0, 1, 1, 1, 1, 1, 1157.4779907, 1161.2739868, 1165.064978, 1162.5039794, 
1152.5029784, 1143.5659789, 1131.9999755, 1115.114978, 1101.3089843, 
1088.9449828, 1077.7859863, 1067.7619873, 1059.9439942, 1058.2339967, 
1062.8999879, 1065.9739869, 1071.7789918, 1084.3059937, 1094.9029908, 
1101.5380006, 1106.801001, 1106.7830079, 1105.7230103, 1105.3360108, 
1104.5960206, 1104.4260255, 1106.363025, 1109.688025, 1111.763025, 
1113.7510255, 1118.2270265, 1126.2330201, 1131.9140137, 1132.8030029, 
1133.0679931, 1131.1919921, 1123.4999877, 1109.6529845, 1098.5239806, 
1085.2169738, 1070.7239746, 1058.9449829, 1046.018982, 1037.3779847, 
1030.1209901, 1023.8139955, 1019.6099977, 1018.9979982, 1016.8410036, 
1018.3280031, 1021.1230043, 1020.8710024, 1024.0220033, 1030.0970094, 
1034.7910035, 1040.7799927, 1047.371991, 1052.5719849, 1051.4059814, 
1051.5269836, 1052.2799865, 1052.3579894, 1050.2929931, 1046.6079956, 
1041.8380005, 1035.4400025, 1032.9650025, 1031.6990113, 1035.0920167, 
1041.2500184, 1047.0030091, 1053.8240052, 1062.1109986, 1066.3029907, 
1072.0419922, 1077.5289917, 1079.3439941, 1081.8229858, 1083.4049804, 
1083.0979735, 1081.2649779, 1079.0049803, 1075.0169798, 1073.8739867, 
1074.1959837, 1078.2869871, 1085.5799925, 1091.5880003, 1098.3030028, 
1102.7200072, 1106.8830077, 1112.3160033, 1120.2160033, 1126.9150023, 
1133.6280028, 1136.9040038, 1140.320996, 1143.1609985, 1146.4569946, 
1149.8369995, 1153.297998, 1152.7800049, 1150.6940064, 1147.6130005, 
1143.8229981, 1140.1619995, 1135.5619995, 1129.0449951, 1124.4880005, 
1122.7390015, 1122.5960084, 1125.3989991, 1128.9430054, 1136.8930054, 
1144.3530029, 1151.173999, 1158.3080078, 1167.6070068, 1173.8760009, 
1178.3499999, 1183.494995, 1193.018994, 1203.9989867, 1212.4839843, 
1217.4519897, 1221.0399902, 1222.8859863, 1225.2989868, 1229.2179931, 
1233.0979858, 1235.0249878, 1234.4389893, 1232.6299927, 1230.7069947, 
1230.6179932, 1232.1449952, 1234.6289918, 1234.0659913, 1232.0999879, 
1229.8249879, 1228.1249879, 1224.0649903, 1220.2369874, 1215.8649903, 
1214.1689942, 1214.8499878, 1213.7549926, 1217.246997, 1220.5099975, 
1222.2329955, 1221.1559935, 1219.641992, 1216.0529905, 1211.9979856, 
1206.3969847, 1199.9509886, 1193.1179808, 1185.7209715, 1179.0619749, 
1172.8479857, 1169.2699828, 1167.7309814, 1169.2739868, 1169.3999878, 
1170.2729858, 1171.0019897, 1172.7689941, 1174.7, 1176.7939942, 
1180.7199952, 1184.6089966, 1187.7949951, 1185.9269897, 1185.0529907, 
1182.6129883, 1178.0299805, 1168.1029786, 1156.5709717, 1148.2319702, 
1137.9259643, 1130.0429687, 1121.3169677, 1113.2949707, 1107.2059692, 
1102.4249755, 1098.911975, 1095.860974, 1097.485974, 1093.6249755, 
1086.4079772, 1077.9009704, 1074.0089783, 1072.2119812, 1068.344989, 
1062.2379822, 1057.449994, 1061.7179994, 1060.4010072, 1059.8690125, 
1061.7240113, 1061.7080201, 1058.3970215, 1057.8680176, 1058.2380127, 
1056.2290161, 1053.2240112, 1047.6460082, 1041.7940063, 1040.0410034, 
1040.6190063, 1045.6369994, 1050.1010009, 1128.81199335, 1132.72894074524, 
1136.05951315045, 1133.75860942184, 1126.33398461976, 1121.97836475121, 
1114.98804010824, 1104.18156200269, 1097.85760647863, 1093.48449548066, 
1089.54311267298, 1087.65328775174, 1087.83107177539, 1088.49478389202, 
1089.82480075944, 1091.87386411569, 1093.27921086657, 1096.47071830785, 
1100.97350704044, 1102.6227005604, 1102.82339384036, 1099.6516439508, 
1097.67720586025, 1097.0346199688, 1096.8465665432, 1098.06499020575, 
1100.72546732901, 1106.37447415482, 1111.91023852103, 1114.41117237617, 
1117.75201214987, 1120.7832448975, 1122.20674347869, 1120.07466752834, 
1117.94469547802, 1115.36710590868, 1109.05404401262, 1100.7222309638, 
1096.19725287201, 1087.52132174134, 1079.62024328978, 1075.06498573838, 
1068.53212719186, 1063.28239822121, 1059.64979029538, 1056.61743493392, 
1051.89577236878, 1048.42474757175, 1046.82620161254, 1044.26846536373, 
1043.14861247194, 1041.82684176033, 1041.46047397363, 1044.57471778567, 
1047.19426428227, 1051.05194873158, 1053.13842609047, 1054.50142846281, 
1051.21367146635, 1048.35332113622, 1047.56157998039, 1045.89381512512, 
1043.17345339892, 1042.61503488473, 1040.8783653719, 1039.24423257458, 
1040.09811147224, 1041.49734266536, 1042.67950374485, 1046.49669481677, 
1051.36081397707, 1055.8274040745, 1060.05336092454, 1061.8797055984, 
1063.77402125569, 1065.18506361229, 1065.29696088731, 1066.65724613614, 
1066.94988745651, 1068.16322588922, 1069.21815580453, 1069.83166801363, 
1068.92578972661, 1068.81857632408, 1070.35871095988, 1075.03883372561, 
1081.15799613269, 1086.72961878672, 1091.50584604513, 1094.58719261226, 
1097.09031664919, 1100.22361887307, 1103.94707859945, 1106.8845033995, 
1111.19264545669, 1115.10382303224, 1120.66155045774, 1125.17569412844, 
1129.42943430668, 1132.1180628489, 1134.34300733948, 1133.43510749763, 
1132.00890306928, 1129.33948182459, 1127.89952841272, 1126.73290894484, 
1126.80215199772, 1124.52480561698, 1124.50054032013, 1125.99287400392, 
1128.66498590831, 1130.96736496466, 1133.15142772993, 1137.94462318423, 
1142.78989202382, 1146.70132945013, 1151.6631122644, 1155.87424490588, 
1158.8347892958, 1161.3181459343, 1165.5259415596, 1173.38822864916, 
1181.98934506353, 1190.21226039081, 1194.81109273454, 1197.18527342649, 
1199.09715310016, 1201.08885375729, 1203.47563187564, 1205.40271083986, 
1207.24721647416, 1210.57795500043, 1213.91433880992, 1217.26535187564, 
1219.20293598272, 1220.70837160341, 1222.74566726023, 1221.94893752116, 
1220.47665680486, 1218.61792387106, 1217.58479016906, 1216.06433348629, 
1215.23248801141, 1214.29415629603, 1214.89947702975, 1217.46333121739, 
1218.76682576811, 1221.6747517902, 1223.33620352446, 1222.84608328404, 
1220.3845515427, 1217.15554472911, 1212.80167770729, 1208.2329423066, 
1204.08123494406, 1201.53635399701, 1197.84907704491, 1195.70439885016, 
1193.49731600729, 1189.93090962564, 1187.19653451844, 1185.66257561192, 
1185.77756793459, 1183.90255822654, 1182.89945696687, 1183.06617763669, 
1182.8208264332, 1183.94646343956, 1184.8534641596, 1185.84933033488, 
1187.20748792203, 1188.70677011993, 1186.75278639422, 1183.95251873763, 
1180.62084752452, 1176.63980928409, 1167.55220563799, 1159.14913329151, 
1154.47587831137, 1148.54960418648, 1145.95250178776, 1143.07035314131, 
1137.82269769928, 1133.88338944221, 1130.76687940009, 1128.18812336199, 
1120.80925075608, 1118.40550744598, 1113.93545635589, 1104.9968430839, 
1098.44571145686, 1096.38135988954, 1093.86884942387, 1090.43277224064, 
1085.63821926534, 1082.79744209722, 1083.80625856415, 1083.6723314628, 
1082.00354027587, 1077.87272739245, 1073.8896151646, 1071.01060743464, 
1070.41054586943, 1069.56096911996, 1064.84087682282, 1061.11888950636, 
1058.87994622004, 1055.5466184848, 1054.88694005768, 1053.88913948076, 
1056.96921953021, 1059.95310805114, 77.1228859956622, 81.0362538530292, 
78.8404654349793, 46.4728298378735, 33.7103494024937, 38.1634534707235, 
33.5520386736078, 26.2429467891094, 30.5979953728327, 30.5979953728327, 
31.2223518673486, 33.7665461425831, 36.6962580582319, 37.7398082531122, 
40.5860776927095, 41.0627097257687, 40.7556533339627, 52.526559398101, 
67.2093345204357, 57.3558861837519, 61.809628052695, 65.0522479908148, 
60.3356537763659, 59.9025026642582, 60.6951031882524, 60.0950548232381, 
59.3846485649388, 64.6199416069941, 64.1051430716001, 55.6515339908006, 
58.7835089189351, 55.0890845598537, 48.1838706704649, 46.0064642542491, 
48.4030879681908, 55.5793562399467, 43.3339041496164, 35.5089178322478, 
42.157901440901, 32.5975281088021, 28.6602735068277, 26.9110067493817, 
23.5372731683978, 27.6575715257538, 27.7636741048428, 28.4241344813052, 
27.7437779358905, 33.8748748481366, 38.0173561927228, 37.3614293051309, 
46.7027642395441, 51.6960358269122, 46.2684476430283, 67.9712504992444, 
67.4307596718059, 65.3539239654913, 69.3859268680975, 65.8884694613497, 
48.7463489665683, 48.3776103610145, 58.1513743683333, 53.5784372311078, 
46.4319595892114, 54.1515204375632, 48.0571628692748, 48.6571396623733, 
52.2995925118996, 44.9774509790143, 45.2591195805464, 48.7943143049565, 
56.0044804919092, 57.6982718090011, 75.947686211121, 66.6475291255686, 
63.2031704734223, 66.0494138822722, 66.2641524590373, 64.6800962380417, 
66.0941051628946, 68.6330617447997, 62.298871330898, 58.4734193157287, 
52.329016147723, 43.5650542408412, 44.6973713488007, 56.9666746925596, 
61.477502601121, 70.1850582389349, 68.3785649248245, 64.1672444920065, 
68.1060250901431, 67.2130080618559, 73.8468747118516, 69.6113702464934, 
73.1570958144156, 74.8830412236628, 85.4049570826199, 81.7882678868151, 
79.8159292966814, 65.9053697697576, 57.9091367119927, 44.4025529377091, 
43.2388424796772, 42.7803356293289, 47.7057738515549, 44.7755737074884, 
45.7557906780512, 40.016244653124, 41.4992896665767, 46.6336286507843, 
44.3657650232027, 45.4718259236287, 45.2372613787558, 56.9881807801438, 
58.8717301068573, 68.2039283244873, 73.5215112680329, 78.8594307629251, 
73.0335410836162, 71.845824268758, 73.323376014074, 89.1748677280385, 
88.8275948061702, 88.079358554904, 72.9197089804835, 66.5774741060939, 
65.5905607795046, 60.3560855296636, 60.5351059532554, 61.4085229097936, 
58.076745639994, 63.2173375817626, 67.2733875032827, 68.7459719049055, 
59.9037653356146, 44.6491666372171, 40.4929666577831, 30.2655738215587, 
36.0522832244009, 40.7505784647263, 45.517250253278, 41.5835266382263, 
41.3526668380199, 41.539756712543, 48.3189167794286, 49.8415866657383, 
44.5858982397584, 50.0675010891207, 50.5139938354098, 44.9097955003298, 
37.4247186375495, 41.3952548987526, 39.6467050713014, 39.3953595896288, 
36.8289128008105, 42.8772642627352, 37.5760511024063, 42.0791664435174, 
36.4236440580649, 25.1434697637668, 29.0666072154372, 25.3668839063101, 
34.1040319281821, 34.1351918720353, 42.138526061446, 49.3942545777117, 
53.2282422165058, 60.0907410718325, 59.6946479180297, 56.5126081396889, 
64.5584522103826, 61.6638469740838, 48.5567687748239, 50.4491176695018, 
45.8595330253583, 39.1134283844586, 22.2017732449298, 24.6509068125481, 
33.7409449463083, 27.0354908046699, 36.9033514343542, 31.849732552439, 
28.384694400023, 30.2843907497844, 30.2566110685775, 30.1702095862, 
28.1229085893699, 39.7891005017724, 37.8236546439287, 33.4844836408483, 
42.9231744072258, 49.6425369989148, 43.9761986844232, 44.7318583977582, 
37.1424843378588, 40.8120228103859, 50.807226927847, 47.9214803669887, 
44.995279725301, 41.3197867616665, 47.7401787161256, 40.9599257198947, 
48.8101085201251, 58.7773921954413, 46.8976151314924, 38.7370234461344, 
43.0052200556536, 42.7247275761847, 51.7764243779359, 47.5063348907638, 
48.4623219235214, 51.3175593621287), class = c("xts", "zoo"), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", src = "yahoo", updated = structure(1544977543.47594, class = c("POSIXct", 
"POSIXt")), index = structure(c(1517356800, 1517443200, 1517529600, 
1517788800, 1517875200, 1517961600, 1518048000, 1518134400, 1518393600, 
1518480000, 1518566400, 1518652800, 1518739200, 1519084800, 1519171200, 
1519257600, 1519344000, 1519603200, 1519689600, 1519776000, 1519862400, 
1519948800, 1520208000, 1520294400, 1520380800, 1520467200, 1520553600, 
1520812800, 1520899200, 1520985600, 1521072000, 1521158400, 1521417600, 
1521504000, 1521590400, 1521676800, 1521763200, 1522022400, 1522108800, 
1522195200, 1522281600, 1522627200, 1522713600, 1522800000, 1522886400, 
1522972800, 1523232000, 1523318400, 1523404800, 1523491200, 1523577600, 
1523836800, 1523923200, 1524009600, 1524096000, 1524182400, 1524441600, 
1524528000, 1524614400, 1524700800, 1524787200, 1525046400, 1525132800, 
1525219200, 1525305600, 1525392000, 1525651200, 1525737600, 1525824000, 
1525910400, 1525996800, 1526256000, 1526342400, 1526428800, 1526515200, 
1526601600, 1526860800, 1526947200, 1527033600, 1527120000, 1527206400, 
1527552000, 1527638400, 1527724800, 1527811200, 1528070400, 1528156800, 
1528243200, 1528329600, 1528416000, 1528675200, 1528761600, 1528848000, 
1528934400, 1529020800, 1529280000, 1529366400, 1529452800, 1529539200, 
1529625600, 1529884800, 1529971200, 1530057600, 1530144000, 1530230400, 
1530489600, 1530576000, 1530748800, 1530835200, 1531094400, 1531180800, 
1531267200, 1531353600, 1531440000, 1531699200, 1531785600, 1531872000, 
1531958400, 1532044800, 1532304000, 1532390400, 1532476800, 1532563200, 
1532649600, 1532908800, 1532995200, 1533081600, 1533168000, 1533254400, 
1533513600, 1533600000, 1533686400, 1533772800, 1533859200, 1534118400, 
1534204800, 1534291200, 1534377600, 1534464000, 1534723200, 1534809600, 
1534896000, 1534982400, 1535068800, 1535328000, 1535414400, 1535500800, 
1535587200, 1535673600, 1536019200, 1536105600, 1536192000, 1536278400, 
1536537600, 1536624000, 1536710400, 1536796800, 1536883200, 1537142400, 
1537228800, 1537315200, 1537401600, 1537488000, 1537747200, 1537833600, 
1537920000, 1538006400, 1538092800, 1538352000, 1538438400, 1538524800, 
1538611200, 1538697600, 1538956800, 1539043200, 1539129600, 1539216000, 
1539302400, 1539561600, 1539648000, 1539734400, 1539820800, 1539907200, 
1540166400, 1540252800, 1540339200, 1540425600, 1540512000, 1540771200, 
1540857600, 1540944000, 1541030400, 1541116800, 1541376000, 1541462400, 
1541548800, 1541635200, 1541721600, 1541980800, 1542067200, 1542153600, 
1542240000, 1542326400, 1542585600, 1542672000, 1542758400, 1542931200, 
1543190400, 1543276800, 1543363200, 1543449600, 1543536000), tzone = "UTC", tclass = "Date"), .Dim = c(212L, 
4L), .Dimnames = list(NULL, c("y", "x1", "x2", "x3")))

EDIT: Just so I understand the function output a little better.

I set:

  , n_train = 5
  , n_test = 1

and get the following final 3 outputs:

[[203]]
2018-11-16 2018-11-19 2018-11-20 2018-11-21 2018-11-23 2018-11-26 
1.00045650 0.08862828 0.61874897 1.00620776 0.67800147 0.60795702 

[[204]]
2018-11-19 2018-11-20 2018-11-21 2018-11-23 2018-11-26 2018-11-27 
0.05759443 0.69372082 0.93025186 0.72564291 0.60694731 0.98584268 

[[205]]
2018-11-20 2018-11-21 2018-11-23 2018-11-26 2018-11-27 2018-11-28 
 0.8507988  0.8028078  0.7412901  0.6416496  0.9538837  1.0095700

Are these the predicted probabilities of the event happening? How can we have 1.0095700 as one of the probabilities?

Secondly, since n_train = 5 and n_test = 1, does the last output tell me that the first 5 results are the predicted probabilities on the training data and the 6th result is the prediction on the test data, i.e. date 2018-11-28 = 1.0095700? And likewise for result 204, 2018-11-27 = 0.98584268?

user8959427
  • This [SO post](https://stackoverflow.com/questions/38041167/rolling-regression-and-prediction-with-lm-and-predict) about rolling regression might help you. – phiver Dec 16 '18 at 18:00

1 Answer

I am not sure how you intend to use such a function, but you can wrap some of the code in an extra function that computes the training and testing indexes. For example:

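myfun <- function(fm, dat, train_index, test_index){

  # Fit on the training rows only, then predict the held-out test rows
  fit <- glm(fm, data = dat[train_index, ])
  predict(fit, newdata = dat[test_index, ], type = 'response')

}


wrapper_myfun <- function(
  dat
  , n_train = 20
  , n_test = 10
){

  # Build the formula y ~ x1 + x2 + ... from the column names
  stopifnot('y' %in% names(dat))
  f_ <- formula(paste0('y ~ ', paste(setdiff(names(dat), 'y'), collapse = ' + ')))

  # Each window spans n_train + n_test rows and slides forward one row at a time
  stride <- n_train + n_test
  stopifnot(nrow(dat) >= stride)
  start_position <- seq(1, nrow(dat) - stride + 1)

  # Rows i .. i + n_train - 1 are used for training
  train_index_list <- lapply(start_position
                          , function(i) seq(i, i + n_train - 1))
  # The next n_test rows are used for testing
  test_index_list <- lapply(start_position
                         , function(i) seq(i + n_train
                                               , i + n_train + n_test - 1))

  # One vector of test-set predictions per window
  mapply(
    myfun
    , train_index = train_index_list
    , test_index = test_index_list
    , MoreArgs = list(fm = f_, dat = dat)
    , SIMPLIFY = FALSE
  )

}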

You can further optimize this code.
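
As a sanity check on the window arithmetic, here is a minimal sketch of the index generation on its own, assuming each window has exactly `n_train` training rows followed by `n_test` test rows and slides forward one row at a time. `rolling_windows` is a hypothetical helper name used only for illustration; it is not part of any package.

```r
# Hypothetical helper: index ranges for a rolling train/test split
# with fixed window sizes, sliding forward one row at a time.
rolling_windows <- function(n_obs, n_train = 20, n_test = 10) {
  stopifnot(n_obs >= n_train + n_test)
  # Last start position is the one whose test rows end at n_obs
  starts <- seq_len(n_obs - n_train - n_test + 1)
  lapply(starts, function(i) {
    list(
      train = seq(i, length.out = n_train),           # rows i .. i + n_train - 1
      test  = seq(i + n_train, length.out = n_test)   # the next n_test rows
    )
  })
}

w <- rolling_windows(100)
length(w)                   # number of windows for 100 days
range(w[[1]]$train)         # first training window
range(w[[1]]$test)          # first test window
range(w[[length(w)]]$test)  # last test window ends on the last day
```

Under these assumptions, 100 days with a 20/10 split give 71 windows: the first trains on days 1 to 20 and tests on days 21 to 30, and the last trains on days 71 to 90 and tests on days 91 to 100. Changing `n_obs`, `n_train` or `n_test` only changes the number of windows.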

Choosing between 1 and 10 time periods for testing depends quite a bit on the application.

HTH

Plamen Petrov
  • Thanks for your answer! The function looks quite involved; how can I now apply it to the `dat` data frame? – user8959427 Dec 16 '18 at 20:51
  • Just run `wrapper_myfun(dat)`; it will produce a list of predictions. – Plamen Petrov Dec 16 '18 at 20:56
  • My intention is this: say we have 2000 days of stock data (where the data is non-stationary). I want to train a model on 20 days and test it on the next 10 days, but the model will make very different predictions depending on whether it was trained on data from 4 years ago or from 4 weeks ago. I want to keep the training and test set sizes fixed throughout the period but run different models in order to take in new information. That is, the features I feed the model should work well 4 years ago and also (if the model is good) work 4 weeks ago. – user8959427 Dec 16 '18 at 20:58
  • (I am using this for non-stock-market data, but the same principle holds.) – user8959427 Dec 16 '18 at 20:58
  • I get the following error: `Error in eval(substitute(subset), data, env) : object 'train_index' not found`. Should I define `train_index` outside the function first? – user8959427 Dec 16 '18 at 21:00
  • Yeah, sorry, the subset argument has a tricky evaluation. I updated my code, try again. – Plamen Petrov Dec 16 '18 at 21:22
  • Oh great, I have it working now. I will take a look at the results, thanks again! – user8959427 Dec 16 '18 at 21:37
  • I have added a small edit to the original question just for my understanding, could you comment on it? – user8959427 Dec 16 '18 at 21:49
  • So, the prediction type is 'response', and since the model is linear you are getting a prediction for the value of y, not a probability. If you want a probability, you should use a logit link function and build a logistic regression. – Plamen Petrov Dec 16 '18 at 22:03
  • Regarding the output, can you check that `newdata` is properly spelled? I may have had a typo in my post. The output should be a single value. – Plamen Petrov Dec 16 '18 at 22:04
  • Regarding your latest point, I had `new_data`, not `newdata` as you have currently in your post. I now get one result :) – user8959427 Dec 16 '18 at 22:07
  • I added `fit <- glm(fm, data=dat[train_index, ], family = binomial); predict(fit, newdata = dat[test_index, ], type = "response")` and now many of the probabilities are equal to 1. I will look into it further. – user8959427 Dec 16 '18 at 22:11
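
The two issues raised in these comments can be reproduced on a small example. The sketch below uses simulated data (made up purely for illustration, not the `dat` from the question) to show that a misspelled `new_data` argument is silently absorbed by `...`, so `predict` returns fitted values for the training rows, and that `family = binomial` keeps predictions inside [0, 1] while the default Gaussian family does not guarantee it.

```r
set.seed(1)  # simulated data, for illustration only
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 * sim$x1 - 0.5 * sim$x2))

train_index <- 1:100
test_index  <- 101:110

# Default family (gaussian): a linear model for y, not a probability model
fit_lin <- glm(y ~ x1 + x2, data = sim[train_index, ])

# Misspelled argument: `new_data` falls into `...` and is ignored,
# so this returns one fitted value per TRAINING row
p_typo <- predict(fit_lin, new_data = sim[test_index, ], type = "response")

# Correct spelling: one prediction per TEST row
p_lin <- predict(fit_lin, newdata = sim[test_index, ], type = "response")

# Logistic regression: binomial family with its default logit link
fit_logit <- glm(y ~ x1 + x2, data = sim[train_index, ], family = binomial)
p_logit <- predict(fit_logit, newdata = sim[test_index, ], type = "response")

length(p_typo)   # equals length(train_index)
length(p_lin)    # equals length(test_index)
range(p_logit)   # always within [0, 1]
```

As for the many probabilities equal to 1 with `family = binomial`: one possible cause, not confirmed for this data set, is that a 20-row training window with three predictors can be perfectly or nearly perfectly separable, which pushes fitted probabilities towards 0 or 1. In that case `glm` usually warns that fitted probabilities numerically 0 or 1 occurred.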