
In a hypothetical scenario where I have a large amount of data that is received, and is generally in chronological order upon receipt, is there any way to "play" the data forward or backward, thereby recreating the flow of new information on demand? I know that in a simplistic sense I can always have a script (whatever its output, that part is unimportant) built around a for loop that takes in some number of events or observations and does something, then takes in more observations, updates what was previously output with a new result, and so on. Is there a way of doing this that is more scalable than a simple for loop?
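Something like this minimal sketch is the naive version I have in mind (the event fields and the update step are just placeholders):

```python
from typing import Iterable, Iterator

def replay(events: Iterable[dict]) -> Iterator[dict]:
    """Naive replay: walk events in chronological order, updating a running state."""
    state = {"count": 0, "last_seen": None}   # placeholder running result
    for event in events:                      # events assumed already sorted by time
        state["count"] += 1
        state["last_seen"] = event.get("timestamp")
        yield dict(state)                     # snapshot of what we "know" after this event

# Each yielded snapshot is the view of the world after that many observations.
events = [{"timestamp": t, "value": v} for t, v in [(1, 10.0), (2, 12.0), (3, 9.0)]]
for snapshot in replay(events):
    print(snapshot)
```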

Basically, any time I look into this subject I quickly find myself navigating to the subject area of High Frequency Trading, specifically algorithm efficacy by way of backtesting against historical data. While my question is about doing this in a broader sense, where our observations do not need to be stock/option/future price pips, the same principles must apply. Does anyone have experience with how such a platform is built at a more scalable level than just a for loop with logic below it? Another example would be health/claims data, where one can see backward and forward what happened as more claims came in over time.

KidMcC
  • I'm not sure if you can really talk about the "scalability" of a `for` loop. How else would you traverse all elements in a data set of size N without going over all N elements? – Rushy Panchal May 30 '16 at 02:35
  • Perhaps I put it too harshly. To be clear, I would consider a process built around a for loop, like multiprocessing where multiple processes run separate for loops and are piped together, to be different from simply running a for loop. If this was not clear, I apologize, but I would consider an answer such as the one above as different from simply using a for loop. – KidMcC May 30 '16 at 02:41
  • I see. Although it's possible to pipeline data into multiple processes (or even in separate servers) relatively simply, it seems counterintuitive to do so – how can you handle data that is inherently linear (because it is time-based) in parallel? For example, if you are testing a trading algorithm on historical data, shouldn't you be processing it in linear fashion, so that you maintain the same flow of data that you'd get in real-time? – Rushy Panchal May 30 '16 at 02:52
  • Got it. Frankly, I think I probably considered "playing" the data back and forth to be the same as backtesting, while backtesting could be done in a multitude of ways, for example using bootstrapping or some other statistical method that still makes judgements as new data becomes available, but one need not strictly rely on its sequential nature. – KidMcC May 30 '16 at 02:57

1 Answer


The problem with optimising the loop is that, say, you have 3 years of data but only 3 events that you are interested in. Then you can use event-based backtesting with only 3 iterations.
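A rough sketch of that idea, assuming the interesting events and their timestamps have already been identified (all data here is made up):

```python
import bisect

# Hypothetical, already-sorted tick history.
timestamps = [1, 2, 5, 9, 14, 20, 31]
prices = [100.0, 101.0, 99.0, 103.0, 102.0, 98.0, 105.0]

# The 3 precomputed events of interest: the loop runs 3 times, not len(timestamps) times.
event_times = [5, 14, 31]

for t in event_times:
    i = bisect.bisect_right(timestamps, t) - 1   # last observation at or before the event
    print(f"at t={t}: price={prices[i]}")        # act on the state as of the event
```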

The problem here is that you have to precompute the events, which requires a pass over the data anyway, and most of the time you will also need statistics about your backtest that require the full data as well, like maximum drawdown.
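Maximum drawdown is a good illustration: it cannot be recovered from the few event points alone, it needs a pass over the whole equity curve. A minimal sketch with a made-up curve:

```python
equity = [100.0, 105.0, 103.0, 110.0, 96.0, 99.0, 120.0, 111.0]  # hypothetical equity curve

peak = float("-inf")
max_drawdown = 0.0
for value in equity:                                  # full pass over the data
    peak = max(peak, value)                           # running peak so far
    max_drawdown = max(max_drawdown, (peak - value) / peak)

print(max_drawdown)                                   # (110 - 96) / 110 ≈ 0.127
```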

So most backtesting frameworks will use the loop anyway, or vectorisation if you are on R/MATLAB/NumPy.
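With NumPy the same statistic can be vectorised, which is roughly how those frameworks avoid a per-observation Python loop (same hypothetical curve as above):

```python
import numpy as np

equity = np.array([100.0, 105.0, 103.0, 110.0, 96.0, 99.0, 120.0, 111.0])

running_peak = np.maximum.accumulate(equity)          # peak equity up to each point
drawdowns = (running_peak - equity) / running_peak    # drawdown at each point
print(drawdowns.max())                                # ≈ 0.127, no explicit Python loop
```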

If you really need to optimise it, you probably need to precompute and store all the information and then just do look-ups.
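A sketch of that precompute-then-look-up approach: build the running statistics once, index them by timestamp, and then "playing" to any point in time becomes a cheap lookup instead of a re-scan (all names and numbers are placeholders):

```python
import bisect

timestamps = [1, 2, 5, 9, 14, 20, 31]
prices = [100.0, 101.0, 99.0, 103.0, 102.0, 98.0, 105.0]

# One pass to precompute running statistics, one snapshot per timestamp.
running_max, running_sum, snapshots = float("-inf"), 0.0, []
for i, (t, p) in enumerate(zip(timestamps, prices)):
    running_max = max(running_max, p)
    running_sum += p
    snapshots.append({"t": t, "max": running_max, "mean": running_sum / (i + 1)})

def state_at(t):
    """Precomputed state as of time t: an O(log n) lookup, no replay needed."""
    i = bisect.bisect_right(timestamps, t) - 1
    return snapshots[i] if i >= 0 else None

print(state_at(14))   # {'t': 14, 'max': 103.0, 'mean': 101.0}
```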