Most efficient way to loop through multiple csv files and calculate NYSE tick

Question

I have 1 minute data of all NYSE stocks in separate csv files with the following structure:

(data, open, high, low, close, volume)

2013-09-16 09:30:00,461.01,461.49,461,461,183507
2013-09-16 09:31:00,460.82,461.6099,460.39,461.07,212774
...
2013-09-16 15:59:00,449.72,450.0774,449.59,449.95,146399
2013-09-16 16:00:00,450.12,450.12,449.65,449.65,444594
2013-09-17 09:30:00,448,448,447.5,447.96,173624
2013-09-17 09:31:00,449.2628,449.68,447.5,448,193186
....

I want to calculate the NYSE tick (the number of stocks trading on an uptick minus the number of stocks trading on a downtick) and write it to a new csv file with the structure:

2013-09-16 09:30:00,[tick]
2013-09-16 09:31:00,[tick]
...
2013-09-16 15:59:00,[tick]
2013-09-16 16:00:00,[tick]
2013-09-17 09:30:00,[tick]
2013-09-17 09:31:00,[tick]
....

Essentially, I need to:

Loop every csv file
    # For every file compare close[t] to close[t-1]
    If close[t] > close[t-1]:
        increment tick by 1
    Else:
        subtract 1 from tick

How would you do this most efficiently in python?

What have you tried? Most efficiently in terms of speed or in terms of code? — Hari Menon, Oct 05 '13 at 16:20
This is a duplicate of [this question](http://stackoverflow.com/questions/19199672/most-efficient-way-to-loop-through-multiple-csv-files-and-calculate-nyse-tick) asked here on SO, currently on hold. I'm not sure what the protocol is. — DSM, Oct 05 '13 at 16:22
@HariShankar, in terms of code. Speed is not too important... — a_b, Oct 05 '13 at 16:30

score 3 · Accepted Answer · answered Oct 05 '13 at 18:57

I'd probably use the pandas library for this. It has lots of nice features for working with time series in general and OHLC data in particular, but we won't use any here.

import glob
import numpy as np
import pandas as pd

stocks = glob.glob("stock*.csv")

total_tick = 0
for stock in stocks:
    df = pd.read_csv(stock, 
                     names=["time", "open", "high", "low", "close", "volume"],
                     parse_dates=[0], index_col="time")
    tick = df["close"].diff().apply(np.sign).fillna(0.0)
    total_tick += tick

total_tick.to_csv("tick.csv")

which produces an output looking something like

2013-09-16 09:30:00,0.0
2013-09-16 09:31:00,3.0
2013-09-16 15:59:00,-5.0
2013-09-16 16:00:00,-3.0
2013-09-17 09:30:00,1.0
2013-09-17 09:31:00,-1.0

where I've made up sample data looking like yours.

The basic idea is that you can read a csv file into an object called a DataFrame:

>>> df
                         open      high     low       close  volume
time                                                               
2013-09-16 09:30:00  461.0100  461.4900  461.00  453.484089  183507
2013-09-16 09:31:00  460.8200  461.6099  460.39  474.727508  212774
2013-09-16 15:59:00  449.7200  450.0774  449.59  436.010403  146399
2013-09-16 16:00:00  450.1200  450.1200  449.65  455.296584  444594
2013-09-17 09:30:00  448.0000  448.0000  447.50  447.465545  173624
2013-09-17 09:31:00  449.2628  449.6800  447.50  477.785506  193186

We can select a column:

>>> df["close"]
time
2013-09-16 09:30:00    453.484089
2013-09-16 09:31:00    474.727508
2013-09-16 15:59:00    436.010403
2013-09-16 16:00:00    455.296584
2013-09-17 09:30:00    447.465545
2013-09-17 09:31:00    477.785506
Name: close, dtype: float64

Take the difference, noting that if we're subtracting from the previous value, then the initial value is undefined:

>>> df["close"].diff()
time
2013-09-16 09:30:00          NaN
2013-09-16 09:31:00    21.243419
2013-09-16 15:59:00   -38.717105
2013-09-16 16:00:00    19.286181
2013-09-17 09:30:00    -7.831039
2013-09-17 09:31:00    30.319961
Name: close, dtype: float64

Make this either positive or negative, depending on its sign:

>>> df["close"].diff().apply(np.sign)
time
2013-09-16 09:30:00   NaN
2013-09-16 09:31:00     1
2013-09-16 15:59:00    -1
2013-09-16 16:00:00     1
2013-09-17 09:30:00    -1
2013-09-17 09:31:00     1
Name: close, dtype: float64

And fill the NaN with a 0.

>>> df["close"].diff().apply(np.sign).fillna(0)
time
2013-09-16 09:30:00    0
2013-09-16 09:31:00    1
2013-09-16 15:59:00   -1
2013-09-16 16:00:00    1
2013-09-17 09:30:00   -1
2013-09-17 09:31:00    1
dtype: float64

This assumes that the recording times match across all stocks: if not, there are powerful resampling tools available to align them.

@DSM, how would you deal with missing minutes? For example if 09:32:00 is missing like here: `2013-09-16 09:30:00 2013-09-16 09:31:00 2013-09-16 09:33:00 2013-09-16 09:34:00 ...` I need a clean series for all tickers where I forward fill every day from 9:30 to 16:00 — a_b, Oct 09 '13 at 08:13
asked it as a separate question here: [How do you clean and forward fill a multiple day 1 minute time series with pandas?](http://stackoverflow.com/questions/19268003/how-do-you-clean-and-forward-fill-a-multiple-day-1-minute-time-series-with-panda) — a_b, Oct 09 '13 at 09:28

score 1 · Answer 2 · answered Oct 05 '13 at 18:18

There are several parts of this problem that have elegant (but perhaps not obvious) solutions in Python.

The first tricky part is handling opening all the files at once. Normally you want to use with open(filename), but this doesn't work when there's a (possibly variable) large number of files to open. Instead, you can use the contextlib.ExitStack context manager to handle the files:

import contextlib
import csv
import itertools

def tick(filenames):
    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(name, newline=""))
                 for name in filenames]

The next bit is to read the handle reading the csv files. This is actually really easy using the csv module:

        CSVs = [csv.reader(file) for file in files]

Now, while we are iterating over the CSV file, we need to be considering each line together with the previous one. That is, we need both the current and the previous lines to be yielded at once. There's an elegant way to do that with itertools.tee and zip (stick this function before or after tick):

def current_and_prev(iterable):
    a, b = itertools.tee(iterable, 2)
    next(a) # throw away first value from "current" iterator
    return zip(a, b)

Back in the tick function, we can wrap this around each csv.reader instance:

        pair_iterators = [current_and_prev(CSV) for CSV in CSVs]

Now, we want to iterate over all the iterators together, in parallel (since we need to check them all to calculate the tick). zip is the tool to do that:

        for pairs in zip(*pair_iterators):

pairs will be a tuple of 2-tuples holding the current and previous lines of the csv data from each file. We need to grab the timestamp from one of the current lines, then iterate over them all to find which are on an uptick and which are on a downtick.

            timestamp = pair[0][0][0] # first file, current line, first column
            tick = 0
            for current, prev in pairs:
                if float(current[-2]) > float(prev[-2]):
                    tick += 1
                elif float(current[-2]) < float(prev[-2]):
                    tick -= 1
            yield timestamp, tick

Most efficient way to loop through multiple csv files and calculate NYSE tick

2 Answers2