
I am working with more than 6MM rows of ticker symbol data. I would like to grab all of the data for a symbol, do the processing I need, and output the results.

I have written code that tells me what line each ticker starts on (see the code below). I am thinking it would be more efficient if I knew the byte position at which each new symbol starts (instead of the line number), so I could use seek() to jump straight to a ticker's starting position. I am also curious how to expand this logic to read an entire block of data (start_position to end_position) for a ticker.
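
To make the goal concrete, this is roughly what I have in mind once I know the byte positions (a rough sketch; start_pos and end_pos are placeholders for the offsets I would have stored for a ticker and for the next ticker):

# rough sketch of the goal -- start_pos/end_pos are placeholder byte offsets
f = open('C:\\temp\\sample_data.csv', 'rb')
f.seek(start_pos)                    # jump straight to the ticker's first row
block = f.read(end_pos - start_pos)  # read that ticker's whole block in one go
for line in block.splitlines():
    row = line.split(',')
    # ... process this ticker's row ...
f.close()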

import csv

data_line      = 0  # holds the file line number for the current symbol
ticker_start   = 0
ticker_end     = 0
cur_sec_ticker = ""
ticker_dl      = [] # array for holding the line number in the source file for the start of each ticker

reader = csv.reader(open('C:\\temp\\sample_data.csv', 'rb'), delimiter=',')
for row in reader:
    if cur_sec_ticker != row[1]:   # only process a new ticker
        ticker_fr = str(data_line) + ',' + row[1] # prep line for inserting into array

        # desired line for inserting into array; ticker_end would be the last line
        # of the current ticker's data block, i.e. the start of the next ticker's
        # block minus one (ticker_start - 1)
        #ticker_fr = str(ticker_start) + ',' + str(ticker_end) + ',' + str(data_line) + ',' + row[1]

        print ticker_fr
        ticker_dl.append(ticker_fr)
        cur_sec_ticker = row[1]
    data_line += 1
print ticker_dl

Below I have placed a small sample of how the data file is laid out:

seq,Symbol,Date,Open,High,Low,Close,Volume,MA200Close,MA50Close,PrimaryLast,filter_$
1,A,1/1/2008,36.74,36.74,36.74,36.74,0, , ,1,1
2,A,1/2/2008,36.67,36.8,36.12,36.3,1858900, , ,1,1
3,A,1/3/2008,36.3,36.35,35.87,35.94,1980100, , ,1,1
1003,AA,1/1/2008,36.55,36.55,36.55,36.55,0, , ,1,1
1004,AA,1/2/2008,36.46,36.78,36,36.13,7801600, , ,1,1
1005,AA,1/3/2008,36.18,36.67,35.74,36.19,7169000, , ,1,1
2005,AAN,4/20/2009,20,20.7,18.2067,18.68,808700, , ,1,1
2006,AAN,4/21/2009,18.7,19.06,18.6533,18.9933,530200, , ,1,1
2007,AAN,4/22/2009,19.2867,19.6267,18.54,19.1333,801100, , ,1,1
2668,AAP,1/1/2008,37.99,37.99,37.99,37.99,0, , ,1,1
2669,AAP,1/2/2008,37.99,38.15,37.17,37.59,1789200, , ,1,1
2670,AAP,1/3/2008,37.58,38.16,37.35,37.95,1584700, , ,1,1
3670,AAR,1/1/2008,22.94,22.94,22.94,22.94,0, , ,1,1
3671,AAR,1/2/2008,23.1,23.38,22.86,23.15,17100, , ,1,1
3672,AAR,1/3/2008,23,23,22,22.16,45600, , ,1,1
6886,ABB,1/1/2008,28.8,28.8,28.8,28.8,0, , ,1,1
6887,ABB,1/2/2008,29,29.11,28.23,28.64,4697700, , ,1,1
6888,ABB,1/3/2008,27.92,28.35,27.79,28.08,5240100, , ,1,1

1 Answer


In general, you can get the current position of a file object with the tell method. However, it may be difficult to get that to work with your current code, which delegates the file reading to the csv module. It's hard to do even when reading line by line, since the underlying file object will probably be read in larger chunks than a single line (the readline and readlines methods do some caching in the background to hide this from you).
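
For example, something like the following (using the question's file) will typically report a position well past the end of the first data row, because the file object reads ahead in bigger blocks than one line:

import csv

f = open('C:\\temp\\sample_data.csv', 'rb')
reader = csv.reader(f, delimiter=',')
first_row = next(reader) # only consumes the header row...
print f.tell()           # ...but usually reports the size of the read-ahead buffer, not the length of that row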

While I would skip the whole idea of seeking to specific bytes, if it's really worthwhile for your program you'll probably need to take charge of the file reading yourself so that you can keep track of exactly where you are in the file at all times. tell probably isn't necessary.

Something like this might work to read a chunk of data and then split it into lines and values while keeping track of how many bytes have been read so far:

def generate_values(f):
    buf = "" # a buffer of data read from the file
    pos = 0  # the position of our buffer within the file

    while True: # loop until we return at the end of the file
        new_data = f.read(4096) # read up to 4k bytes at a time

        if not new_data: # quit if we got nothing
            if buf:
                yield pos, buf.split(",") # handle any data after last newline
            return

        buf += new_data
        line_start = 0 # index into buf

        try:
            while True: # loop until an exception is raised at end of buf
                line_end = buf.index("\n", line_start) # find end of line
                line = buf[line_start:line_end] # excludes the newline

                if line: # skips blank lines
                    yield pos+line_start, line.split(",") # yield pos,data tuple

                line_start = line_end+1
        except ValueError: # raised by `index()`
            pass

        # advance past everything consumed so far; using line_start (instead of
        # line_end + 1) stays correct even if this chunk contained no newline
        pos += line_start
        buf = buf[line_start:] # keep any partial line left over from the end of the buffer

This might need a little tweaking if your file has line endings other than \n, but it shouldn't be too hard.
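
A sketch of how you might use it for your particular task (assuming you build an index of where each symbol's block starts, much as your own loop does with line numbers):

# sketch: build a symbol -> byte-offset index, then seek back to a ticker later
ticker_starts = {}
cur_sec_ticker = ""
with open('C:\\temp\\sample_data.csv', 'rb') as f:
    for pos, values in generate_values(f):
        if values[1] != cur_sec_ticker:     # first row of a new ticker
            ticker_starts[values[1]] = pos  # byte offset of that row (the header line shows up under 'Symbol', as in your own loop)
            cur_sec_ticker = values[1]

# later: jump straight to one ticker's block and read until the symbol changes
# (pos values yielded here are relative to the seek position, since the
# generator starts its count at zero)
with open('C:\\temp\\sample_data.csv', 'rb') as f:
    f.seek(ticker_starts['AAP'])
    for pos, values in generate_values(f):
        if values[1] != 'AAP':
            break
        # ... process this row of AAP data ...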

  • Thank you, I understand where you have taken the logic and I understand why. Given my code, might I be able to do something similar if I accumulate the length of each row? I would still be able to analyze what ticker I am on and at the least capture the starting file position for each ticker. – Dr.EMG Dec 10 '12 at 18:06
  • @Dr.EMG: In theory you might be able to reconstruct the length of a line, in practice it may be hard to get it right, since you don't have control over the line reading, value splitting or other details, and a few bytes might get misplaced here or there without you having a chance of noticing. If you want to stick with the `csv` module, I'd suggest you avoid dealing with the file positions at all, and simply working with the rows as they are read. – Blckknght Dec 10 '12 at 18:19
  • Implementing my suggestion does not work (as you implied in your write up) because of the csv parsing. Might I combine the two processes: use readline() to get the length and accumulate it and then parse the line using the csv parser to identify the ticker I am on. I can forgo capturing the end, but could also know the ending position given the start of the next symbol by queuing the current and the next index of the ticker_dl array. – Dr.EMG Dec 10 '12 at 18:23
  • My code does the csv parsing for you (or at least, it splits on commas). It generates a sequence of (byte-position, csv-values) pairs. You can almost drop it in in place of your `csv.reader` call. My point in my last comment was that it might be a bad design choice to try to get at the byte position at all (instead of parsing the data from the CSV file into a better data structure). Do you really need to go back to look up the data for a ticker symbol later on? – Blckknght Dec 10 '12 at 18:42
  • Thanks much. Using your example I was able to move forward. – Dr.EMG Dec 11 '12 at 02:31
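
For completeness, the readline()-plus-csv combination discussed in the comments above might look roughly like this (a sketch based on the discussion, not code from the thread); opening the file in 'rb' mode keeps len(line) in step with the real byte offsets:

import csv

ticker_dl = []       # list of (byte_offset, symbol) pairs, one per ticker
cur_sec_ticker = ""
pos = 0
f = open('C:\\temp\\sample_data.csv', 'rb')
for line in iter(f.readline, ''):
    row = next(csv.reader([line]))      # parse just this one line with the csv module
    if row and cur_sec_ticker != row[1]:
        ticker_dl.append((pos, row[1])) # byte offset where this ticker's block starts
        cur_sec_ticker = row[1]
    pos += len(line)                    # readline() keeps the newline, so this tracks the file offset
f.close()
# a ticker's block ends where the next ticker's start offset begins, as suggested above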