
My task is relatively simple: for each line in an input file, test whether the line satisfies a given set of conditions, and if so, write specific columns of that line to a new file. I've written a python script that does this, but I'd like some help on 1) improving speed, 2) the best way to work in terms of column names (as column numbers can vary from file to file), and 3) the best way to specify my filtering conditions and desired output columns.

1) The files I work with contain photometry for astronomical images. Each file is around 1e6 lines by 150 columns of floats, typically over 1GB in size. I have an old AWK script that will process files like this in about 1 minute; my python script takes between 5 and 7 minutes. I often need to tweak the filtering conditions and rerun several times until the output file is what I want, so speed is definitely desirable. I've found that the for loop is plenty fast; it's what I do inside the loop that slows it down. Using itemgetter to pick out just the columns I want was a big improvement over reading the entire line into memory, but I'm unsure of what I can do to further increase speed. Can this ever be as fast as AWK?

2) I'd like to work in terms of column names instead of column numbers since the column number of a particular quantity (photon counts, background, signal-to-noise, etc) can change between files. In my AWK script, I always need to check that the column numbers are correct where conditions and output columns are specified, even if the filtering and output apply to the same quantities. My solution in python has been to create a dictionary that assigns a column number to each quantity. When a file has different columns, I only need to specify a new dictionary. Perhaps there is a better way to do this?

3) Ideally, I would only need to specify the names of the input and output files, the filtering conditions, and desired columns to output, and they would be found at the top of my script so I wouldn't need to go searching through the code just to tweak something. My main issue has been with undefined variables. For example, a typical condition is 'SNR > 4', but 'SNR' (signal-to-noise) isn't actually assigned a value until lines start being read from the photometry file. My solution has been to use a combination of strings and eval/exec. Again, maybe there is a better way?

I'm not at all trained in computer science (I'm a grad student in astronomy) - I typically just hack something together and debug until it works. However, optimization with regard to my three points above has become extremely important for my research. I apologize for the lengthy post, but I felt that the details would be helpful. Any and all advice you have for me, in addition to just cleaning things up/coding style, would be greatly appreciated.

Thanks so much, Jake

#! /usr/bin/env python2.6

from operator import itemgetter


infile = 'ugc4305_1.phot'
outfile = 'ugc4305_1_filt.phot'

# names must belong to the dictionary below
conditions = 'OBJ <= 2 and SNR1 > 4 and SNR2 > 4 and FLAG1 < 8 and FLAG2 < 8 and (SHARP1 + SHARP2)**2 < 0.075 and (CROWD1 + CROWD2) < 0.1'

input = 'OBJ, SNR1, SNR2, FLAG1, FLAG2, SHARP1, SHARP2, CROWD1, CROWD2'
    # should contain all quantities used in conditions

output = 'X, Y, OBJ, COUNTS1, BG1, ACS1, ERR1, CHI1, SNR1, SHARP1, ROUND1, CROWD1, FLAG1, COUNTS2, BG2, ACS2, ERR2, CHI2, SNR2, SHARP2, ROUND2, CROWD2, FLAG2'

# dictionary of column numbers for the more important quantities
columns = dict(EXT=0, CHIP=1, X=2, Y=3, CHI_GL=4, SNR_GL=5, SHARP_GL=6, ROUND_GL=7, MAJAX_GL=8, CROWD_GL=9, OBJ=10, COUNTS1=11, BG1=12, ACS1=13, STD1=14, ERR1=15, CHI1=16, SNR1=17, SHARP1=18, ROUND1=19, CROWD1=20, FWHM1=21, ELLIP1=22, PSFA1=23, PSFB1=24, PSFC1=25, FLAG1=26, COUNTS2=27, BG2=28, ACS2=29, STD2=30, ERR2=31, CHI2=32, SNR2=33, SHARP2=34, ROUND2=35, CROWD2=36, FWHM2=37, ELLIP2=38, PSFA2=39, PSFB2=40, PSFC2=41, FLAG2=42)



f = open(infile)
g = open(outfile, 'w')


# make string that extracts values for testing
input_items = []
for i in input.replace(',', ' ').split():
    input_items.append(columns[i])
input_items = ', '.join(str(i) for i in input_items)

var_assign = '%s = [eval(i) for i in itemgetter(%s)(line.split())]' % (input, input_items) 


# make string that specifies values for writing
output_items = []
for i in output.replace(',', ' ').split():
    output_items.append(columns[i])
output_items = ', '.join(str(i) for i in output_items)

output_values = 'itemgetter(%s)(line.split())' % output_items


# make string that specifies format for writing
string_format = []
for i in output.replace(',', ' ').split():
    string_format.append('%s')
string_format = ' '.join(string_format)+'\n'


# main loop
for line in f:
    exec(var_assign)
    if eval(conditions):
        g.write(string_format % tuple(eval(output_values)))
f.close()
g.close()
  • Can you show us the data file, and the desired output you want. You should also replace input.replace(',', ' ').split() by input.split(","). – imsc Feb 25 '11 at 20:29
  • Raw photometry file [here](http://homepages.spa.umn.edu/~jsimones/ugc4305_1.phot.tar.gz), and output file [here](http://homepages.spa.umn.edu/~jsimones/ugc4305_1_filt.phot.tar.gz). As for the `replace` statement, I guess I only have it in there just in case I forget to leave a space between the entries in my `input` or `output` strings. I agree with your suggestion though, it's cleaner. – Jake Feb 25 '11 at 20:52
  • These files are too big to download. Can you build an example with a few lines and columns. – imsc Feb 25 '11 at 21:02
  • Sorry for the delay - go [here](http://homepages.spa.umn.edu/~jsimones/test.txt) for a sample of the original photometry file, and [here](http://homepages.spa.umn.edu/~jsimones/test_filt.txt) for the output of my script. – Jake Feb 26 '11 at 04:41
  • Also, thank you to everyone who has helped me so far. I'm still digesting all the suggestions and trying to figure out what works best for me. I'm definitely learning a bunch of new stuff! New suggestions are still welcome, and I'll let people know what I find. – Jake Feb 26 '11 at 04:45

6 Answers


I don't think you mentioned it, but it looks like your data is in csv. You might get a lot out of using csv.DictReader. You can iterate over files 1 line at a time (avoiding loading the whole thing into memory) and refer to columns by their names.

You should also take a look at cProfile, Python's profiler, if you haven't already. It will tell you what bits of your program are taking the most time to execute.
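
For example, something like this would profile a full run, assuming the read/filter/write loop has been wrapped in a function (called process_data here just for illustration):

import cProfile
import pstats

# run the filtering under the profiler and dump the stats to a file;
# process_data(), infile and outfile are assumed to be defined in this script
cProfile.run('process_data(infile, outfile)', 'filter.prof')

# print the ten most expensive calls, sorted by cumulative time
pstats.Stats('filter.prof').sort_stats('cumulative').print_stats(10)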

nmichaels
  • Thanks for the tip on cProfile - I'll take a look at that, though I'm not sure I would know how to recode the slow bits. – Jake Feb 25 '11 at 20:22
  • Also, my data aren't csv, just plain white-space separated. The columns aren't named in the file; I have to name them myself, which is why I implemented the dictionary. Lastly, my understanding of `for line in f:` was that it iterates through line by line without loading the whole thing into memory. Either way, I've tested the for loop and it's very fast. It's the stuff inside that's slow (I could confirm that with cProfile I suppose), I just don't know alternative ways of writing it to make it faster as I'm relatively new to Python. – Jake Feb 25 '11 at 20:32
  • Sorry, I totally misunderstood your suggestion. I'll check out csv.DictReader. – Jake Feb 25 '11 at 21:52

My first step here would be to get rid of the exec() and eval() calls. Each time you eval a string it has to be compiled and then executed, adding overhead on every line of your file. Not to mention, eval tends to lead to messy, hard-to-debug code and should generally be avoided.

You can start refactoring by putting your logic into small, easily understandable functions. For example, you can replace eval(conditions) with a function:

def conditions(d):
    return (float(d[OBJ]) <= 2 and
            float(d[SNR1]) > 4 and
            float(d[SNR2]) > 4 and
            float(d[FLAG1]) < 8 and
            float(d[FLAG2]) < 8 and
            (float(d[SHARP1]) + float(d[SHARP2]))**2 < 0.075 and
            (float(d[CROWD1]) + float(d[CROWD2])) < 0.1)

Tip: if some of your conditionals have a higher probability of failing, put them first; the and chain short-circuits, so Python will skip evaluating the rest.

I would get rid of the dictionary of column names and simply set a bunch of variables (one per column number) at the top of your file, then refer to columns by line[COLNAME]. That simplifies parts like the conditions function, and you still get to refer to columns by name without having to assign every quantity on each line.
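
A rough sketch of that layout, with the column positions taken from the dictionary in the question (only two output columns are written here, just to keep the sketch short):

# column positions defined once at the top of the script
OBJ, SNR1, SNR2, FLAG1, FLAG2 = 10, 17, 33, 26, 42
SHARP1, SHARP2, CROWD1, CROWD2 = 18, 34, 20, 36
X, Y = 2, 3

out = open('ugc4305_1_filt.phot', 'w')
for raw in open('ugc4305_1.phot'):
    line = raw.split()
    if conditions(line):                  # the conditions() function from above
        out.write('%s %s\n' % (line[X], line[Y]))
out.close()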

JimB
  • I'll definitely try this approach. I like the tip about placement of conditionals, and it totally makes sense. I often have to do many tests for each line. – Jake Feb 25 '11 at 22:27

Here's how I would go about something like this...

This runs in ~35 secs vs. ~3 minutes for your original script on my machine. It is possible to add a few more optimizations (we only need to convert a few of the columns to floats, for example), but that only shaves a few seconds off of the run time.

You could also easily use csv.DictReader here, as several people have suggested. I'm avoiding it, as you'd have to define a custom dialect, and it's only a couple of extra lines to do the same thing without it. (The various csv module classes also check for more complex behavior (e.g. quoted strings, etc) that you don't need to worry about in this particular case. They're very, very handy in many cases, but they're slight overkill in this case.)

Note that you can also easily pass your infile and outfile names as arguments when you call the script instead of hardcoding them (i.e. infile = sys.argv[1], outfile = sys.argv[2]). This would also allow you to pipe data in or out easily... (You can check the length of sys.argv and set infile and/or outfile to sys.stdin and/or sys.stdout accordingly.)
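
A minimal sketch of that argument handling, without the stdin/stdout fallback (the script name in the usage message is just a placeholder); it could replace the hardcoded names in main() below:

import sys

# read the input and output file names from the command line
if len(sys.argv) != 3:
    sys.exit('usage: filter_phot.py infile outfile')
infile, outfile = sys.argv[1], sys.argv[2]
process_data(infile, outfile)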

def main():
    infile = 'ugc4305_1.phot'
    outfile = 'ugc4305_1_filt.phot'
    process_data(infile, outfile)

def filter_conditions(row):
    for key, value in row.iteritems():
        row[key] = float(value)

    cond = (row['OBJ'] <= 2 and row['SNR1'] > 4 
       and row['SNR2'] > 4 and row['FLAG1'] < 8 
       and row['FLAG2'] < 8 
       and (row['SHARP1'] + row['SHARP2'])**2 < 0.075 
       and (row['CROWD1'] + row['CROWD2']) < 0.1
       )
    return cond

def format_output(row):
    output_columns = ('X', 'Y', 'OBJ', 'COUNTS1', 'BG1', 'ACS1', 'ERR1', 'CHI1', 
                     'SNR1', 'SHARP1', 'ROUND1', 'CROWD1', 'FLAG1', 'COUNTS2', 
                     'BG2', 'ACS2', 'ERR2', 'CHI2', 'SNR2', 'SHARP2', 'ROUND2', 
                     'CROWD2', 'FLAG2')
    delimiter = '\t'
    return delimiter.join((row[name] for name in output_columns))

def process_data(infilename, outfilename):
    column_names = ('EXT', 'CHIP', 'X', 'Y', 'CHI_GL', 'SNR_GL', 'SHARP_GL', 
                    'ROUND_GL', 'MAJAX_GL', 'CROWD_GL', 'OBJ', 'COUNTS1', 
                    'BG1', 'ACS1', 'STD1', 'ERR1', 'CHI1', 'SNR1', 'SHARP1', 
                    'ROUND1', 'CROWD1', 'FWHM1', 'ELLIP1', 'PSFA1', 'PSFB1', 
                    'PSFC1', 'FLAG1', 'COUNTS2', 'BG2', 'ACS2', 'STD2', 
                    'ERR2', 'CHI2', 'SNR2', 'SHARP2', 'ROUND2', 'CROWD2', 
                    'FWHM2', 'ELLIP2', 'PSFA2', 'PSFB2', 'PSFC2', 'FLAG2')

    with open(infilename) as infile:
        with open(outfilename, 'w') as outfile:
            for line in infile:
                line = line.strip().split()
                row = dict(zip(column_names, line))
                if filter_conditions(row.copy()):
                    outfile.write(format_output(row) + '\n')

if __name__ == '__main__':
    main()
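
The "convert only a few of the columns to floats" optimization mentioned above could look roughly like this, as a drop-in replacement for filter_conditions (with this version the row.copy() in process_data is no longer needed, since the row is never modified in place):

# only the columns that actually appear in the test
FILTER_COLUMNS = ('OBJ', 'SNR1', 'SNR2', 'FLAG1', 'FLAG2',
                  'SHARP1', 'SHARP2', 'CROWD1', 'CROWD2')

def filter_conditions(row):
    # convert just the columns used in the test instead of the whole row
    vals = dict((name, float(row[name])) for name in FILTER_COLUMNS)
    return (vals['OBJ'] <= 2 and vals['SNR1'] > 4
            and vals['SNR2'] > 4 and vals['FLAG1'] < 8
            and vals['FLAG2'] < 8
            and (vals['SHARP1'] + vals['SHARP2'])**2 < 0.075
            and (vals['CROWD1'] + vals['CROWD2']) < 0.1)
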
Joe Kington
  • Awww, you took all the fun out of it for him - Well done though ;) – JimB Feb 26 '11 at 02:12
  • Holy cajones! This takes just over 70 s to run on my laptop, compared to AWK which is around 60 s - that's a speed improvement. AND the input is flexible. I think this just about incorporates everything that was suggested by JimB, nmichaels, and tkerwin. You can be sure that I'll be studying this all weekend to figure out how/why it works. I'm also going to experiment with csv.DictReader as others have suggested. Thank you so much! – Jake Feb 26 '11 at 05:54
  • Some quick questions: when `line` is converted into a list in `process_data`, what is the purpose of `strip()` - don't you get the same list with `line.split()`? Also, why does `filter_conditions` get a copy of `row`, while `format_output` gets the original `row` dictionary? Thanks – Jake Feb 27 '11 at 01:34
  • @Jake - `line.strip()` removes the trailing whitespace (i.e. `\n`) on the line. You can take it out if your delimiter is always whitespace. I was including it out of habit (If you're splitting on commas, for example, it's necessary. `line.split()`, will effectively remove the trailing `\n`, so it's not required in this case.) `filter_conditions` gets a copy so that the items in `row` can be converted to floats inside of the `filter_conditions` function without modifying the original strings. I was replicating the original functionality, so this code doesn't attempt to reformat the output. – Joe Kington Feb 27 '11 at 04:09
  • @JimB - Thanks! @Jake - Hope it helped some anyway... Jim is right, though, probably better to learn by doing than learn by example. :) – Joe Kington Feb 27 '11 at 04:13

As nmichaels said, you can use the fieldnames and dialect parameters of csv.DictReader to read this file. Then, for each line, you will have a dictionary. With the dictionary, you won't have to use eval, and can use statements like

if float(data_item['OBJ']) <= 2 and float(data_item['SNR1']) > 4:
    g.write('%s %s %s\n' % (data_item['X'], data_item['Y'], data_item['OBJ']))

The way you're doing it now is slow and complicated because of all the evals. There's no need for that complexity.

tkerwin
  • OK, but I'd like to keep the conditions of my `if` statement arbitrary, which is why I resorted to a string and `eval(conditions)`. I like the idea of using csv.DictReader since it cleans things up, but I'm not sure how to get around `eval` if I want arbitrary `if` statements. – Jake Feb 25 '11 at 22:22
  • You mean you want it to depend on user input? In that case you'll probably want to generate a function based on a config file. If you stick with `eval`, at least use `compile` so you don't parse the string every iteration. – tkerwin Feb 26 '11 at 02:01
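
A minimal sketch of that compile() suggestion, keeping the condition as a string but parsing it only once (f, g, conditions and column_names are assumed to be set up as elsewhere in this thread):

# compile the condition string once, outside the loop
cond_code = compile(conditions, '<conditions>', 'eval')

for line in f:
    # build a name -> float dictionary for the current line
    row = dict((name, float(v)) for name, v in zip(column_names, line.split()))
    if eval(cond_code, {}, row):    # names in the condition string resolve via row
        g.write('%(X)s %(Y)s %(OBJ)s\n' % row)   # write whichever columns you want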

Have you tried pandas?

I believe OBJ, SNR1, ... are column names, and I assume you are applying the same conditions to all your rows. If that's the case, I suggest you go with pandas.

Your code snippet would go something like this...

import pandas as pd

infile = 'ugc4305_1.phot'
outfile = 'ugc4305_1_filt.phot'

# the photometry file is whitespace-separated with no header row, so the full
# list of 43 column names from the question has to be supplied here
df = pd.read_csv(infile, delim_whitespace=True, header=None, names=allColumnNames)

condition = ((df['OBJ'] <= 2) & (df['SNR1'] > 4) & (df['SNR2'] > 4)
             & (df['FLAG1'] < 8) & (df['FLAG2'] < 8)
             & ((df['SHARP1'] + df['SHARP2'])**2 < 0.075)
             & ((df['CROWD1'] + df['CROWD2']) < 0.1))
newDf = df[condition]

columnNames = ['col1', 'col2', ...] # column names you want in result

newDf = newDf[columnNames]

newDf.to_csv(outfile, sep=' ', index=False)  # whitespace-separated output, no index column

If profiling shows that a lot of time is spent on the actual reading and parsing of the files, and you will process the same raw file many times, you can try to create an intermediate file format optimized for reading with Python.

One thing to try is to read the file once, parse it, and dump the result with pickle/cPickle. Then read the intermediate file with pickle/cPickle in your filter script.

I don't know Python well enough to tell whether this will be faster than reading and splitting each line. (In C# I would use a binary serializer, but I don't know if an equivalent is available in Python.)

If disk IO is a bottleneck, you may also try gzipping your input files and reading them with the gzip module.
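
A minimal sketch of the pickle idea (file names are just examples, and the whole table has to fit in memory for this particular version):

import cPickle as pickle

# one-time conversion: parse the raw photometry file and cache the split rows
raw = open('ugc4305_1.phot')
cache = open('ugc4305_1.pkl', 'wb')
pickle.dump([line.split() for line in raw], cache, pickle.HIGHEST_PROTOCOL)
cache.close()
raw.close()

# the filter script then loads the pre-split rows instead of re-parsing the text
rows = pickle.load(open('ugc4305_1.pkl', 'rb'))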

Albin Sunnanbo