
I have a simple script that reads a ~2 GB data file, extracts the columns I need, and writes them as columns to another file for later processing. I ran it last night and it took close to nine hours to complete. Running the two parts separately showed that the part that writes the data to the new file is the bottleneck. Can anyone point out why it is so slow as written, and suggest a better method?

Sample of the data being read in:

26980300000000  26980300000000  39  13456502685696  1543    0
26980300000001  26980300000000  38  13282082553856  1523    0.01
26980300000002  26980300000000  37  13465223692288  1544    0.03
26980300000003  26980300000000  36  13290803560448  1524    0.05
26980300000004  26980300000000  35  9514610851840   1091    0.06
26980300000005  26980300000000  34  9575657897984   1098    0.08
26980300000006  26980300000000  33  8494254129152   974     0.1
26980300000007  26980300000000  32  8520417148928   977     0.12
26980300000008  26980300000000  31  8302391459840   952     0.14
26980300000009  26980300000000  30  8232623931392   944     0.16

Code

from itertools import islice

F = r'C:\Users\mass_red.csv'

def filesave(TID,M,R):     
  X = str(TID)
  Y = str(M)
  Z = str(R) 
  w = open(r'C:\Users\Outfiles\acc1_out3.txt','a')
  w.write(X)
  w.write('\t')
  w.write(Y)
  w.write('\t')
  w.write(Z)
  w.write('\n')
  w.close()
  return()

N = 47000000
f = open(F)           
f.readline()          
nlines = islice(f, N) 

for line in nlines:                 
 if line !='':
      line = line.strip()         
      line = line.replace(',',' ') 
      columns = line.split()       
      tid = int(columns[1])
      m = float(columns[3])  
      r = float(columns[5])             
      filesave(tid,m,r)
Stripers247
  • unrelated: use `print(TID, M, R, sep='\t', file=w)` instead of 6 `w.write()` calls. – jfs Feb 16 '15 at 08:57
  • also `return()` returns an empty tuple. You can drop this line. – jfs Feb 16 '15 at 08:58
  • There are 6 columns in your example but `column[6]` tries to access 7th column (Python indexes start at 0). – jfs Feb 16 '15 at 09:00
  • @J.F.Sebastian, can I drop `return()` whenever the function does not actually return a value? Also you are correct about the indexing, I copied the sample data from a smaller testing set, but the code is from the set I work with. I have edited the code to reflect this. – Stripers247 Feb 16 '15 at 16:56
  • Every function in Python returns a value. If you don't need to return anything then return `None`: a bare `return` returns `None`, and if you omit the `return` statement completely the function also returns `None`. Note: an empty tuple `()` is not `None`. – jfs Feb 17 '15 at 01:49
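
The point made in the last comment can be checked with a small sketch. The three function names below are hypothetical and exist only for this comparison; they are not part of the question's code.

```python
def with_tuple():
    # Same form as the `return()` in the question: returns an empty tuple.
    return ()


def bare_return():
    # A bare `return` returns None.
    return


def no_return():
    # Falling off the end of a function also returns None.
    pass


print(repr(with_tuple()))   # the empty tuple
print(repr(bare_return()))  # None
print(repr(no_return()))    # None
```

Since nothing in the question uses the return value of `filesave`, the `return()` line can simply be dropped.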

3 Answers


You open and close the file for each line. Open it once at the beginning.
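
As a rough sketch of that change, assuming the same three columns as in the question: the function name `extract_columns` and its parameters are hypothetical, and the short-line check is an added assumption, not something the question does.

```python
from itertools import islice


def extract_columns(in_path, out_path, max_lines=47000000):
    """Write columns 1, 3 and 5 of in_path to out_path, tab-separated.

    The output file is opened once, before the loop, rather than once per row.
    Returns the number of rows written.
    """
    written = 0
    # 'a' (append) matches the mode used in the question.
    with open(in_path) as src, open(out_path, 'a') as dst:
        next(src, None)  # skip the first line, like f.readline() in the question
        for line in islice(src, max_lines):
            columns = line.replace(',', ' ').split()
            if len(columns) < 6:
                continue  # assumption: blank or short lines are skipped
            tid = int(columns[1])
            m = float(columns[3])
            r = float(columns[5])
            dst.write('%d\t%s\t%s\n' % (tid, m, r))
            written += 1
    return written
```

It could be called as `extract_columns(r'C:\Users\mass_red.csv', r'C:\Users\Outfiles\acc1_out3.txt')`. With a single open file object, writes go through Python's buffer instead of triggering an open and a close for every row.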

StenSoft

In modern Python, most file handling should be done with `with` statements. The file is visibly opened once, in the header, and closed automatically when the block ends. Here is a general template for line processing.

inp = r'C:\Users\mass_red.csv'
out = r'C:\Users\Outfiles\acc1_out3.txt'
with open(inp) as fi, open(out, 'a') as fo:
    for line in fi:
        ...
        if keep:
            ...
            fo.write(whatever)
Terry Jan Reedy

Here's a simplified but complete version of your code:

#!/usr/bin/env python
from __future__ import print_function
from itertools import islice

nlines_limit = 47000000
with open(r'C:\Users\mass_red.csv') as input_file, \
     open(r'C:\Users\Outfiles\acc1_out3.txt', 'w') as output_file:
    next(input_file)  # skip the first (header) line
    for line in islice(input_file, nlines_limit):
        columns = line.split()       
        try:
            tid = int(columns[1])
            m = float(columns[3])  
            r = float(columns[5])             
        except (ValueError, IndexError):
            pass # skip invalid lines
        else:
            print(tid, m, r, sep='\t', file=output_file)

I don't see commas in your input, so I've removed `line.replace(',', ' ')` from the code.

jfs
  • Just reread your answer. The file I use as input is a CSV file; when I copied and pasted the sample data, only the column values got copied. – Stripers247 Feb 17 '15 at 14:41
  • @Surfcast23: if the separator is a comma then use `line.split(',')` instead of `line.split()`. `int` and `float` ignore leading/trailing whitespace. If there could be quoted fields such as `1,"a, b",2` then use the `csv` module to parse the file. – jfs Feb 17 '15 at 15:06
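
The difference described in the last comment can be seen in a short sketch. The sample line is the one from the comment; the variable names are illustrative only.

```python
import csv
import io

line = '1,"a, b",2'

# A plain split breaks the quoted field at its embedded comma.
naive = line.split(',')

# csv.reader treats the quoted field as a single value.
parsed = next(csv.reader(io.StringIO(line + '\n')))

print(naive)   # four pieces, with the quoted field cut in two
print(parsed)  # three fields, quotes removed

# int and float accept surrounding whitespace, so fields need no strip().
print(int(' 42 '), float(' 0.01\n'))
```

For a file without quoted fields, `line.split(',')` is enough. The `csv` module only becomes necessary once commas can appear inside a field.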