
The following command works fine when executed from a shell. I would like to do the same thing (or rather, get the same output) in a Python script, but no matter what I do I always end up with quotation-mark errors. I have tried implementing this with os.system..., subprocess.Popen..., shlex.split... without any luck.

comm -13 <(grep -e 77772 -e 77778 -e 777710 myfile1.dat |
             awk 'BEGIN {FS=";"} ; {print $8 "," $1}' | 
             sort -t '.' -k 1,1 -k 2,2) \
         <(grep -e 77772 -e 77778 -e 777710 myfile2.dat |
             awk 'BEGIN {FS=";"} ; {print $8 "," $1}' |
             sort -t '.' -k 1,1 -k 2,2) |
      tee output.dat

(I am basically selecting lines from two files that contain 77772 or 77778 or 777710, selecting two columns (column1 and column8) from those lines, sorting them to find lines that are unique to myfile2.dat - and write those lines to output.dat).

Is there a simpler way to do this?

    _"But no matter what I do I always end up with some quotation-mark-errors"_. Please show us some code that produces a quotation mark error. – Kevin Jun 10 '14 at 16:57
  • How big are the files? It's probably simpler to just read the relevant data from `myfile1.dat` into memory, then iterating over `myfile2.dat` and writing unique lines to standard out and `output.dat` as you find them. Don't fork out what you can easily do in Python. – chepner Jun 10 '14 at 17:09

1 Answer


The actual question is easy to answer:

subprocess.call(['bash', '-c',
                 '''comm -13 '''
                 ''' <(grep -e 77772 -e 77778 -e 777710 myfile1.dat | '''
                 '''    awk 'BEGIN {FS=";"} ; {print $8 "," $1}' | '''
                 '''    sort -t '.' -k 1,1 -k 2,2) '''
                 ''' <(grep -e 77772 -e 77778 -e 777710 myfile2.dat | '''
                 '''    awk 'BEGIN {FS=";"} ; {print $8 "," $1}' | '''
                 '''    sort -t '.' -k 1,1 -k 2,2) '''
                 ''' | tee output.dat'''],
               )

This passes your entire pipeline as a single string to an instance of bash. (It uses implicit joining of adjacent string literals for readability.)
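If you'd rather pass the pipeline as one string, a sketch of the same idea uses `shell=True`. Note that `shell=True` runs `/bin/sh` by default, which has no process substitution, so the `executable` argument must point at bash explicitly. The `run_pipeline` helper and the `/bin/bash` path are assumptions here, not part of the original answer:

```python
import subprocess

# The same pipeline as a single string; process substitution <(...)
# is a bash feature, so /bin/sh would reject it.
PIPELINE = (
    "comm -13 "
    "<(grep -e 77772 -e 77778 -e 777710 myfile1.dat"
    " | awk 'BEGIN {FS=\";\"} ; {print $8 \",\" $1}'"
    " | sort -t '.' -k 1,1 -k 2,2) "
    "<(grep -e 77772 -e 77778 -e 777710 myfile2.dat"
    " | awk 'BEGIN {FS=\";\"} ; {print $8 \",\" $1}'"
    " | sort -t '.' -k 1,1 -k 2,2) "
    "| tee output.dat"
)

def run_pipeline():
    # shell=True + executable='/bin/bash': run the string under bash,
    # not the default /bin/sh. Returns the pipeline's exit status.
    return subprocess.call(PIPELINE, shell=True, executable='/bin/bash')
```

Either form hands the whole string to bash unchanged, so the shell quoting inside the awk and sort arguments never has to be escaped at the Python level.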

However, I'd recommend implementing this entirely in Python rather than forking multiple processes. Read the relevant data from myfile1.dat into memory (assuming it isn't too large), then process myfile2.dat one line at a time, outputting each line whose fields are not found in the data you read from myfile1.dat.
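A rough sketch of that pure-Python approach (the `extract` and `unique_to_second` names are made up here, and a plain `sorted()` stands in for the shell's `sort -t '.' -k 1,1 -k 2,2`; like `grep`, the key test is a substring match, so e.g. 77772 also matches 777725):

```python
KEYS = ('77772', '77778', '777710')

def extract(path):
    """Yield 'column8,column1' for each ';'-separated line containing a key."""
    with open(path) as f:
        for line in f:
            if any(k in line for k in KEYS):
                fields = line.rstrip('\n').split(';')
                yield fields[7] + ',' + fields[0]

def unique_to_second(file1, file2, outpath='output.dat'):
    """Like comm -13: records from file2 that never appear in file1."""
    seen = set(extract(file1))
    result = [r for r in sorted(extract(file2)) if r not in seen]
    with open(outpath, 'w') as out:
        for r in result:
            print(r)             # mimic tee: write to stdout ...
            out.write(r + '\n')  # ... and to output.dat
    return result
```

Because membership is tested against a set rather than with `comm`, file1's records don't even need to be sorted; only file2's output order matters.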

chepner
  • Thank you for your answer! The files are just over 5k lines. I will do as you suggested; to read the files into memory. – user2788485 Jun 10 '14 at 17:27