
I am trying to read a GTF file and then edit it (using subprocess with grep and awk) before loading it into pandas.

My file has header info (lines indicated by #), so I need to grep those lines and remove them first. I can do it in Python, but I want to introduce grep into my pipeline to make processing more efficient.

I tried doing:

import subprocess
from io import StringIO

gtf_file = open('chr2_only.gtf', 'r').read()
gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)

and

gtf_update = subprocess.Popen(["grep '^#' " + gtf_file], shell=True)

Both of these attempts throw an error; for the first one it was:

Traceback (most recent call last):
  File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 39, in <module> gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
TypeError: Can't convert '_io.StringIO' object to str implicitly

However, if I specify the filename directly it works:

gtf_update = subprocess.Popen(["grep '^#' chr2_only.gtf"], shell=True)

and the output is:

<subprocess.Popen object at 0x7fc12e5ea588>
#!genome-build v.1.0
#!genome-version JGI8X
#!genome-date 2008-12
#!genome-build-accession GCA_000004255.1
#!genebuild-last-updated 2008-12

Could someone please provide different examples for a problem like this, and also explain why I am getting the error and why/how it would be possible to run subprocess directly on data loaded into memory?

I also tried using subprocess with call, check_call, check_output, etc., but I got several different error messages, like these:

OSError: [Errno 7] Argument list too long

and

Subprocess in Python: File Name too long
  • I've seen "Subprocess in Python: File Name too long" before. This happened to me when I was unintentionally passing the text within an open file as an argument to Popen instead of a string for the filename path. When you assign `gtf_update = subprocess.Popen(["grep '^#' chr2_only.gtf"], shell=True)`, you can then output that command using `gtf_update.communicate()` – JacobIRR Apr 14 '17 at 16:44
  • In your `grep` command, you have to pass the *filename*, as you do in your working example. In the first ones, you try to add a StringIO object or a file object to the string `"grep '^#' "`, and Python can't convert them to a string. Just append the filename instead. – Thierry Lathuille Apr 14 '17 at 16:44
  • But, I don't want to read the file name in the `subprocess`. The reason being I have to use this file for different branches of the pipeline. So, I want to read it in a pythonic way `data = open(....).read()`, then use this `data` to do different things. Just trying to save memory and time by not reading the same files over and over again. – everestial007 Apr 14 '17 at 16:50
  • 1
    If you read the file's lines into Python eventually, you likely have nothing to win by using an external `grep` over trivial `if not line.startswith('#')` Python code. If you have a non-trivial pipeline with `grep` and `awk`, write a shell script and make it work first (with `awk` you don' need `grep`). If `grep` eliminated most of the input lines in a non-trivial way, it could benefit performance. Is it so in your case? Try avoiding premature optimization, that is, optimization prior to actually (not theoretically) detecting a bottleneck. – 9000 Apr 14 '17 at 17:28
  • Hmm, something to think about ! I am trying different things. Lets see how `awk` responds. – everestial007 Apr 14 '17 at 17:33
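As 9000's comment suggests, the grep step can be replaced by a trivial Python filter once the data is already in memory. A minimal sketch, using an inline sample in place of chr2_only.gtf (the sample data is made up for illustration):

```python
from io import StringIO

# Stand-in for open('chr2_only.gtf') -- hypothetical sample content.
sample = StringIO(
    "#!genome-build v.1.0\n"
    "chr2\tJGI\texon\t100\t200\t.\t+\t.\tgene_id \"g1\";\n"
)

# Keep only the data lines, dropping '#' headers -- no external grep needed.
data_lines = [line for line in sample if not line.startswith('#')]
```

Iterating the file (or StringIO) object line by line also avoids holding two copies of the data the way `read()` plus a subprocess pipe would.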

1 Answer


Here is a possible solution that allows you to send a string to grep. Essentially, you declare in the Popen constructor that you want to communicate with the called program via stdin and stdout. You then send the input via communicate and receive the output as the return value from communicate.

#!/usr/bin/python

import subprocess

gtf_file = open('chr2_only.gtf', 'r').read()

# universal_newlines=True lets communicate() exchange str instead of bytes
# (in Python 3.7+ this can also be spelled text=True).
gtf_update = subprocess.Popen(["grep '^#'"], shell=True,
                              stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                              universal_newlines=True)

# communicate() returns (stdout, stderr); stderr is None here because it
# was not redirected.
gtf_filtered, _ = gtf_update.communicate(gtf_file)

print(gtf_filtered)

Note that it is wise not to use shell=True. Therefore, the Popen line should be written as

gtf_update = subprocess.Popen(["grep", "^#"], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, universal_newlines=True)

The rationale is that you don't need the shell to parse the arguments to a single executable. So you avoid unnecessary overhead. It is also better from a security point of view, at least if some argument is potentially unsafe as it comes from a user (think of a filename containing |). (This is obviously not the case here.)
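For the original use case of letting grep read the file itself, the same list form works with check_output. A sketch, writing a small temporary stand-in file since chr2_only.gtf is not available here:

```python
import os
import subprocess
import tempfile

# Create a small stand-in for chr2_only.gtf (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.gtf', delete=False) as tmp:
    tmp.write("#!genome-build v.1.0\n"
              "chr2\tJGI\texon\t100\t200\t.\t+\t.\tgene_id \"g1\";\n")
    path = tmp.name

# grep, pattern, and filename are separate argv entries -- no shell parsing.
header = subprocess.check_output(["grep", "^#", path],
                                 universal_newlines=True)
os.unlink(path)
```

Because only the filename is passed on the command line, this also sidesteps the "Argument list too long" error, which arises when the entire file content is glued into the command string.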

Note that from a performance point of view, I expect that reading the file directly with grep is faster than first reading the file with python, and then sending it to grep.

W.Mann