I am trying to read a GTF file and then edit it (using subprocess, grep and awk) before loading it into pandas.
I have a file that has header lines (indicated by #), so I need to grep those and remove them first. I can do it in Python, but I want to introduce grep into my pipeline to make processing more efficient.
I tried doing:
import subprocess
from io import StringIO
gtf_file = open('chr2_only.gtf', 'r').read()
gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
and
gtf_update = subprocess.Popen(["grep '^#' " + gtf_file], shell=True)
Both of these attempts throw an error. For the first attempt it was:
Traceback (most recent call last):
  File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 39, in <module>
    gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
TypeError: Can't convert '_io.StringIO' object to str implicitly
However, if I specify the filename directly it works:
gtf_update = subprocess.Popen(["grep '^#' chr2_only.gtf"], shell=True)
and the output is:
<subprocess.Popen object at 0x7fc12e5ea588>
#!genome-build v.1.0
#!genome-version JGI8X
#!genome-date 2008-12
#!genome-build-accession GCA_000004255.1
#!genebuild-last-updated 2008-12
Could someone please provide different examples for a problem like this, and also explain why I am getting the error and whether (and how) it is possible to run subprocess directly on file contents that are already loaded into memory?
I also tried using subprocess with call, check_call, check_output, etc., but I've gotten several different error messages, like these:
OSError: [Errno 7] Argument list too long
and
Subprocess in Python: File Name too long
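One possible reading of the errors above: in both attempts the command string is built by appending either a StringIO object or the entire file contents to "grep '^#' ", so the shell would receive the text itself as arguments rather than a file name, which would be consistent with "Argument list too long" and "File name too long". A child process normally gets its data either from a path or from its standard input, so the sketch below feeds an in-memory string to grep through stdin. The names `gtf_text` and `grep_text` and the sample data line are made up for illustration, and the code assumes Python 3.5+ with a `grep` binary on the PATH. Note that `grep '^#'` keeps the header lines, while `grep -v '^#'` drops them.

```python
import subprocess

# Hypothetical stand-in for the contents of chr2_only.gtf already read into memory.
gtf_text = (
    "#!genome-build v.1.0\n"
    "#!genome-version JGI8X\n"
    "2\tensembl\tgene\t1000\t2000\t.\t+\t.\tgene_id \"g1\";\n"
)


def grep_text(pattern, text, invert=False):
    """Run grep over an in-memory string by sending it to grep's stdin."""
    cmd = ["grep"]
    if invert:
        cmd.append("-v")  # -v selects the lines that do NOT match
    cmd += ["-e", pattern]
    # No shell=True: the pattern is passed as a plain argument, the text via stdin.
    result = subprocess.run(
        cmd,
        input=text,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange str instead of bytes
    )
    # grep exits with 1 when nothing matches; only codes above 1 signal a real error.
    if result.returncode > 1:
        raise subprocess.CalledProcessError(result.returncode, cmd)
    return result.stdout


header = grep_text("^#", gtf_text)             # only the '#' lines
body = grep_text("^#", gtf_text, invert=True)  # everything except the '#' lines
```

If this approach fits, `body` could then be wrapped in `StringIO(body)` and handed to pandas. Whether going through grep is actually faster than filtering the lines in Python would need to be measured on the real file.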