python - How to subprocess the files on console directly (with or without using StringIO)?


I am trying to read a GTF file and edit it (using subprocess, grep, and awk) before loading it into pandas.

I have a file that has header information (lines starting with #), which I need to grep out and remove first. I can do this in Python, but I want to introduce grep into my pipeline to make the processing more efficient.

I tried doing:

    import subprocess
    from io import StringIO

    gtf_file = open('chr2_only.gtf', 'r').read()
    gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)

and

    gtf_update = subprocess.Popen(["grep '^#' " + gtf_file], shell=True)

Both of these attempts throw an error. For the first attempt it was:

    Traceback (most recent call last):
      File "/home/everestial007/pycharmprojects/stitcher/phase-stitcher-markov/markov_final_test/phase_to_vcf.py", line 39, in <module>
        gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
    TypeError: Can't convert '_io.StringIO' object to str implicitly

However, if I specify the filename directly, it works:

    gtf_update = subprocess.Popen(["grep '^#' chr2_only.gtf"], shell=True)

and the output is:

    <subprocess.Popen object at 0x7fc12e5ea588>
    #!genome-build v.1.0
    #!genome-version jgi8x
    #!genome-date 2008-12
    #!genome-build-accession gca_000004255.1
    #!genebuild-last-updated 2008-12

Could someone please provide different examples for a problem like this, and explain why I am getting the error and why/how it is possible to run subprocess directly on files that are already loaded on the console/in memory?

I also tried using subprocess with call, check_call, check_output, etc., but I've gotten several different error messages, like these:

    OSError: [Errno 7] Argument list too long

and

    Subprocess in Python: File Name too long

Here is a possible solution that allows you to send a string to grep. Essentially, you declare in the Popen constructor that you want to communicate with the called program via stdin and stdout. You then send the input via communicate and receive the output as the return value of communicate.

    #!/usr/bin/python

    import subprocess

    gtf_file = open('chr2_only.gtf', 'r').read()
    gtf_update = subprocess.Popen(["grep '^#' "], shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    # communicate() returns (stdout, stderr), but the latter is empty here
    gtf_filtered, _ = gtf_update.communicate(gtf_file)

    print gtf_filtered
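The snippet above uses Python 2 syntax (the print statement, and a str passed to communicate), while the traceback in the question comes from Python 3. A minimal Python 3 sketch of the same idea might look like the following. It also shows why the original attempts failed: they glued the file contents onto the command line, so grep received the text as arguments (hence "Argument list too long") instead of on stdin. The function name `grep_header_lines` and the sample text are made up for illustration, and grep is assumed to be on the PATH.

```python
import subprocess

def grep_header_lines(gtf_text):
    """Return the lines of gtf_text that start with '#', using grep via stdin."""
    result = subprocess.run(
        ["grep", "^#"],           # arguments as a list, no shell involved
        input=gtf_text,           # sent to grep's stdin, not placed on the command line
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange str instead of bytes
    )
    # grep exits with status 1 when nothing matches; that is not treated as an error here
    return result.stdout

# Made-up sample standing in for the contents of chr2_only.gtf
sample = "#!genome-build v.1.0\nchr2\tsrc\tgene\t1\t100\n#!genome-date 2008-12\n"
print(grep_header_lines(sample), end="")
```

Because the text travels through a pipe, its size is not limited by the operating system's cap on command-line length.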

Note that it is wise not to use shell=True. Therefore, the Popen line should be written as

    gtf_update = subprocess.Popen(["grep", '^#'], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

The rationale is that you don't need the shell to parse the arguments to a single executable, so you avoid unnecessary overhead. It is also better from a security point of view, at least if an argument is potentially unsafe because it comes from a user (think of a filename containing |). (This is not the case here.)
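To illustrate the security point, here is a small sketch showing that an argument passed in list form reaches the child process literally, with no shell interpreting the | character. The helper `echo_argument` and the suspicious filename are hypothetical; the child process is simply another Python interpreter that writes back its first argument.

```python
import subprocess
import sys

def echo_argument(arg):
    """Run a child Python process that writes its first argument back."""
    child_code = "import sys; sys.stdout.write(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child_code, arg],  # list form: no shell parsing
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout

# Hypothetical hostile "filename"; under shell=True the part after | would run as a command
suspicious_name = "data.gtf | echo injected"
print(echo_argument(suspicious_name))
```

If a shell is genuinely needed, `shlex.quote` from the standard library can be used to escape a single argument before building the command string.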

Note that from a performance point of view, I expect that letting grep read the file directly is faster than first reading the file with Python and then sending it to grep.
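A sketch of that direct approach, under some assumptions: grep is on the PATH, the goal is to drop the header lines (so `-v` inverts the match), and the result is wrapped in `io.StringIO` so it can be handed to anything that expects a file-like object. The function name `grep_data_lines` and the miniature GTF file are invented for the demonstration.

```python
import io
import os
import subprocess
import tempfile

def grep_data_lines(path):
    """Let grep read path itself and return the non-header lines as a file-like object."""
    result = subprocess.run(
        ["grep", "-v", "^#", path],  # -v inverts the match: drop lines starting with '#'
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return io.StringIO(result.stdout)

# Made-up miniature GTF file for demonstration
with tempfile.NamedTemporaryFile("w", suffix=".gtf", delete=False) as handle:
    handle.write("#!genome-build v.1.0\nchr2\tsrc\tgene\t1\t100\n")
    demo_path = handle.name

data = grep_data_lines(demo_path)
print(data.read(), end="")
os.remove(demo_path)
```

The returned object could then be passed to `pandas.read_csv`, which accepts file-like objects; that call is left out here to keep the sketch limited to the standard library. `read_csv` also has a `comment` parameter, which may make the grep step unnecessary for simple header stripping.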

