
I have a large directory of *.gz files, and I need to find all the files that contain a required ID on a given date. My code works as expected, but it is very slow. I was curious if anybody knew of a faster way to process these files with Python 3.8.2. Any help would be appreciated.

Snippet of code:

import os
import subprocess

def proc_data(id, date):
    # dir is assumed to be the directory holding the *.gz files, set elsewhere
    os.chdir(dir)
    # match any file name that contains the date, e.g. *_20201022_*
    pattern = '*_' + date + '_*'
    # getoutput runs the command through the shell, which expands the glob
    return subprocess.getoutput('zgrep %s %s' % (id, pattern))
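
For reference, a rough example of how the function would presumably be called (the ID here is made up; the date is taken from the sample file names below):

output = proc_data('SOME_ID', '20201022')  # returns zgrep's combined output as one string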

Sample of file names in the directory:

RADT_App_20201022_0002.dat.gz

amof_app_order_20201012_4.dat.gz

Will_Roberts
  • Since all that your code does is call one external command, there is little hope of making it faster in Python, except by making the external command itself faster. – mkrieger1 Oct 30 '20 at 16:43
  • Wait, that works? Without `shell=True`? – Charles Duffy Oct 30 '20 at 16:57
  • (...also, from a security perspective, that code -- if modified to have `shell=True`, which it won't work without as currently written -- is rather problematic; you don't want to run that code if someone can tell your system to do a search with an id of `$(rm -rf ~)`, or a reverse shell embedded, etc). – Charles Duffy Oct 30 '20 at 16:58
  • 1
    That code searches one file at a time. You might consider using glob.glob and multiprocessing. – Dan D. Oct 30 '20 at 17:00
  • 1
    ...but if you want genuinely _fast_ text search, that requires indexing. Full-text indexing is something there are numerous tools/libraries/database servers/etc. for, but library and tool recommendations are off-topic here. – Charles Duffy Oct 30 '20 at 17:00
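
A minimal sketch of the glob.glob + multiprocessing approach suggested by Dan D. in the comments above, assuming the ID and date are plain strings and the files live in a hypothetical data_dir. It avoids the shell entirely (addressing the injection concern raised above) by decompressing and searching each file in Python, one worker per CPU core:

import glob
import gzip
import os
from multiprocessing import Pool

def search_file(args):
    # Return the path if any line in the gzipped file contains the ID.
    path, record_id = args
    with gzip.open(path, 'rt', errors='replace') as fh:
        for line in fh:
            if record_id in line:
                return path
    return None

def proc_data(record_id, date, data_dir):
    # Build the candidate list with glob instead of letting a shell expand it.
    pattern = os.path.join(data_dir, '*_' + date + '_*.gz')
    files = glob.glob(pattern)
    # Search the candidate files in parallel.
    with Pool() as pool:
        results = pool.map(search_file, [(f, record_id) for f in files])
    return [path for path in results if path is not None]

if __name__ == '__main__':
    # Hypothetical ID and directory; the date matches the sample file names above.
    print(proc_data('SOME_ID', '20201022', '/path/to/gz/files'))

Whether this beats zgrep per file depends on the machine; the main gain comes from keeping all cores busy rather than from Python itself, and as noted above, truly fast repeated lookups would need an index.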

0 Answers