Using Python to search multiple text files for matches to a list of strings

Question

So I am starting from scratch on a program that I haven't really seen replicated anywhere else. I'll describe exactly what I want it to do:

I have a list of strings that looks like this:

12482-2958
02274+2482
23381-3857
..........

I want to take each of these strings and search through a few dozen files (all named wds000.dat, wds005.dat, wds010.dat, etc) for matches. If one of them finds a match, I want to write that string to a new file, so in the end I have a list of strings that had matches.

If I need to be more clear about something, please let me know. Any help on where to start with this would be much appreciated. Thanks guys and gals!

The standard/most obvious approach to this in Python is to start with a list of filenames (perhaps created by `glob.glob()`) and iterate through that list. For each filename, open the file and then iterate through the lines of text in it.... But until you make a start on attempting it and can identify a specific barrier to progress, StackOverflow isn't the place to get the right help. — jez, May 02 '16 at 20:02
If on unix, it would be better, simpler and faster to make use of sed, grep or awk. Why Python? — Spade, May 02 '16 at 20:04
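A minimal sketch of the approach jez describes, assuming Python 3 and that the data files can be read as Latin-1 text. The function name `strings_with_matches` and its parameters are made up for illustration. It collects each search string that appears in at least one file, which is what the question asks for.

```python
import glob
import os


def strings_with_matches(strings, directory, pattern="wds*.dat"):
    """Return the search strings that occur in at least one matching file."""
    remaining = list(strings)  # strings not yet seen in any file
    found = []
    for filename in sorted(glob.glob(os.path.join(directory, pattern))):
        if not remaining:
            break  # every string already matched, no need to read more files
        # latin-1 is an assumption; it decodes any byte sequence without errors
        with open(filename, "r", encoding="latin-1") as handle:
            for line in handle:
                for s in [s for s in remaining if s in line]:
                    found.append(s)
                    remaining.remove(s)
    return found
```

For example, `strings_with_matches(["12482-2958", "02274+2482"], "path/to/data")` would return the subset that matched, and that list can then be written to a new file one string per line.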

HackerShark · Accepted Answer · 2016-05-05T22:01:38.117

5

Something like this should work:

import os

#### your search strings ####
myarray = ["12482-2958", "02274+2482", "23381-3857"]

path = os.path.expanduser("path/to/myfile")
newpath = os.path.expanduser("path/to/myResultsFile")
filename = 'matches.data'

#### the with blocks close both files automatically ####
with open(os.path.join(newpath, filename), "w") as newf:
    #### loops through every element in the above list ####
    for element in myarray:
        #### lists the directory where all of your .dat files are ####
        for f in os.listdir(path):
            if f.strip().endswith(".dat"):
                #### text mode, so each line is a str and can be compared with element ####
                with open(os.path.join(path, f), 'r') as openfile:
                    #### loops through every line in the file comparing the strings ####
                    for line in openfile:
                        if element in line:
                            newf.write(line)

edited May 05 '16 at 22:01

answered May 02 '16 at 20:33

HackerShark


Awesome, going to try this asap. For the "myarray" part, if I have all of the strings listed in one file, should I just use numpy to import the array? – uhurulol May 02 '16 at 21:27
Yes that's definitely one way of doing it. If you want to try reading it directly from the file itself check out http://stackoverflow.com/a/12370456/6278685. I am not near my computer to test it out but that answer looked promising. – HackerShark May 02 '16 at 22:07
I used numpy to import my array, and this solution worked like a charm! Thanks so much, you've been super helpful. – uhurulol May 05 '16 at 17:55

No problem, glad I could help! – HackerShark May 05 '16 at 17:59
So I added two lines to this that completely broke it. In the first block I define ``newf = open("matches.data","w")`` and then directly after ``print line`` I have ``newf.write(line)`` so that I can record all of the results in a file and look back later. However, now the program loops over and over again on the first string and won't move to the next one. If I remove these two lines, it runs fine! Halp =( – uhurulol May 05 '16 at 20:36
I suppose to dumb this down a bit, I just wanna know how I can get the printed lines to write to a file so I can look at them later. – uhurulol May 05 '16 at 21:24
Edited my above answer to add that. You can also do other things to format the output, such as `newf.write(f)` (right above `openfile = open(os.path.join(path, f), 'rb')`), so that as it writes to the file you know which results came from which file. – HackerShark May 05 '16 at 21:58
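For the two follow-ups in these comments (loading the strings from a file, and recording which file each match came from), here is a rough sketch. It assumes Python 3, one search string per line in the list file, and .dat files that can be read as Latin-1 text. `load_strings` and `write_labeled_matches` are hypothetical helper names, not part of the answer above, and plain file reading stands in for numpy.

```python
import os


def load_strings(list_path):
    """Read one search string per line, skipping blank lines."""
    with open(list_path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_labeled_matches(strings, data_dir, out_path):
    """Write every matching line to out_path, prefixed with its source file name.

    Returns the number of lines written.
    """
    count = 0
    with open(out_path, "w") as out:
        for name in sorted(os.listdir(data_dir)):
            if not name.endswith(".dat"):
                continue
            # latin-1 is an assumption about the data; it accepts any byte value
            with open(os.path.join(data_dir, name), "r", encoding="latin-1") as data:
                for line in data:
                    if any(s in line for s in strings):
                        out.write(name + ": " + line.rstrip("\n") + "\n")
                        count += 1
    return count
```

Prefixing each line with the file name keeps the output greppable; writing the name once per file, as suggested in the comment, would work just as well.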

score 1 · Answer 2 · answered May 02 '16 at 20:01

Define a function that takes a path and a string and checks for a match.
You can use open(), find(), and close(). Then build all the paths in a for loop, check every string against each path with the function, and write to the output file when there is a match.

I haven't explained much here. Do you need more detail?
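
A minimal sketch of that idea, assuming Python 3. `file_contains` and `matched_strings` are hypothetical names. The function reads the whole file at once, which is fine for small files, and relies on `str.find()` returning -1 when there is no match. A `with` block replaces the explicit close().

```python
def file_contains(path, needle):
    """Return True if needle occurs anywhere in the file at path."""
    # latin-1 is an assumption; it decodes any byte sequence without errors
    with open(path, "r", encoding="latin-1") as handle:
        return handle.read().find(needle) != -1


def matched_strings(paths, strings):
    """Return the strings found in at least one of the given files."""
    return [s for s in strings if any(file_contains(p, s) for p in paths)]
```

This rereads each file once per string, so for many strings or large files a line-by-line loop that checks all strings per line would likely be faster.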

score 0 · Answer 3 · answered May 02 '16 at 20:08


Not so Pythonic, and there are probably a few things to straighten out, but this is pretty much the logic to follow:

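from glob import glob

strings = ['12482-2958', '02274+2482', '23381-3857']  # ... your strings
output = []
# the files in the question are named wds000.dat, wds005.dat, ...
for file in glob('wds*.dat'):
    # text mode, so each line is a str and can be compared with the search strings
    with open(file, 'r') as f:
        for line in f:
            for subs in strings:
                if subs in line:
                    output.append(line)
print(output)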

answered May 02 '16 at 20:08


A `find`, `grep -f`, `>>` combination in plain *nix shell would do the job easily too – May 02 '16 at 20:11
