I'm using os.walk to select files from a specific folder which match a regular expression.

for dirpath, dirs, files in os.walk(str(basedir)):
    files[:] = [f for f in files if re.match(regex, os.path.join(dirpath, f))]
    print dirpath, dirs, files

But this has to process all files and folders under basedir, which is quite time consuming. I'm looking for a way to use the same regular expression used for files to filter out unwanted directories in each step of the walk. Or a way to match only part of the regex...

For example, in a structure like

/data/2013/07/19/file.dat

using e.g. the following regular expression

/data/(?P<year>2013)/(?P<month>07)/(?P<day>19)/(?P<filename>.*\.dat)

I'd like to find all .dat files without having to descend into e.g. /data/2012.
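To make the example concrete, here is a minimal sketch of how that pattern matches the sample path and exposes the metadata through its named groups. The variable names (`regex`, `metadata`, `no_match`) are illustrative only.

```python
import re

# Pattern and sample path from the example above; names are illustrative.
regex = r"/data/(?P<year>2013)/(?P<month>07)/(?P<day>19)/(?P<filename>.*\.dat)"

match = re.match(regex, "/data/2013/07/19/file.dat")
# groupdict() returns the named groups as a plain dict of strings.
metadata = match.groupdict() if match else {}

# A path under /data/2012 already fails at the year component, so that
# whole subtree can never contribute a matching file.
no_match = re.match(regex, "/data/2012/07/19/file.dat")
```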

RogerFC
  • Did you have an actual question? – Martijn Pieters Jul 19 '13 at 13:14
  • And `'%s' % (p.basedir)` is just an inefficient way of saying `str(p.basedir)`, isn't it? – Martijn Pieters Jul 19 '13 at 13:14
  • Take a look at `os.path.join()` to build paths from parts. – Martijn Pieters Jul 19 '13 at 13:15
  • And `os.walk()` doesn't care what you do with `files`, so slice assigning is overkill here. – Martijn Pieters Jul 19 '13 at 13:16
  • `'^%s/%d'` is not a regular expression; unless you wanted to match the *literal* text `%s/%d`. I doubt that that is what you were trying to achieve. – Martijn Pieters Jul 19 '13 at 13:17
  • Sorry, committed mid-edit. It's complete now. – RogerFC Jul 19 '13 at 13:25
  • And what is your regular expression? You cannot match 'partial regular expressions', but we can see what we can do. I saw you already had a form of `dirs[:] = [...]` to filter out directories. – Martijn Pieters Jul 19 '13 at 13:28
  • There is nothing in that regular expression to indicate that 2012 should not be searched. – Martijn Pieters Jul 19 '13 at 13:35
  • @MartijnPieters true, changed that in the code – RogerFC Jul 19 '13 at 13:39
  • I removed the dirs[:] = [...] part as it was just a copy of a failed test. The idea was to find some function to filter out dirs in a similar way as files, but I did not manage to, so I removed that part not to put it as a requirement. – RogerFC Jul 19 '13 at 13:42
  • Regular expressions are really the wrong tool here; you cannot do a partial match. 'Padding out' the path would require you to generate all possible options your directories could ever want to cover, for example. I'd look for a *different* data structure to express what files you are looking for, or just put up with scanning all directories. – Martijn Pieters Jul 19 '13 at 13:45
  • I'm adding a feature to existing software, so there's only so much I can change. Looks like I'll be generating a set of partial regexes, then. Thanks! – RogerFC Jul 19 '13 at 14:03
  • Sounds like a job for `glob` instead, e.g. `for filename in glob.iglob('/data/2013/07/19/*.dat'):` but I'm not sure what the question is – Tommi Komulainen Jul 19 '13 at 16:26
  • In the actual code the regex is used to extract metadata from the filename structure and some subdir, so glob is not an option. I tried to extract the single problem from a complex code, and probably the question itself was not well defined. I'll try to reformulate. – RogerFC Jul 19 '13 at 19:05

2 Answers

If, for example, you only want files in /data/2013/07/19 to be processed, just start os.walk() with /data/2013/07/19 as the top directory. This is similar to Tommi Komulainen's suggestion, but you needn't modify the loop code.
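A minimal sketch of that idea, assuming the wanted year, month and day are known before the walk starts. The helper name `find_dat_files` and its parameters are hypothetical and not part of the original code.

```python
import os
import re

def find_dat_files(top, regex):
    """Walk only `top` and return the full paths that match `regex`.

    `top` is the narrowed starting directory; `regex` is still matched
    against the full path, so named groups keep working as before.
    """
    pattern = re.compile(regex)
    matches = []
    for dirpath, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(dirpath, name)
            if pattern.match(path):
                matches.append(path)
    return matches
```

With the example from the question this would be called as `find_dat_files("/data/2013/07/19", regex)`, so nothing under /data/2012 is ever listed.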

Armali

I stumbled upon this problem (it's pretty clear what the problem is, even if there's no actual question) so since no one answered I guess it might be useful even if quite late.

You need to split the original RE into segments, so you can filter intermediate directories inside the loop. Filter, and then match the files.

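import os
import re

# One sub-pattern per path component; assumes no "/" inside any component.
regex_parts = regex.strip("/").split("/")
dir_parts = regex_parts[:-1]  # the last part matches the file name

for base, dirs, files in os.walk(root):
    # Depth of `base`, counted in components of its absolute path.
    depth = len([p for p in os.path.abspath(base).split(os.sep) if p])
    if depth < len(dir_parts):
        # Prune in place so os.walk() skips non-matching subtrees.
        part = "(?:%s)$" % dir_parts[depth]
        dirs[:] = [d for d in dirs if re.match(part, d)]
        continue

    files[:] = [f for f in files if re.match(regex, os.path.join(base, f))]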

Since you are matching files (the last part of the path), there's no reason to run the full match until you have filtered out as many directories as possible. The length check is there so that directories which happen to match the last part (the file name pattern) don't get clobbered. This could probably be made more efficient, but it worked for me (I had a similar problem just today).
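As a self-contained sketch of the same idea, the function below derives the depth from the path relative to the walk root rather than from the iteration count, so sibling directories are always filtered against the right segment. The name `walk_matching` and the convention that the pattern is written relative to `root` are assumptions for illustration. It also assumes that no segment of the pattern contains a `/`, and it deliberately stops descending once all directory segments are used up.

```python
import os
import re

def walk_matching(root, rel_regex):
    """Yield (path, match) for files under `root` whose path relative to
    `root` matches `rel_regex`, pruning directories segment by segment."""
    parts = rel_regex.strip("/").split("/")
    # Anchor each directory segment so "2013" does not accept "20130".
    dir_parts = [re.compile("(?:%s)$" % p) for p in parts[:-1]]
    full = re.compile("(?:%s)$" % rel_regex.strip("/"))

    for base, dirs, files in os.walk(root):
        rel = os.path.relpath(base, root)
        depth = 0 if rel == os.curdir else len(rel.split(os.sep))
        if depth < len(dir_parts):
            # In-place assignment tells os.walk() which subtrees to skip.
            dirs[:] = [d for d in dirs if dir_parts[depth].match(d)]
            continue
        # All directory segments are consumed: do not descend any further.
        dirs[:] = []
        for name in files:
            if depth == 0:
                rel_path = name
            else:
                rel_path = rel.replace(os.sep, "/") + "/" + name
            match = full.match(rel_path)
            if match:
                yield os.path.join(base, name), match
```

For the layout in the question this could be used as `walk_matching("/data", r"(?P<year>2013)/(?P<month>07)/(?P<day>19)/(?P<filename>.*\.dat)")`; each yielded match object still carries the named groups for metadata extraction.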

NZP