77

I am trying to search for all files of a given type in a given folder and copy them to a new folder.

I need to specify a root folder and search through that folder and all of its subfolders for any files that match the given type.

How do I search the root folder's subfolders and their subfolders? It seems like a recursive method would work, but I cannot implement one correctly.

the Tin Man
ab217

4 Answers

124

Try this:

Dir.glob("#{folder}/**/*.pdf")

which is the same as

Dir["#{folder}/**/*.pdf"]

Where the folder variable is the path to the root folder you want to search through.
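The question also asks about copying the matches into a new folder. A minimal sketch of that step, assuming a flat destination directory is acceptable; `copy_files_of_type` and its parameters are hypothetical names, not part of the original answer:

```ruby
require 'fileutils'

# Hypothetical helper: copies every regular file under `root` whose name
# ends in ".<ext>" into the flat directory `dest`, and returns the list of
# source paths that were copied.
def copy_files_of_type(root, ext, dest)
  FileUtils.mkdir_p(dest)
  # "**" makes the pattern match in `root` itself and at any depth below it.
  pattern = File.join(root, '**', "*.#{ext}")
  matches = Dir.glob(pattern).select { |path| File.file?(path) }
  # Copying into a directory keeps each file's base name.
  matches.each { |path| FileUtils.cp(path, dest) }
  matches
end
```

Because the destination is flat, two files with the same name in different subfolders would overwrite each other. If that can happen in your tree, you would need to rebuild the relative path under `dest` instead.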

rogerdpack
  • The approach is right, but the implementation is wrong. It needs to be Dir.glob('**/*.pdf') – Jamison Dance Aug 17 '10 at 17:40
  • 3
    I think the OP wanted recursive, didn't they? – rogerdpack Jul 19 '12 at 19:43
  • The original answer (rogerdpack) worked for me, but Jergason's didn't, I'm afraid. – Joyce Mar 07 '13 at 23:00
  • 2
    @rogerdpack As far as I understand, this method is recursive. Answer should actually be `Dir.glob("#{folder}/**/*.pdf")`, where the `folder` variable is the path to the root folder you want to search through. – Automatico Oct 05 '13 at 07:22
  • This right here is indeed the right answer. e.g. I had the following use case and it worked: Dir.glob("#{ENGINE_ROOT}/lib/data/excel/**/*.{xls,XLS}") – Donato Apr 10 '15 at 21:09
  • 1
    Also case insensitive by default – leifg Sep 11 '15 at 06:22
  • 2
    @Konstantin This, or `Dir#[]`, are what I usually use. However, there is a catch: `Dir.glob` loads all of the paths into memory. This is usually fine, but if you have a great number of paths, one may prefer to use the Find module instead, since it delivers paths to the block as it finds them. – Wayne Conrad Jun 22 '16 at 23:15
  • 2
    I agree with @WayneConrad on this. You can inadvertently halt your program as Ruby allocates enough memory to store a big array. This is very similar to [slurping a file](https://stackoverflow.com/questions/25189262/why-is-slurping-a-file-not-a-good-practice). It's more efficient, and probably faster, to let `Find` process the hierarchy rather than throw it at the OS and potentially get an unexpected array. Debugging that situation is difficult. – the Tin Man Dec 06 '19 at 19:44
65

You want the Find module. Find.find takes a string containing a path and passes that starting path, followed by the path of every file and subdirectory beneath it, to an accompanying block. Some example code:

require 'find'

pdf_file_paths = []
Find.find('path/to/search') do |path|
  pdf_file_paths << path if path =~ /.*\.pdf$/
end

That recursively searches the given path and stores every path ending in .pdf in an array.
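Since the block sees directories as well as files, it can also filter out non-files and skip whole subtrees. A sketch of that idea, assuming you want a case-insensitive extension match and want to ignore hidden directories (both are illustrative choices); `each_file_of_type` is a hypothetical helper name:

```ruby
require 'find'

# Hypothetical helper: yields the path of each regular file under `root`
# whose extension matches `ext`, ignoring case. Hidden directories below
# the root are skipped entirely (an assumption made for illustration).
def each_file_of_type(root, ext)
  Find.find(root) do |path|
    if File.directory?(path)
      # Find.prune stops Find from descending into the current directory.
      Find.prune if path != root && File.basename(path).start_with?('.')
      next
    end
    yield path if File.extname(path).casecmp?(".#{ext}")
  end
end
```

Each match is handed to the caller as soon as it is found, so the caller can copy it right away (for example with `FileUtils.cp`) without first building an array of every path.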

AutonomousApps
Jamison Dance
28

If speed is a concern, prefer Dir.glob over Find.find.

Warming up --------------------------------------
           Find.find   124.000  i/100ms
            Dir.glob   515.000  i/100ms
Calculating -------------------------------------
           Find.find      1.242k (± 4.7%) i/s -      6.200k in   5.001398s
            Dir.glob      5.249k (± 4.5%) i/s -     26.265k in   5.014632s

Comparison:
            Dir.glob:     5248.5 i/s
           Find.find:     1242.4 i/s - 4.22x slower

 

require 'find'
require 'benchmark/ips'

dir = '.'

Benchmark.ips do |x|
  x.report 'Find.find' do
    Find.find(dir).select { |f| f =~ /\.pdf\z/ }
  end

  x.report 'Dir.glob' do
    Dir.glob("#{dir}/**/*.pdf")
  end

  x.compare!
end

Using ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-darwin15]

Dennis
  • 2
    Thank you for the post. It is very helpful for beginners like me to figure out which method I should use: `Dir.glob` or `Find.find`. – itsh Sep 14 '16 at 18:15
  • 5
    Find should be slower in this case because you are finding with a regex. Dir.glob on the other hand, is not as powerful as a regex so I would expect it to be faster. – hirowatari Aug 18 '17 at 20:39
  • I suppose you could use `#end_with?` to compare them a little more closely.... – rogerdpack Dec 06 '19 at 21:39
  • 1
    @hirowatari Regex or not makes no difference - you can replace the whole block content with `false` and it will still be significantly slower (give it a try). This is because calling a block also requires some time and happens for every item found, whereas `glob` filters internally and only returns once it is done collecting results. Therefore your filter used with `find` can be as complicated as you like, it could be 100 lines of code with lookups and multiple regexes, whereas `glob` just understands one simple pattern per call. If you can express your search that way, prefer `glob`. – Mecki Mar 21 '20 at 02:23
  • But then if you have to actually do something with these files you will need to be calling something for each of them. So depending on use case comparison may or may not be fair. Also for huge directory trees, one may not want to store the whole array in memory. So sometimes one would be better, another time the other. – akostadinov Aug 13 '20 at 19:51
13

As a small improvement to Jergason and Matt's answer above, here's how you can condense it to a single line:

pdf_file_paths = Find.find('path/to/search').select { |p| /.*\.pdf$/ =~ p }

This uses Find.find as above, but leverages the fact that, when called without a block, it returns an Enumerator. Because that is enumerable, we can call select on it and get back an array of the matching paths.
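Because the block-less call returns an Enumerator, it can also be made lazy so the traversal stops once enough matches have been found. A small sketch of that variation; `first_files_of_type` and its `limit` parameter are hypothetical names introduced for illustration:

```ruby
require 'find'

# Hypothetical helper: returns at most `limit` paths of regular files under
# `root` whose names end in ".<ext>". The lazy enumerator means Find stops
# walking the tree as soon as `limit` matches have been collected.
def first_files_of_type(root, ext, limit)
  Find.find(root)
      .lazy
      .select { |path| File.file?(path) && path.end_with?(".#{ext}") }
      .first(limit)
end
```

For example, `first_files_of_type('path/to/search', 'pdf', 10)` would return up to ten matching paths without necessarily visiting the whole tree.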

chrisdurheim