2

The directory has 20k folders in it. In these folders there are subfolders and some files. I don't need to look into the subfolders. I need to get all the files with .EIA extension from the folders.

I know I could use Get-Item or Get-ChildItem for this, but these cmdlets are too slow at getting the data. Also, this script has to run every hour, so it cannot take super long.
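For reference, the slow cmdlet-based approach being ruled out here would look something like the following sketch (`-Depth 1` limits recursion to the immediate subfolders, `-File` excludes directories from the output):

```powershell
# Slow baseline: Get-ChildItem returns the *.EIA files in the top-level
# folders without descending into their subfolders.
Get-ChildItem -Path '\\Sidney2\MfgLib\AidLibTest' -Filter '*.EIA' -Depth 1 -File
```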

I was trying to use [System.IO.File]::GetFiles($path), but this gives an error:

 Method invocation failed because [System.IO.File] does not contain a method named 'GetFile'

I have also tried

$pathEia = "\\Sidney2\MfgLib\AidLibTest\*\*.EIA"
[System.IO.File]::GetFiles($pathEia)

This also throws an error:

 Exception calling "GetFiles" with "1" argument(s): "The filename, directory name, or volume label
 syntax is incorrect. : '\\Sidney2\MfgLib\AidLibTest\*\*.EIA'"

I am using PowerShell Core 7.2 with .NET Framework 4.8. Any help is appreciated. Thanks in advance.

Brute
  • 33
  • 5
  • 2
    generally speaking, the fastest way to get the file names from a large dir tree _on windows_ is to use `robocopy`. [*grin*] you can use the options to have it present you with just the full path & file name ... and it is FAST. – Lee_Dailey May 20 '22 at 22:43
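The robocopy approach mentioned in the comment above could be sketched roughly as follows. Treat this as an untested sketch: the switches are standard robocopy options, and the destination is a throwaway path that is never written to because of `/l`:

```powershell
# /l                     : list only, copy nothing
# /s /lev:2              : recurse, but only one level below the root
# /njh /njs /ndl /nc /ns : suppress header, summary, directory lines, file class and size
# /fp                    : print full path names
robocopy '\\Sidney2\MfgLib\AidLibTest' 'C:\dummy' '*.EIA' /l /s /lev:2 /njh /njs /ndl /nc /ns /fp
```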

2 Answers

3

Very similar to mklement0's helpful answer but using the instance methods from DirectoryInfo.

EnumerationOptions is available starting from .NET Core 2.1. This class has the IgnoreInaccessible property set to $true by default; in prior versions, an exception would cause the enumeration to stop:

...skip files or directories when access is denied (for example, UnauthorizedAccessException or SecurityException).

This answer requires PowerShell Core 7+.

# Skip the following Attributes:
#   2.    Hidden
#   4.    System
#   1024. ReparsePoint
#   512.  SparseFile

$enum = [IO.EnumerationOptions]@{
    RecurseSubdirectories = $false # Set to `$true` if you need to do a recursive search
    AttributesToSkip      = 2, 4, 1024, 512
}

$start  = [IO.DirectoryInfo]::new('\\Sidney2\MfgLib\AidLibTest')
$result = foreach($dir in $start.EnumerateDirectories()) {
    $dir.GetFiles('*.EIA', $enum)
}
$result | Format-Table

If you need to do a recursive search of the subfolders (i.e., with RecurseSubdirectories = $true), you can consider using multi-threading with ForEach-Object -Parallel. Note that $using: is required here to bring $enum into the parallel script blocks' scope:

$start  = [IO.DirectoryInfo]::new('\\Sidney2\MfgLib\AidLibTest')
$result = $start.EnumerateDirectories() | ForEach-Object -Parallel {
    $_.GetFiles('*.EIA', $using:enum)
}
$result | Format-Table

It's important to note that a parallel loop may or may not have an edge over an efficient linear loop (such as foreach), as mklement0 notes in his comment:

Parallelism works best for different disks/shares/computers.

Santiago Squarzon
  • 41,465
  • 5
  • 14
  • 37
  • 2
    Using the `[System.IO.DirectoryInfo]` instance methods is indeed useful for accessing the properties of the `[System.IO.FileInfo]` instances returned, such as if you want the file _names_ only. Showing the pitfalls of enumeration is helpful, though probably not needed in this case. However, note that `ForEach-Object -Parallel` _slows things down_ when you're targeting a single folder: In my tests with 200 subdirs. containing 10 files of interest each, a regular, non-parallel `foreach` statement outperforms a `ForEach-Object -Parallel` by a factor of around 50(!). – mklement0 May 24 '22 at 16:15
  • 1
    @mklement0 that's interesting, could you share a gist with the code to reproduce? here are my findings, I see a better (tho minuscule) advantage using runspaces https://gist.github.com/santysq/98e69bb59286f2baebc56d346fdd37d1 I presume, the advantage will increase the more folders there are – Santiago Squarzon May 24 '22 at 16:51
  • 1
    See https://gist.github.com/mklement0/d25dff3f2a24ef9b24b80e647d776467 - note that it creates a subfolder `./tf` in the current folder and removes it again afterwards. My `Time-Command` function is downloaded on demand (I looked at the Benchpress module you pointed me to, which is more polished overall, but I prefer `Time-Command` for some additional work to level the playing field, such as running the garbage collector between tests). – mklement0 May 24 '22 at 17:11
  • 1
    @mklement0 agree that parallel is slower in this test, tho I don't see the point of running in parallel for 2000 files. the linear loop can enumerate the files the time it takes to create the runspaces in this case I believe – Santiago Squarzon May 24 '22 at 17:22
  • 1
    @Brute yes, `$result.Directory.Name` will give you the name of the parent folder of each file – Santiago Squarzon May 24 '22 at 19:55
  • 2
    Note that the premise of the question is 20,000 subfolders inside a single folder, with the files of interest inside those folders - no recursion needed. Just ran a test with this number of subfolders and 10 files in each; the slowdown with parallelism is about 25-fold(!) with the default throttle limit of 5. Based on your benchmarks, I conclude the following: When targeting a single disk, parallel threads can provide a speed benefit if each thread needs to perform deeply nested recursion; if not, they are likely to slow things down. Parallelism works best for different disks/shares/computers. – mklement0 May 24 '22 at 22:47
  • 1
    @mklement0 agreed, after testing using 20k subfolders without recursion it's clearly bad – Santiago Squarzon May 25 '22 at 00:30
2

Try the following:

$path = '\\Sidney2\MfgLib\AidLibTest'
$allFilePathsOfInterest =
  foreach ($dir in [System.IO.Directory]::GetDirectories($path)) {
    [System.IO.Directory]::GetFiles($dir, '*.EIA')
  }

Given that the input directory path is a full path, $allFilePathsOfInterest is an array of full file paths too.

If you want the file names only, use the instance methods of the [System.IO.DirectoryInfo] type instead of the static methods of the [System.IO.Directory] type, which allows you to access the .Name property of the [System.IO.FileInfo] instances being returned:

$path = '\\Sidney2\MfgLib\AidLibTest'
$allFileNamesOfInterest =
  foreach ($dir in [System.IO.DirectoryInfo]::new($path).GetDirectories()) {
    $dir.GetFiles('*.EIA').Name
  }
  • Note the two-step approach - get subdirectories first, then examine their files - because I'm not aware of a standard .NET API that would allow you to process wildcards across levels of the hierarchy (e.g., \\Sidney2\MfgLib\AidLibTest\*\*.EIA).

  • If you need more control over the enumeration of the files and directories, the GetDirectories and GetFiles methods offer overloads that accept a System.IO.EnumerationOptions instance, but, unfortunately, in PowerShell (Core) 7+ / .NET (Core) only:

    • Windows PowerShell / .NET Framework only offers overloads that accept a System.IO.SearchOption value, but the only thing it controls is whether the enumeration is recursive.
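For example, a Windows PowerShell / .NET Framework variant using SearchOption would look like this sketch. Because recursion here is all-or-nothing, AllDirectories also descends into the sub-subfolders that the question wants to skip, which is why the two-step approach above is preferable:

```powershell
# Windows PowerShell / .NET Framework: SearchOption only controls recursion.
$path = '\\Sidney2\MfgLib\AidLibTest'
[System.IO.Directory]::GetFiles($path, '*.EIA', [System.IO.SearchOption]::AllDirectories)
```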
mklement0
  • 382,024
  • 64
  • 607
  • 775