
I have a function that flattens directories in parallel for multiple folders. It works great when I call it in a non-pipeline fashion:

$Files = Get-Content $FileList
Merge-FlattenDirectory -InputPath $Files

But now I want to update my function to work both on the pipeline and when called directly. Someone on Discord recommended that the best way to do this is to defer all processing to the end block, and use the begin and process blocks to add pipeline input to a list. Basically this:

function Merge-FlattenDirectory {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory,Position = 0,ValueFromPipeline)]
        [string[]]
        $InputPath
    )

    begin {
        $List = [System.Collections.Generic.List[PSObject]]@()
    }

    process {
        if(($InputPath.GetType().BaseType.Name) -eq "Array"){
            Write-Host "Array detected"
            $List = $InputPath
        } else {
            $List.Add($InputPath)
        }
    }

    end {

        $List | ForEach-Object -Parallel {

            # Code here...
            
        } -ThrottleLimit 16
    }
}

However, this is still not working on the pipeline for me. When I do this:

$Files | Merge-FlattenDirectory

It actually passes individual arrays of length 1 to the function. So testing for ($InputPath.GetType().BaseType.Name) -eq "Array" isn't really the way forward, as only the first pipeline value gets used.
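A minimal repro of that binding behavior (Test-Binding is just a throwaway name for illustration): with ValueFromPipeline and a [string[]] parameter, each pipeline item is bound on its own process call and coerced into a one-element array.

function Test-Binding {
    param([Parameter(ValueFromPipeline)][string[]]$InputPath)
    process { '{0} item(s): {1}' -f $InputPath.Count, ($InputPath -join ', ') }
}

'a','b','c' | Test-Binding    # three process calls, each receiving a 1-element array
Test-Binding 'a','b','c'      # one process call receiving a 3-element array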

My million dollar question is the following:

What is the most robust way in the process block to differentiate between pipeline input and non-pipeline input? The function should add all pipeline input to a generic list, and in the case of non-pipeline input, should skip this step and process the collection as-is moving directly to the end block.

The only thing I could think of is the following:

if((($InputPath.GetType().BaseType.Name) -eq "Array") -and ($InputPath.Length -gt 1)){
    $List = $InputPath
} else {
    $List.Add($InputPath)
}

But this just doesn't feel right. Any help would be extremely appreciated.

fmotion1

2 Answers


You might just do

function Merge-FlattenDirectory {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory,Position = 0,ValueFromPipeline)]
        [string[]]
        $InputPath
    )
    begin {
        $List = [System.Collections.Generic.List[String]]::new()
    }
    process {
        $InputPath.ForEach{ $List.Add($_) }
    }
    end {
        $List |ForEach-Object -Parallel {
            # Code here...
        } -ThrottleLimit 16
    }
}

This will process the input values whether they come from the pipeline or directly from the -InputPath parameter.
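For example, both call patterns below end up with the same $List (a sketch, using $Files from the question):

$Files = Get-Content $FileList
$Files | Merge-FlattenDirectory            # one process call per path; each is added individually
Merge-FlattenDirectory -InputPath $Files   # a single process call; the .ForEach loop adds every path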

But that doesn't comply with the Strongly Encouraged Development Guidelines to Support Well Defined Pipeline Input (SC02), especially Implement for the Middle of a Pipeline.

This means that if you want to implement the PowerShell pipeline correctly, you should (parallel) process your items directly in the process block and immediately output any results from there:

function Merge-FlattenDirectory {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory,Position = 0,ValueFromPipeline)]
        [string[]]
        $InputPath
    )
    begin {
        # Note: New-ThreadPool and the -threadPool parameter below are not part of the
        # shipping ForEach-Object cmdlet; they refer to the feature proposed in
        # PowerShell issue #18054 (see the comments below).
        $SharedPool = New-ThreadPool -Limit 16
    }
    process {
        $InputPath |ForEach-Object -Parallel -threadPool $Using:SharedPool {
            # Process your current item ($_) here ...
        }
    }
}

In general, script authors are advised to use idiomatic PowerShell, which often comes down to fewer object manipulations and usually results in a correct PowerShell pipeline implementation with lower memory usage.

Please let me know if you intend to collect (and e.g. order) the output based on this suggestion.

Caveat

Invoking the full ForEach-Object -Parallel cmdlet for every item is itself somewhat inefficient, as you open and close a new pipeline on each iteration. Resolving this undercuts my general statement about idiomatic PowerShell a bit, but it can be done with a steppable pipeline.
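As a rough illustration of that per-iteration overhead (a sketch, not a rigorous benchmark; requires PowerShell 7+ and timings vary by machine):

Measure-Command { 1..200 | ForEach-Object { $_ | ForEach-Object -Parallel { $_ } } }   # a new -Parallel pipeline per item
Measure-Command { 1..200 | ForEach-Object -Parallel { $_ } }                           # one -Parallel pipeline for all items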

To implement this, you might use the ForEach-Object cmdlet as a template:

[System.Management.Automation.ProxyCommand]::Create((Get-Command ForEach-Object))

And set the ThrottleLimit of the thread pool in the begin block:

function Merge-FlattenDirectory {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory,Position = 0,ValueFromPipeline)]
        [string[]]
        $InputPath
    )
    begin {
        # Parameters for the wrapped ForEach-Object call
        $PSBoundParameters += @{
            ThrottleLimit = 4
            Parallel = {
                Write-Host (Get-Date).ToString('HH:mm:ss.s') 'Started' $_
                Start-Sleep -Seconds 3
                Write-Host (Get-Date).ToString('HH:mm:ss.s') 'finished' $_
            }
        }
        # Wrap ForEach-Object in a steppable pipeline so the cmdlet is invoked only once
        $wrappedCmd = $ExecutionContext.InvokeCommand.GetCommand('ForEach-Object', [System.Management.Automation.CommandTypes]::Cmdlet)
        $scriptCmd = {& $wrappedCmd @PSBoundParameters }
        $steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
        $steppablePipeline.Begin($PSCmdlet)
    }
    process {
        $InputPath.ForEach{ $steppablePipeline.Process($_) }
    }
    end {
        $steppablePipeline.End()
    }
}
1..5 |Merge-FlattenDirectory
17:57:40.40 Started 3
17:57:40.40 Started 2
17:57:40.40 Started 1
17:57:40.40 Started 4
17:57:43.43 finished 3
17:57:43.43 finished 1
17:57:43.43 finished 4
17:57:43.43 finished 2
17:57:43.43 Started 5
17:57:46.46 finished 5
iRon
  • Great. I didn't know you could directly process the pipeline in the process block and still maintain multi-threading capability. Thank you for that. For this function I didn't intend on collecting and ordering the results, but it would be nice to know how you would accomplish that. Thanks again! – fmotion1 Sep 08 '22 at 09:58
  • I just tried your second example with the `ForEach-Object` call in the process block, and it doesn't look like parallelism is being maintained. The pipeline is being fed sequentially and processing is happening in steps. I'm going to test execution time for both examples. – fmotion1 Sep 08 '22 at 10:15
  • OK. So I've tested and measured both approaches with the same dataset. With the processing and `ForEach-Object` call directly in the process block, it takes `15.70 seconds` to complete. With the processing deferred to the end block, it takes `7.58 seconds`. So there is an enormous performance benefit to collecting and then executing everything in the end block. I appreciate the appeal to follow best practices but if I'm leaving that much horsepower on the table I'd much rather go with option 2. Any ideas? – fmotion1 Sep 08 '22 at 10:23
  • As the PowerShell Pipeline is processing each next item as soon as it is passed on (to the `ForEach-Object -parallel` cmdlet), the parallelism should be maintained (I would add some `Start-Sleep` commands in the commands to prove that). But the invocation of the call itself *is inefficient* (a full invocation of the wrapped cmdlet in every iteration, in a nested pipeline); besides, there is a big **caveat**, see my updated answer and new request for a: [`#18054` Scope based `-ThrottleLimit`](https://github.com/PowerShell/PowerShell/issues/18054) – iRon Sep 08 '22 at 11:01
  • In your own link above about a steppable pipeline, @mklement0 details that deferring processing to the end block actually performs better than the proxy function. Look at the `Select-MatchCollect` function in the snippet he posted. It's more memory-intensive as you mention since it's inherently inefficient, but if you're looking for the best possible performance it's the option to choose from what I can tell. I'm going to compare execution times and I may go with the proxy function instead since it's more "proper" and memory efficient. Thanks for the links and all your help. – fmotion1 Sep 08 '22 at 12:46
  • I have done a new update again as the `-ThrottleLimit` parameter issue can be resolved with a `-threadPool` (thanks to [@Jhoneil](https://github.com/jhoneill) [comment](https://github.com/PowerShell/PowerShell/issues/18054#issuecomment-1240972548)). For the performance issue: as far as I can define it is caused by the overhead of call of the full cmdlet which should be less of an impact if the threads itself take more time as in my steppable pipeline example. – iRon Sep 08 '22 at 17:42

Here's how I would write it with comments where I have changed it.

function Merge-FlattenDirectory {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory,Position = 0,ValueFromPipeline)]
         $InputPath    # <this may be a string, a path object, a file object, 
                       # or an array
    )

    begin {
        $List = @()   # Use an array for less than 100K objects. 
    }

    process {
        #Even if InputPath is a single string, foreach will iterate once and set $p.
        #If it is an array of strings, add each one. If it is one or more objects,
        #try to find the right property for the path.
        foreach  ($p in $inputPath)  {
          if     ($p -is [String]) {$list += $p } 
          elseif ($p.Path)         {$list += $p.Path}
          elseif ($p.FullName)     {$list += $p.FullName}
          elseif ($p.PSPath)       {$list += $p.PSPath}
          else                     {Write-warning "$P makes no sense"} 
        } 

    }
    
    end {

        $List | ForEach-Object -Parallel {

            # Code here...
            
        } -ThrottleLimit 16
    }
}
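For example, all of these call patterns should feed the same list (the paths below are just placeholders):

Get-Content $FileList | Merge-FlattenDirectory                   # plain path strings
Get-ChildItem D:\Source -Directory | Merge-FlattenDirectory      # DirectoryInfo objects, picked up via FullName
Merge-FlattenDirectory -InputPath 'D:\Folder1', 'D:\Folder2'     # an array passed directly, no pipeline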

@iRon That "write for the middle of the pipeline" in the docs does not mean write everything in the process block.

function one { @(1,2,3,4,5) }

function two {
    param   ([parameter(ValueFromPipeline=$true)] $p  )
    begin   {Write-host "Two begins"      ; $a  = @() }
    process {Write-host "Two received $P" ; $a += $p  }
    end     {Write-host "Two ending"      ; $a; Write-host "Two ended"}       
}

function three {
    param   ([parameter(ValueFromPipeline=$true)] $p )
    begin   {Write-host "three Starts"    }
    process {Write-host "Three received $P" }
    end     {Write-host "Three ended"    }       
}
 one | two | three

One is treated as an end block. One, two and three all run their begins (one's is empty). One's output goes to the process block in two, which just collects the data. Two's end block starts after one's end block ends, and sends output. At this point three's process block gets input. After two's end block ends, three's end block runs.

Two is "in the middle": it has a process block to deal with multiple piped items (if it were all one 'end' block it would only process the last one).
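Tracing the example, the console output arrives in this order (reconstructed from the code above rather than captured from a run):

Two begins
three Starts
Two received 1
Two received 2
Two received 3
Two received 4
Two received 5
Two ending
Three received 1
Three received 2
Three received 3
Three received 4
Three received 5
Two ended
Three ended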

  • "Two's end block starts after one's end-block ends", no, Two's `end` block start after Two's `process` block has completed. – Santiago Squarzon Sep 08 '22 at 18:15
  • Run the code from the example, and you can see it better with two and three. You get the message "Two Ending", then "Three received 1/2/3/4/5", then "Two ended". So we go from three's `process` block back to two's `end` block. When that block finishes, the next line of code to run is "Three ended" from three's `end` block. We *don't* go from "Three received 5" straight to "Three ended"; I put the 2 Write-Hosts in two's `end` block to make that clear. – James O'Neill Sep 09 '22 at 13:47
  • This is fantastic and exactly what I needed. I really appreciate your approach. I will be using this as a template moving forward for my multi-threaded cmdlets. – fmotion1 Sep 09 '22 at 21:01
  • @JamesO'Neill why should I avoid an ArrayList or Generic List for under 100K objects? Is there that much overhead? I generally avoid `+=` when working with arrays as well. Another stupid question, where are you checking if the input is an array? I see the following conditionals: `($p -is [String])` `elseif ($p.Path)` `elseif ($p.FullName)` `elseif ($p.PSPath)` It's not immediately apparent to me what conditional is testing for the presence of an array. – fmotion1 Sep 09 '22 at 21:06
  • += gets slower as the size of what it is adding to gets bigger, so you get order n-squared time. If n is small (a few thousand) it really doesn't matter. If n is 100K, it matters quite a lot. Some people will use a string builder rather than += when there is a string made of 3 or 4 parts, or one of the other list / array types when there is a small array. I stick to the familiar / simple form when I'm expecting small numbers. It's easier when someone else needs to understand what I've done. (A quick comparison is sketched after these comments.) – James O'Neill Sep 10 '22 at 11:43
  • I don't explicitly check for an array. I do a foreach X in Y, so if Y is an array the loop runs once for each item, and if it is a single item it runs once. Then it asks: did I get a string, a file object, a path object, or something else with the right properties? – James O'Neill Sep 10 '22 at 11:46
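A quick way to see the += scaling James describes above (a sketch; exact timings vary and the gap only shows up at larger counts):

Measure-Command { $a = @(); foreach ($i in 1..50000) { $a += $i } }                                               # a new array is allocated on every +=
Measure-Command { $l = [System.Collections.Generic.List[int]]::new(); foreach ($i in 1..50000) { $l.Add($i) } }   # the list grows in place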