
I have a CSV file that contains approximately 250,000 rows. The first of the headers in this CSV file is "Campaign Ref Code". I also have a list of campaign reference codes that I need to look up in this first column (e.g. COLMABQ140, COLMABQ141). If I find a campaign ref code in the first column, I want to output the information for that campaign code (address, date contacted, etc.) into a separate CSV file. I have a number of these codes and need to produce a number of separate files, one per campaign code.

I have actually managed to get a working script for this, BUT it takes over an hour to run and I'd like to find a quicker way of processing this file. Basically, I create an array of the campaign codes, then loop through them using Import-Csv and export to a separate file using Export-Csv if I find a match (see a snippet of the code below; there are 20 different $refs arrays and corresponding foreach loops). Like I say, it does work perfectly well in that it gets the information I need in the format I specified, but there has to be a quicker way of doing things, even if the original file is a quarter of a million rows!

$file = 'Orig.csv'
$newfile = "File1.$today.csv"

$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')

foreach ($ref in $refs) {
    Import-Csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-Csv -notype -Path $workdir\$newfile -Append
}

$newfile = "File2.$today.csv"

$refs = @('COLMABP140','COLMABP141','COLMABP142','COLMABP143','COLMABP144','COLMABP176','COLMABP177','COLMABP178','COLMABP179','COLMABP180')

foreach ($ref in $refs) {
    Import-Csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-Csv -notype -Path $workdir\$newfile -Append
}

$newfile = "File3.$today.csv"

$refs = @('COLMABS140','COLMABS141','COLMABS142','COLMABS143','COLMABS144','COLMABS176','COLMABS177','COLMABS178','COLMABS179','COLMABS180')

foreach ($ref in $refs) {
    Import-Csv $file | Where-Object {$_.'Campaign Ref Code' -like "$ref"} | Export-Csv -notype -Path $workdir\$newfile -Append
}
            
DarrenB
  • I'm a little confused by your example code. As written, it would be generating identical output for File1, File2, and File3. What exactly is changing between executions? Are you always loading the same `$file` and just using different `$refs` to split the file up differently? – Bacon Bits Jan 03 '23 at 16:47

2 Answers


The performance issues with your script are:

  • You're iterating through the Orig.csv input file multiple times.
    It is faster to create a hashtable that maps each output file to the specific $refs values to test for.
  • You're reopening the output file multiple times to append to it.
    It is faster to use three steppable pipelines for this, see: What good does a SteppablePipeline.
    Note that a correctly set up PowerShell pipeline has low memory usage.
  • The -eq operator is probably slightly faster than the -like operator
    (and apparently you don't need the -like operator anyway, as there are no wildcards in your values).
    You can simply rely on the common comparison operator feature:

When the input of an operator is a scalar value, the operator returns a Boolean value. When the input is a collection, the operator returns the elements of the collection that match the right-hand value of the expression. If there are no matches in the collection, comparison operators return an empty array.
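
A quick illustration of that collection behavior (my example, not part of the original answer):

# Scalar left-hand side: -eq returns a Boolean
'COLMABQ140' -eq 'COLMABQ140'                # True

# Collection left-hand side: -eq returns the matching elements
'COLMABQ140','COLMABQ141' -eq 'COLMABQ141'   # COLMABQ141

# No match returns an empty array, which is falsy inside an if statement
if ('COLMABQ140','COLMABQ141' -eq 'COLMABQ999') { 'found' } else { 'not found' }   # not found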

$workdir = ...
$refs = @{
    "File1.csv" = 'COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180'
    "File2.csv" = 'COLMABP140','COLMABP141','COLMABP142','COLMABP143','COLMABP144','COLMABP176','COLMABP177','COLMABP178','COLMABP179','COLMABP180'
    "File3.csv" = 'COLMABS140','COLMABS141','COLMABS142','COLMABS143','COLMABS144','COLMABS176','COLMABS177','COLMABS178','COLMABS179','COLMABS180'
}

$Pipelines = @{}
Import-Csv .\Orig.csv | ForEach-Object -Begin {
    # Open one steppable Export-Csv pipeline per output file before any rows arrive
    foreach ($file in $refs.Keys) {
        $Pipelines[$file] = { Export-Csv -NoTypeInformation -Path "$workdir\$file" }.GetSteppablePipeline()
        $Pipelines[$file].Begin($True)  # $True: input will come from Process() calls
    }
} -Process {
    # Route each row to the pipeline whose $refs list contains its Campaign Ref Code
    # (collection -eq scalar returns the matching elements; non-empty counts as true)
    foreach ($file in $refs.Keys) {
        if ($refs[$file] -eq $_.'Campaign Ref Code') { $Pipelines[$file].Process($_) }
    }
} -End {
    # Close every pipeline, flushing and releasing the output files
    foreach ($file in $refs.Keys) {
        $Pipelines[$file].End()
    }
}

Mastering the (steppable) pipeline
The power of the PowerShell pipeline is often misunderstood and underestimated (especially by experienced programmers coming from languages that don't have a similar feature). Therefore, I have written a PowerShell community blog post: Mastering the (steppable) pipeline.
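
To make the mechanism concrete, here is a minimal, self-contained sketch (my illustration, not code from the answer; Demo.csv is a hypothetical path). A steppable pipeline lets you feed a cmdlet one object at a time while it keeps its output file open across calls:

# Wrap Export-Csv in a steppable pipeline so we control its begin/process/end phases
$sp = { Export-Csv -NoTypeInformation -Path .\Demo.csv }.GetSteppablePipeline()
$sp.Begin($True)                                   # $True: input will come from Process() calls
$sp.Process([pscustomobject]@{ Name = 'First' })   # writes the header and the first record
$sp.Process([pscustomobject]@{ Name = 'Second' })  # appends the second record; the file stays open
$sp.End()                                          # flushes and closes Demo.csv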

iRon
  • ha! that's insanely clever. loving this answer already – Santiago Squarzon Jan 03 '23 at 18:44
  • This is absolutely fantastic & EXACTLY what I was looking for. I ran the 2 separate scripts against the same file locally on my PC. My original way took one hour 13 minutes. Your way took...48 seconds!!! Thank you so much for all your help on this & for introducing me to concepts I wasn't even aware of (steppable pipelines). I'm quite new to PowerShell & am coming from a bash / shell background, so I'm still at the "looking for PowerShell equivalents of bash commands" stage of scripting. You've been absolutely invaluable. Thank you again. – DarrenB Jan 04 '23 at 10:22

I've posted a comment above asking for clarification because your example code is a bit confusing. At this point I'm going to assume that $refs changes multiple times, but that each day you run this you never need more than one $file, and that you're always filtering on the Campaign Ref Code column.

The first big issue is that you're parsing and enumerating the file once for each Campaign Ref Code, which is quite expensive. You can speed up execution considerably by grouping the CSV file into a hash table, which makes lookups very fast. You're also appending to the same CSV file repeatedly, which is fairly expensive as well.

# Load the file ONCE and group it by the Campaign Ref Code into a hashtable
$csv = Import-Csv $file | Group-Object -Property 'Campaign Ref Code' -AsHashTable

$newfile = "File1.$today.csv"
$refs = @('COLMABQ140','COLMABQ141','COLMABQ142','COLMABQ143','COLMABQ144','COLMABQ176','COLMABQ177','COLMABQ178','COLMABQ179','COLMABQ180')

$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile


$newfile = "File2.$today.csv"
$refs = @('COLMABP140','COLMABP141','COLMABP142','COLMABP143','COLMABP144','COLMABP176','COLMABP177','COLMABP178','COLMABP179','COLMABP180')

$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile


$newfile = "File3.$today.csv"
$refs = @('COLMABS140','COLMABS141','COLMABS142','COLMABS143','COLMABS144','COLMABS176','COLMABS177','COLMABS178','COLMABS179','COLMABS180')

$csv[$refs] | Export-CSV -NoTypeInformation -Path $workdir\$newfile

The real limitation of this method is that your $refs values need to be precisely correct: leading and trailing spaces, if there are any, must match exactly, and you can't use globbing patterns like you could with the -like operator. It doesn't look like you're relying on that, though, so it probably doesn't matter. This approach will also output the rows in a different order than your method does. If you need them in the order of the $refs variable, you'll need to change it to something like the code below, although I am not sure that is guaranteed to produce precisely the same order.

$csv[$refs] | 
    Sort-Object -Property @{e={ $refs.IndexOf($_.'Campaign Ref Code') }} |
    Export-CSV -NoTypeInformation -Path $workdir\$newfile

Note that the Group-Object command's performance improved in PowerShell v5.1 compared to prior versions, and it improved very significantly again in PowerShell v6+.
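
If Group-Object is still too slow on your version, hand-rolled grouping code is a common alternative (my illustrative sketch, not code from the answer; it reuses $file, $refs, $workdir, and $newfile as defined above):

# Enumerate the file once and build a 'Campaign Ref Code' -> rows lookup by hand
$byRef = @{}
foreach ($row in Import-Csv $file) {
    $key = $row.'Campaign Ref Code'
    if (-not $byRef.ContainsKey($key)) {
        $byRef[$key] = [System.Collections.Generic.List[object]]::new()
    }
    $byRef[$key].Add($row)
}

# Emit the rows for each wanted ref code (in $refs order), skipping codes that never occur
$refs |
    Where-Object { $byRef.ContainsKey($_) } |
    ForEach-Object { $byRef[$_] } |
    Export-Csv -NoTypeInformation -Path $workdir\$newfile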

Bacon Bits
  • group-object -ashashtable — I didn't know you could do that, even in 5.1 – js2010 Jan 03 '23 at 17:25
  • @js2010 It's existed for quite some time, but it has had major performance issues in the past. In my experience, if you're on PowerShell v5.1+ (which everyone should be at this point) then it's typically acceptable. Custom code to build the hashtable (or dictionary) can still perform better, and I do still use that in some of my scripts, but it's significantly more code to write. – Bacon Bits Jan 03 '23 at 17:41
  • Thank you very much for taking the time to look at this for me, Bacon Bits. Even though I've used iRon's solution (see above), I'll be investigating the points you raise, as the more I know, the better. – DarrenB Jan 04 '23 at 10:25