
I have a requirement to use PowerShell to process a large dataset contained in a CSV file. I need to read the CSV into memory and remove all duplicate rows.

The primary complication, beyond using PowerShell and working entirely in memory, is that I have to evaluate specific columns to identify duplicates, not entire rows.

In addition, I need to keep the oldest duplicate entry, based on a column that contains a first-observed date.

I have tried a few different things, such as Sort-Object with the -Unique switch.
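
That attempt looked roughly like this (an illustrative sketch, not the exact command):

# Sketch of the Sort-Object attempt: -Unique keeps one record per
# ComputerID/ComputerSerialID pair, but gives no control over which
# FirstObserved value survives.
Import-Csv .\UniqueSystems.csv |
    Sort-Object -Property ComputerID, ComputerSerialID -Unique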

The dataset within the CSV typically contains 1-5 million rows, and the columns look similar to:

"LastObserved","FirstObserved","ComputerName","ComputerID","Virtual","ComputerSerialID"
function Request-Dedupe ($data) {
    try {
        Write-Log -Message "Cycling through data to remove duplicates"

        $dedupe_data = @()
        $i = 0
        $n = 0
        # Sort by FirstObserved so the oldest record for each key comes first.
        foreach ($obj in $data | Sort-Object -Property FirstObserved) {
            if ($obj.ComputerSerialID -notin $dedupe_data.ComputerSerialID -and $obj.ComputerID -notin $dedupe_data.ComputerID) {
                $dedupe_data += $obj
                # Refresh LastObserved for records present in the current import.
                if ($current_data.ComputerID -contains $obj.ComputerID) {
                    $dedupe_data[$n].LastObserved = $current_time
                }
                $n++
            }
            Write-Progress -Activity "Cycling through data to remove duplicates and correlate first observed time" -Status "$i items processed" -PercentComplete ([double]$i / $data.Count * 100)
            $i++
        }

        Write-Log -Message "Dedupe Complete"
        return $dedupe_data
    } catch {
        Write-Log -Level Error "Unable to sort and dedupe data"
    }
}
$current_time = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ss")
$current_data = Import-Csv .\UniqueSystems.csv
$test = Request-Dedupe $current_data

My goal is to speed up the above, possibly by leveraging C#.

The expected output is the CSV with all duplicates removed, keeping the oldest "FirstObserved" date for each duplicate found.
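
For comparing alternatives, a simple timing harness around the call helps; this is just a sketch using the .NET Stopwatch class:

# Time the dedupe pass so alternative implementations can be compared.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$test = Request-Dedupe $current_data
$sw.Stop()
"Dedupe took {0:N1} seconds" -f $sw.Elapsed.TotalSeconds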

asked by BDubs

1 Answer

To improve performance, you should avoid appending to an array as well as doing lookups in arrays; both are slow operations.
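
As a sketch of the difference ($rows here stands in for the imported CSV records):

# Slow: += rebuilds the array on every append, and -notin scans it linearly.
$slow = @()
foreach ($row in $rows) {
    if ($row.ComputerID -notin $slow.ComputerID) { $slow += $row }
}

# Fast: a generic List appends in place; a HashSet answers lookups in O(1).
$list = [System.Collections.Generic.List[object]]::new()
$seen = [System.Collections.Generic.HashSet[string]]::new()
foreach ($row in $rows) {
    # HashSet.Add returns $false if the value was already present.
    if ($seen.Add([string]$row.ComputerID)) { $list.Add($row) }
}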

If I understand your question correctly, you want to keep, for each combination of "ComputerID" and "ComputerSerialID", the record with the oldest "FirstObserved" value. That can be achieved with a hashtable like this:

# Key each record by the two columns that define a duplicate. A hashtable
# lookup is O(1), unlike scanning an array with -notin.
$unique = @{}
Import-Csv .\UniqueSystems.csv | ForEach-Object {
    $key = '{0}/{1}' -f $_.ComputerID, $_.ComputerSerialID
    # Keep the record if the key is new, or if this record is older. The string
    # comparison assumes timestamps in a lexically sortable format (such as
    # yyyy-MM-ddTHH:mm:ss); otherwise cast both sides to [datetime] first.
    if (-not $unique.Contains($key) -or $unique[$key].FirstObserved -gt $_.FirstObserved) {
        $unique[$key] = $_
    }
}
$unique.Values
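
To write the result back out, pipe the values to Export-Csv (a sketch; the output file name is an assumption, and the sort is optional):

# Export the deduplicated records. ISO-style timestamps sort lexically,
# so sorting the string column orders them chronologically.
$unique.Values |
    Sort-Object FirstObserved |
    Export-Csv .\UniqueSystems.deduped.csv -NoTypeInformation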

answered by Ansgar Wiechers