I need to use PowerShell to deduplicate a large dataset stored in a CSV: read the file into memory and remove all the duplicate rows.
The complication, beyond using PowerShell and running things in memory, is that duplicates are identified by certain columns rather than by entire rows.
In addition, for each set of duplicates I need to keep the oldest entry, based on a column that contains a first-observed date.
I have tried a few different things, such as Sort-Object with the -Unique switch.
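For example, something along these lines, which does drop duplicates, but as far as I can tell gives no control over which duplicate survives, so the row with the oldest FirstObserved is not guaranteed to be the one kept:

$current_data | Sort-Object -Property ComputerSerialID, ComputerID -Unique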
The dataset within the CSV typically contains 1 to 5 million rows, and the columns look similar to:
"LastObserved","FirstObserved","ComputerName","ComputerID","Virtual","ComputerSerialID"
function Request-Dedupe ($data) {
    try {
        Write-Log -Message "Cycling through data to remove duplicates"
        $dedupe_data = @()
        $i = 0
        $n = 0
        # Oldest FirstObserved first, so the first occurrence kept is the oldest
        foreach ($obj in $data | Sort-Object -Property FirstObserved) {
            # A row is a duplicate if either its serial or its ID was already kept
            if ($obj.ComputerSerialID -notin $dedupe_data.ComputerSerialID -and $obj.ComputerID -notin $dedupe_data.ComputerID) {
                $dedupe_data += $obj
                # $dedupe_data[$n] is the row just added; refresh its LastObserved
                if ($current_data.ComputerID -contains $obj.ComputerID) {
                    $dedupe_data[$n].LastObserved = $current_time
                }
                $n++
            }
            Write-Progress -Activity "Cycling through data to remove duplicates and correlate first observed time" -Status "$i items processed" -PercentComplete ([Double]$i / $data.Count * 100)
            $i++
        }
        Write-Log -Message "Dedupe Complete"
        return $dedupe_data
    } catch {
        Write-Log -Level Error "Unable to sort and dedupe data"
    }
}
$current_time = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ss")
$current_data = Import-Csv .\UniqueSystems.csv
$test = Request-Dedupe $current_data
My goal is to speed up the above, maybe by leveraging C#.
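In the meantime I have been experimenting with a plain-PowerShell variant using .NET collections (a rough sketch, not a finished implementation: Request-DedupeFast is a placeholder name, it assumes PowerShell 5+ for the ::new() syntax, and it folds in the LastObserved update, since the -contains check in my snippet is always true when the function is called on $current_data itself):

function Request-DedupeFast ($data) {
    # HashSets give O(1) duplicate checks instead of -notin scans
    # against an ever-growing array
    $seen_serials = [System.Collections.Generic.HashSet[string]]::new()
    $seen_ids     = [System.Collections.Generic.HashSet[string]]::new()
    # A List avoids the full array copy that += performs on every add
    $dedupe_data  = [System.Collections.Generic.List[object]]::new()
    # Cast to [datetime] so rows sort chronologically rather than lexically
    foreach ($obj in $data | Sort-Object -Property { [datetime]$_.FirstObserved }) {
        # Same duplicate rule as above: skip the row if either key was already kept
        if (-not $seen_serials.Contains($obj.ComputerSerialID) -and
            -not $seen_ids.Contains($obj.ComputerID)) {
            [void]$seen_serials.Add($obj.ComputerSerialID)
            [void]$seen_ids.Add($obj.ComputerID)
            $obj.LastObserved = $current_time  # relies on the script-scoped $current_time, as above
            $dedupe_data.Add($obj)
        }
    }
    return $dedupe_data
}

If I understand the costs correctly, the -notin scans and the += array rebuild are each O(n) per row, which is what hurts at 1-5 million rows; the sets and the List make each row O(1). Dropping the per-row Write-Progress call should also help.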
The expected output is the same dataset with all duplicates removed, keeping the row with the oldest "FirstObserved" date for each duplicate found.
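With the made-up sample rows above, only the first row would survive, since its FirstObserved of 2017-01-02 is the older of the two.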