
Consider the file tbl.txt (1.5 million lines), built like:

Num1 ; Num2 ; 'Value' ; 'Attribute'

So tbl.txt looks like:

  63 ; 193 ; 'Green'  ; 'Color'
 152 ; 162 ; 'Tall'   ; 'Size'
 230 ; 164 ; '130lbs' ; 'Weight'
 249 ; 175 ; 'Green'  ; 'Color'      *duplicate on 'Value' and 'Attribute'*
 420 ; 178 ; '8'      ; 'Shoesize'
 438 ; 172 ; 'Tall'   ; 'Size'       *duplicate on 'Value' and 'Attribute'*

How can I keep the first unique line on 'Value' and 'Attribute' and delete the following lines that are duplicates on 'Value' and 'Attribute'?

The result should look like:

  63 ; 193 ; 'Green'  ; 'Color'
 152 ; 162 ; 'Tall'   ; 'Size'
 230 ; 164 ; '130lbs' ; 'Weight'
 420 ; 178 ; '8'      ; 'Shoesize'

Any help is much appreciated.

SamNorton
    What have you tried, and how has what you've tried failed? Ideally, you should provide a [MCVE] of what you've tried, and include specific information on how it failed, with error messages and/or erroneous output. SO is not a code-writing service; the best questions are those which provide useful information so that those who answer can guide you to devising your own correct answer. See [Ask]. – Jeff Zeitlin Oct 17 '17 at 11:20
  • Use the search box above and take a look at some of the existing questions about unique values, they should help point you in the right direction. Like this one: [Powershell - filtering for unique values](//stackoverflow.com/q/9825060) – henrycarteruk Oct 17 '17 at 11:26
  • Initially I was looking to modifying `cat tbl.txt | Get-Unique` accordingly but didn't find a solution. @James I can't see yet how I can turn `Foreach-Object { $_.Substring(0,2) } | Select-Object -unique` into a proper solution as the length of the line varies. – SamNorton Oct 17 '17 at 11:58
  • How can I modify `cat tbl.txt | Group { $_.Substring(10,15) } | select Name | sort-object -Property Name -Unique Select-Object -unique` So that the Value of Group is returned? – SamNorton Oct 17 '17 at 12:12
  • I'll point you in the direction of `Import-CSV`, with the Delimiter & Header params. – henrycarteruk Oct 17 '17 at 15:01

2 Answers


Assuming that your data file has no header row:

Import-CSV "C:\folder\data.txt" -Delimiter ";" -Header Num1,Num2,Value,Attribute | Sort-Object -Property Value, Attribute -Unique

This gives your desired rows (note that `Sort-Object -Unique` returns them sorted, not in their original file order):

Num1 Num2 Value     Attribute 
---- ---- -----     --------- 
230  164  '130lbs'  'Weight'
420  178  '8'       'Shoesize'
63   193  'Green'   'Color'
152  162  'Tall'    'Size'

You can pipe to Export-CSV to write the results back out:

Import-CSV "C:\folder\data.txt" -Delimiter ";" -Header Num1,Num2,Value,Attribute | Sort-Object -Property Value, Attribute -Unique | Export-CSV "C:\folder\data2.txt" -Delimiter ";" -NoTypeInformation
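`Sort-Object -Unique` changes the row order. If you need to keep the first occurrence in the original file order while deduplicating on both columns, a `Group-Object` sketch (using the same hypothetical file paths as above):

```powershell
# Group on both columns; Group-Object emits groups in the order their
# first member is encountered, so taking $_.Group[0] keeps the first
# occurrence of each Value/Attribute pair in file order.
Import-Csv "C:\folder\data.txt" -Delimiter ";" -Header Num1,Num2,Value,Attribute |
    Group-Object Value, Attribute |
    ForEach-Object { $_.Group[0] } |
    Export-Csv "C:\folder\data2.txt" -Delimiter ";" -NoTypeInformation
```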
henrycarteruk

Loop over the text file via Get-Content, extract the 'Value' and 'Attribute' columns through string operations, and then use a hashtable to check whether you have already processed a similar line; if not, output the line once. In code:

$map = @{}
Get-Content tbl.txt | ForEach-Object {
    # Key = everything after the second ';', i.e. the Value and Attribute columns
    $key = $_.Substring($_.IndexOf(';', $_.IndexOf(';') + 1) + 1)
    if (-not $map.ContainsKey($key)) { $_; $map[$key] = 1 }
}

Alternatively, as mentioned in the comments, you can use Group-Object with the same substring as the grouping criterion, and then take the first element of each group:

Get-Content tbl.txt |
    Group-Object { $_.Substring($_.IndexOf(';', $_.IndexOf(';') + 1) + 1) } |
    ForEach-Object { $_.Group[0] }
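One caveat: the substring key still contains the column-alignment padding, so two rows that differ only in spacing would not be detected as duplicates. A HashSet-based sketch that normalizes whitespace in the key (an assumption beyond the original answer, since the sample data is consistently padded):

```powershell
# Track keys already seen; collapse runs of whitespace so column
# padding doesn't affect duplicate detection.
$seen = [System.Collections.Generic.HashSet[string]]::new()
Get-Content tbl.txt | ForEach-Object {
    $key = $_.Substring($_.IndexOf(';', $_.IndexOf(';') + 1) + 1) -replace '\s+', ' '
    if ($seen.Add($key)) { $_ }   # HashSet.Add returns $true only for new keys
}
```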
davidhigh