
I have a huge CSV file with about 13M rows and about 50 columns (File #1). I have another file with about 11k rows that is a list of IP addresses (File #2); the IP address is also one of the 50 columns in the first file. How do I go about filtering File #1 so that the output contains only those rows whose IP address is found in File #2?

Here's what I've tried so far, but it's been running for 12 hours and counting:

$IP = Get-Content -Path C:\Documents\File2.txt

Import-Csv C:\Documents\File1.csv | Where-Object {$_.IP -eq $IP} | Export-csv -Path C:\Documents\File3.csv -NoTypeInformation
LucaS
  • `Get-Content` returns an array of strings. If you want to check whether `$_.IP` is a value contained in that array, use `-in` instead of `-eq`, or do `Where-Object { $IP -contains $_.IP }` (a corrected one-liner is sketched after these comments). With such huge files the code will take its time though. Maybe sorting the IP array from File2.txt helps speed it up a bit? – Theo Nov 02 '18 at 14:05
  • Could be worth taking a small sample of your files to ensure you get the logic working before working with that much data. Would be much quicker to determine if it is working or not. – Clayton Lewis Nov 02 '18 at 14:07
  • +1 to @Theo's comment. But also ... 13M x 11k = 143,000,000,000 comparisons. That seems like it's reaching "crashing PS" levels... Maybe try something like this: https://www.spjeff.com/2017/06/02/powershell-split-csv-in-1000-line-batches/ to break your csv into a few more manageable pieces. At the least you can run 13 x 1M-row files, then append them after the fact. I can make a write-up if you guys think it's a good idea. – Jacob Colvin Nov 02 '18 at 14:29
  • He's "streaming" the csv, so he won't have it all in memory. If he was using a foreach loop it would be different. Doesn't mean this way will be fast, though. :-) – Mike Shepard Nov 02 '18 at 14:34
  • I didn't realize that was how it worked @MikeShepard. I guess that's a major reason to keep everything in the pipeline rather than assigning it all the time? – Jacob Colvin Nov 02 '18 at 15:05
  • Absolutely. By piping the output, you don't have to materialize the entire collection and can process it a row at a time. – Mike Shepard Nov 02 '18 at 15:09
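
For reference, here is a minimal sketch of the corrected (but still slow) pipeline from Theo's comment, assuming the paths from the question and a column literally named IP. Note that `-eq` with an array on the right-hand side does not test membership; `-in` (PowerShell 3.0+) does:

$IP = Get-Content -Path C:\Documents\File2.txt

# -in tests whether the row's IP value is a member of the $IP array
Import-Csv C:\Documents\File1.csv |
    Where-Object { $_.IP -in $IP } |
    Export-Csv -Path C:\Documents\File3.csv -NoTypeInformation

This is still a linear scan of the 11k-entry array for every one of the 13M rows, which is why the answer below switches to a hashtable.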

1 Answer


You can make this script run much faster:

  • No need to use Import-Csv/Export-Csv. Reading/writing lines is sufficient and much faster.
  • `$_.IP -in $IP` is inefficient. Use a hashtable for lookups (it will be nearly instant).
  • Use .NET tools rather than built-in cmdlets.

Below is the script with the optimizations I mentioned. Check the commented lines before running:

$inFile  = "C:\stack\IpTables\Data.txt"
$IPfile  = "C:\stack\IpTables\IPs.txt"
$outFile = "C:\stack\IpTables\OutData.txt"
$ipIndexInData = 47 # zero-based index of the IP address column in your data file

# build a hashtable for IP lookups
$hash = @{}
[System.IO.File]::ReadAllLines($IPfile) | foreach { $hash.Add($_, $true) }
# if the IP values in your list are not unique, wrap $hash.Add() in try/catch

$fsIn  = [System.IO.StreamReader]$inFile
$fsOut = [System.IO.StreamWriter]$outFile

$fsOut.WriteLine($fsIn.ReadLine()) # writes the header row (column names) to the out file; comment it out if the first row is a data row

while (!$fsIn.EndOfStream) {

    $line = $fsIn.ReadLine()
    $row  = $line -split ","

    if ($hash[$row[$ipIndexInData].Trim('"')]) { # remove .Trim('"') if values in your data file are not quoted with "
        $fsOut.WriteLine($line)
    }
}

# close both streams so the output is flushed and the file handles are released
$fsIn.Close()
$fsOut.Close()
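
If the IP list can contain duplicates, one alternative to the try/catch mentioned above (a sketch, not part of the original answer; uses the PowerShell 5+ ::new() syntax) is a HashSet, which simply ignores duplicate values instead of throwing:

# assumes the same $IPfile variable as in the script above
$hash = [System.Collections.Generic.HashSet[string]]::new(
    [System.IO.File]::ReadAllLines($IPfile))
# the lookup inside the loop then becomes:
#   if ($hash.Contains($row[$ipIndexInData].Trim('"'))) { ... }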
Mike Twc
  • I would do the same: use System.IO to read the file line by line and put the stuff in a hashtable, since these are extremely fast with lookups. – bluuf Nov 02 '18 at 17:19
  • Another .net tool to consider to improve performance: [LINQ](https://stackoverflow.com/q/38360545/1701026) – iRon Nov 02 '18 at 17:19
  • This part `$row = $line -split ","` can break badly if the data in the csv has commas inside quoted fields, and we don't know what is in the rest of the (50+) columns. You need something like this: `$row = $line -split '\s*,\s*(?!(?<=(?:^|,)\s*"(?:[^"]|""|\\")*,\s*)(?:[^"]|""|\\")*"\s*(?:,|$))'`. This splits comma-separated lists of optionally quoted strings. It handles quoted delimiters `","` and escaped quotes `\"`. Whitespace inside quotes is preserved; outside it is eaten. – Theo Nov 03 '18 at 10:31
  • Another way of looking up values quickly could be to read the IP addresses into an ArrayList `$IPlist = [System.Collections.ArrayList](Get-Content "C:\Documents\File2.txt")` and then use the [ArrayList.BinarySearch Method](https://learn.microsoft.com/en-us/dotnet/api/system.collections.arraylist.binarysearch?view=netframework-4.7.2) to find the value. This requires the list of IP addresses to be **sorted**. Don't know if it can compete with using a hashtable (a sketch follows below). – Theo Nov 03 '18 at 10:41
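
A minimal sketch of that BinarySearch idea (an illustration, not from the thread; `$candidateIp` is a hypothetical value to test):

$IPlist = [System.Collections.ArrayList](Get-Content "C:\Documents\File2.txt")
$IPlist.Sort() # BinarySearch requires the list to be sorted

# BinarySearch returns a non-negative index when the value is found,
# and a negative number (the bitwise complement of the insertion point) otherwise
if ($IPlist.BinarySearch($candidateIp) -ge 0) {
    # $candidateIp is in the list
}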