
I have a huge CSV file with about 13M rows and about 50 columns (File #1). I have another file with about 11k rows that is a list of IP addresses (File #2); the IP address is also one of the 50 columns in the first file. How do I go about filtering File #1 so that the output contains only those rows whose IP address is found in File #2?

Here's what I've tried so far, but it's been running for 12 hours and counting:

$IP = Get-Content -Path C:\Documents\File2.txt

Import-Csv C:\Documents\File1.csv | Where-Object {$_.IP -eq $IP} | Export-csv -Path C:\Documents\File3.csv -NoTypeInformation
LucaS
  • `Get-Content` returns an array of strings. If you want to check whether `$_.IP` is a value contained in that array, use `-in` instead of `-eq`, or do `Where-Object { $IP -contains $_.IP }` (a corrected one-liner is sketched after these comments). With such huge files the code will take its time though. Maybe sorting the IP array from File2.txt helps speed it up a bit? – Theo Nov 02 '18 at 14:05
  • Could be worth taking a small sample of your files to ensure you get the logic working before working with that much data. Would be much quicker to determine if it is working or not. – Clayton Lewis Nov 02 '18 at 14:07
  • +1 to @Theo's comment. But also ... 13M x 11k = 143,000,000,000 comparisons. That seems like it's reaching "crashing PS" levels... Maybe try something like this: https://www.spjeff.com/2017/06/02/powershell-split-csv-in-1000-line-batches/ to break your csv into a few more manageable pieces. At the least you can run 13 x 1M-row files, then append them after the fact. I can make a write-up if you guys think it's a good idea. – Jacob Colvin Nov 02 '18 at 14:29
  • He's "streaming" the csv, so he won't have it all in memory. If he was using a foreach loop it would be different. Doesn't mean this way will be fast, though. :-) – Mike Shepard Nov 02 '18 at 14:34
  • I didn't realize that was how it worked @MikeShepard. I guess that's a major reason to keep everything in the pipeline rather than assigning it all the time? – Jacob Colvin Nov 02 '18 at 15:05
  • Absolutely. By piping the output, you don't have to materialize the entire collection and can process it a row at a time. – Mike Shepard Nov 02 '18 at 15:09
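
For reference, here is a minimal sketch of the corrected (but still slow) pipeline from Theo's comment, assuming the paths from the question and a column literally named IP. Note that `-eq` with an array on the right-hand side does not test membership; `-in` (PowerShell 3.0+) does:

$IP = Get-Content -Path C:\Documents\File2.txt

# -in tests whether the row's IP value is a member of the $IP array
Import-Csv C:\Documents\File1.csv |
    Where-Object { $_.IP -in $IP } |
    Export-Csv -Path C:\Documents\File3.csv -NoTypeInformation

This is still a linear scan of the 11k-entry array for every one of the 13M rows, which is why the answer below switches to a hashtable.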

1 Answer


You can make this script run much faster:

  • No need to use Import-Csv/Export-Csv. Reading/writing lines is sufficient and much faster.
  • `$_.IP -in $IP` is inefficient. Use a hashtable for lookups (it will be nearly instant).
  • Use .NET tools rather than built-in cmdlets.

Below is the script with the optimizations I mentioned. Check the commented lines before running:

$inFile  = "C:\stack\IpTables\Data.txt"
$IPfile  = "C:\stack\IpTables\IPs.txt"
$outFile = "C:\stack\IpTables\OutData.txt"
$ipIndexInData = 47 # zero-based index of the IP address column in your data file

# build a hashtable for IP lookups
$hash = @{}
[System.IO.File]::ReadAllLines($IPfile) | foreach { $hash.Add($_, $true) }
# if the IP values in your list are not unique, wrap $hash.Add() in try/catch

$fsIn  = [System.IO.StreamReader]$inFile
$fsOut = [System.IO.StreamWriter]$outFile

$fsOut.WriteLine($fsIn.ReadLine()) # writes the header row (column names) to the out file; comment it out if the first row is a data row

while (!$fsIn.EndOfStream) {

    $line = $fsIn.ReadLine()
    $row  = $line -split ","

    if ($hash[$row[$ipIndexInData].Trim('"')]) { # remove .Trim('"') if values in your data file are not quoted with "
        $fsOut.WriteLine($line)
    }
}

# close both streams so the output is flushed and the file handles are released
$fsIn.Close()
$fsOut.Close()
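
If the IP list can contain duplicates, one alternative to the try/catch mentioned above (a sketch, not part of the original answer; uses the PowerShell 5+ ::new() syntax) is a HashSet, which simply ignores duplicate values instead of throwing:

# assumes the same $IPfile variable as in the script above
$hash = [System.Collections.Generic.HashSet[string]]::new(
    [System.IO.File]::ReadAllLines($IPfile))
# the lookup inside the loop then becomes:
#   if ($hash.Contains($row[$ipIndexInData].Trim('"'))) { ... }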
Mike Twc
  • I would do the same: use System.IO to read the file line by line and put the stuff in a hashtable, since these are extremely fast with lookups. – bluuf Nov 02 '18 at 17:19
  • Another .net tool to consider to improve performance: [LINQ](https://stackoverflow.com/q/38360545/1701026) – iRon Nov 02 '18 at 17:19
  • This part `$row = $line -split ","` can break badly if the data in the csv has commas inside quoted fields, and we don't know what is in the rest of the (50+) columns. You need something like this: `$row = $line -split '\s*,\s*(?!(?<=(?:^|,)\s*"(?:[^"]|""|\\")*,\s*)(?:[^"]|""|\\")*"\s*(?:,|$))'`. This splits comma-separated lists of optionally quoted strings. It handles quoted delimiters `","` and escaped quotes `\"`. Whitespace inside quotes is preserved; outside it is eaten. – Theo Nov 03 '18 at 10:31
  • Another way of looking up values quickly could be to read the IP addresses into an ArrayList `$IPlist = [System.Collections.ArrayList](Get-Content "C:\Documents\File2.txt")` and then use the [ArrayList.BinarySearch Method](https://learn.microsoft.com/en-us/dotnet/api/system.collections.arraylist.binarysearch?view=netframework-4.7.2) to find the value. This requires the list of IP addresses to be **sorted**. Don't know if it can compete with using a hashtable (a sketch follows below). – Theo Nov 03 '18 at 10:41
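
A minimal sketch of that BinarySearch idea (an illustration, not from the thread; `$candidateIp` is a hypothetical value to test):

$IPlist = [System.Collections.ArrayList](Get-Content "C:\Documents\File2.txt")
$IPlist.Sort() # BinarySearch requires the list to be sorted

# BinarySearch returns a non-negative index when the value is found,
# and a negative number (the bitwise complement of the insertion point) otherwise
if ($IPlist.BinarySearch($candidateIp) -ge 0) {
    # $candidateIp is in the list
}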