
I have a script that goes through an HTTP access log, filters out some lines based on a regex pattern, and copies them into another file:

param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" | 
Select-String -Pattern $pattern | 
Add-Content "D:\webStatistics\log\filtered-$workingdate.log"

My logs can be quite large (up to 2GB), which takes up to 15 minutes to run. Is there anything I can do to improve the performance of the statement above?

Thank you for your thoughts!

Predrag Vasić
  • Related and useful: http://stackoverflow.com/questions/31674667/is-there-a-way-to-optimise-my-powershell-function-for-removing-pattern-matches-f – Zan Lynx Aug 14 '15 at 00:50

3 Answers


See if this isn't faster than your current solution:

param($workingdate=(get-date).ToString("yyMMdd"))
Get-Content "access-$workingdate.log" -ReadCount 2000 |
 foreach { $_ -match $pattern | 
  Add-Content "D:\webStatistics\log\filtered-$workingdate.log"
 }
mjolinor
  • This code made it at least five times as fast (perhaps -ReadCount 2000 helped). Thank you! – Predrag Vasić Aug 14 '15 at 14:28
  • I thought it would. You're gaining efficiency with both the -ReadCount and then by using -match as an array operator, essentially reading and then filtering 2000 records at a time, instead of 1. There may be some more to be had by optimizing the pattern matching, either by optimizing the regex or if possible switching to wildcard pattern matching and using -like. – mjolinor Aug 14 '15 at 14:44

You don't show your patterns, but I suspect they are a large part of the problem.

You will want to look for a new question here (I am sure it has been asked) or elsewhere for detailed advice on building fast regular expression patterns.

But I find the best advice is to anchor your patterns and avoid unbounded runs of arbitrary characters such as .*.

So instead of a pattern like path/.*/.*\.js, use one with a $ on the end to anchor it to the end of the string. That way the regex engine can tell immediately that index.html is not a match. Otherwise it has to do some rather complicated scans, with path/ and .js possibly showing up anywhere in the string. This example of course assumes the file name is at the end of the log line.

Anchors work well with start-of-line patterns as well. A pattern might look like ^[^"]*"GET /myfile". That has an unknown run length, but at least the engine knows it doesn't have to restart the search for more quotes after finding the first one. The [^"] character class allows the regex engine to stop because the pattern can't match after the first quote.

Zan Lynx
  • The script goes through a loop for nine different patterns, generating nine separate output files. Some patterns are as simple as `" /esa/ffd/"` (just looking for log entries that contain this string); others are a bit more complex `"( /(jsummit|islands|esa)/(agd11|esummit|gite|terre|susdv|dsd))"`. Essentially, all involve identifying log entries that belong to a specific division here, and since the site isn't perfectly hierarchically organised, we need some logical operators in regex. – Predrag Vasić Aug 14 '15 at 14:34

You could also try using streams to see if that speeds things up. Something like this might help, although I couldn't test it because, as mentioned above, I'm not sure what pattern you are using.

param($workingdate=(get-date).ToString("yyMMdd"))

$file = New-Object System.IO.StreamReader -Arg "access-$workingdate.log"
$stream = New-Object System.IO.StreamWriter -Arg "D:\webStatistics\log\filtered-$workingdate.log"

while (($line = $file.ReadLine()) -ne $null) {
    if ($line -match $pattern) {
        $stream.WriteLine($line)
    }
}
$file.Close()
$stream.Close()
campbell.rw