4

I have a setup that contains 7 million XML files, varying in size from a few KB to multiple MB; in total it's about 180 GB of XML. The job I need performed is to check each XML file for the string <ref> and, if the file does not contain it, move it out of the Chunk folder it currently sits in and into the Referenceless folder.

The script I have created works well enough, but it's extremely slow for my purposes. At its current rate of about 3 files per second, it's slated to finish analyzing all 7 million files in about 24 days. Is there anything I can change in my script to eke out more performance?

Also, to make matters even more complicated, I do not have the correct permissions on my server box to run .ps1 files, so the script needs to be something I can paste into the PowerShell console and run as one command. I would set the permissions myself if I had the authorization to.

# This script will iterate through the Chunk folders, removing pages that contain no 
# references and putting them into the Referenceless folder.

# Change this variable to start the program on a different chunk. This is the first   
# command to be run in Windows PowerShell. 
$chunknumber = 1
#This while loop is the second command to be run in Windows PowerShell. It will stop after completing Chunk 113.
while($chunknumber -le 113){
    # Jumps the terminal to the correct folder.
    cd C:\Wiki_Pages
    # Creates an index for the chunk being worked on.
    $items = Get-ChildItem -Path "Chunk_$chunknumber"
    echo "Chunk $chunknumber Indexed"
    # Jumps to the chunk folder.
    cd C:\Wiki_Pages\Chunk_$chunknumber
    # Loops through the index. Each entry is one of the pages.
    foreach ($page in $items){
        # Creates a variable holding the page's content.
        $content = Get-Content $page
        # If the page has a reference, then it's echoed.
        if($content | Select-String "<ref>" -quiet){echo "Referenced!"}
        # If the page doesn't have a reference, it's copied to Referenceless and then deleted.
        else{
            Copy-Item $page C:\Wiki_Pages\Referenceless -force
            Remove-Item $page -force
            echo "Moved to Referenceless!"
        }
    }
    # The chunk number is increased by one and the cycle continues.
    $chunknumber = $chunknumber + 1
}

I have very little knowledge of PowerShell; yesterday was the first time I had ever even opened the program.

user55397
  • Get-Content is a performance killer. Look into the .NET StreamReader; it will improve your performance dramatically. – Gisli Jul 01 '12 at 21:40

5 Answers

4

You will want to add the -ReadCount 0 argument to your Get-Content commands to speed them up (it helps tremendously). I learned this tip from this great article, which shows that running a foreach over a whole file's contents is faster than trying to parse it through a pipeline.
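For example, the per-file check inside the question's inner loop could become something like the following sketch (the $page variable and the <ref> string come from the question; .FullName is used so the check doesn't depend on the current directory):

$lines = Get-Content $page.FullName -ReadCount 0   # returns all lines at once as a single array
$hasRef = $false
foreach ($line in $lines) {
    # A plain wildcard comparison per line avoids extra cmdlet overhead for each file.
    if ($line -like '*<ref>*') { $hasRef = $true; break }
}
if ($hasRef) { echo "Referenced!" }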

Also, you can use Set-ExecutionPolicy Bypass -Scope Process in order to run scripts in your current PowerShell session without needing extra permissions!
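A quick usage sketch (the script file name here is only an example; the relaxed policy lasts only for the current session):

# Allow script execution for this process only, then run the saved script.
Set-ExecutionPolicy Bypass -Scope Process
.\Move-Referenceless.ps1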

Burkart
SpellingD
2

The PowerShell pipeline can be markedly slower than native system calls.

PowerShell: pipeline performance

In that article, a performance test is run on two equivalent commands, one executed in PowerShell and one in a classic Windows command prompt.

PS> grep [0-9] numbers.txt | wc -l > $null
CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"

Here's a sample of its output.

PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }

10 iterations

   30 ms  (   0 lines / ms)  grep in PS
   15 ms  (   1 lines / ms)  grep in cmd.exe

100 iterations

   28 ms  (   4 lines / ms)  grep in PS
   12 ms  (   8 lines / ms)  grep in cmd.exe

1000 iterations

  147 ms  (   7 lines / ms)  grep in PS
   11 ms  (  89 lines / ms)  grep in cmd.exe

10000 iterations

 1347 ms  (   7 lines / ms)  grep in PS
   13 ms  ( 786 lines / ms)  grep in cmd.exe

100000 iterations

13410 ms  (   7 lines / ms)  grep in PS
   22 ms  (4580 lines / ms)  grep in cmd.exe

EDIT: The original answer to this question mentioned pipeline performance along with some other suggestions. To keep this post succinct I've removed the other suggestions that didn't actually have anything to do with pipeline performance.
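For the question's task, one way to cut the pipeline out entirely is to let Select-String open each file itself instead of piping Get-Content output into it. A sketch (the $page variable and the <ref> pattern come from the question):

# Select-String reads the file directly; nothing travels down the pipeline.
# -SimpleMatch treats <ref> as a literal string rather than a regular expression.
if (Select-String -Path $page.FullName -Pattern '<ref>' -SimpleMatch -Quiet) {
    echo "Referenced!"
}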

Sean Glover
  • First, thanks a lot for answering. I couldn't get grep to run in my Powershell, it just says it's an unknown command. After looking around, apparently the alias might be where, but the information seems contradictory and I can't figure out how to get where to work. You're right that it was inefficient to create a variable and then pipe the variable over, but after testing it out there was nearly no difference in speed between making the variable and not making the variable. The pipeline performance is very interesting, I'll see if I can change my code to remove it. – user55397 Jun 30 '12 at 16:31
  • Yup, it was the pipeline process. On a control group of 100 files, the previous script took 8 seconds, and the pipeline-less script took ~600 milliseconds. – user55397 Jun 30 '12 at 16:39
  • Interesting. Try using the findstr native call. I've updated my answer. – Sean Glover Jun 30 '12 at 16:45
  • findstr took about 3 seconds to do 100 files; my non-pipeline method took about 1 second on the same files (they weren't the same files that I had previously used). Also, findstr did not manage to properly search through all of the files; for some reason it did not move 7 of the 100 files even though none of the files contained the string `<ref>`. To clarify, my method is: `if(Select-String $page -pattern "<ref>" -quiet){echo "Referenced!"}` – user55397 Jun 30 '12 at 17:23
  • Odd, maybe the `<ref>` contained a line break or other non-printable character? findstr supports regular expressions too, so you may want to try that: http://msdn.microsoft.com/en-us/subscriptions/cc755405%28v=ws.10%29.aspx . I'm not sure I understand your solution. From what you said in your last comment it appears that you are still using Select-String within the bounds of your PowerShell script? – Sean Glover Jun 30 '12 at 18:44
  • Yes, I am still using the Select-String method. What I found was that using the pipeline to feed Select-String a variable was extremely inefficient compared to just letting Select-String take the $page out of the array itself. In short, the PowerShell pipeline is really, really slow. – user55397 Jul 01 '12 at 20:39
0

Before you start optimizing, you need to determine exactly where you need to optimize. Are you I/O bound (how long it takes to read each file)? Memory bound (likely not)? CPU bound (time to search the content)?

You say these are XML files; have you tested reading the files into an XML object (instead of plain text), and locating the <ref> node via XPath? You would then have:

$content = [xml](Get-Content $page)
#If the page has a reference, then it's echoed.
if($content.SelectSingleNode("//ref")){echo "Referenced!"}

If you have CPU, memory & I/O resources to spare, you may see some improvement by searching multiple files in parallel. See this discussion on running several jobs in parallel. Obviously you can't run a large number simultaneously, but with some testing you can find the sweet spot (probably in the neighborhood of 3-5). Everything inside foreach ($page in $items){ would be the scriptblock for the job.
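A rough sketch of that idea, reusing the $items / $page variables and the Referenceless path from the question (the throttle of 4 concurrent jobs is just a starting point to tune):

$maxJobs = 4
foreach ($page in $items) {
    # Wait for a free job slot before starting the next file.
    while (@(Get-Job -State Running).Count -ge $maxJobs) {
        Start-Sleep -Milliseconds 100
    }
    Start-Job -ArgumentList $page.FullName -ScriptBlock {
        param($path)
        # Parse the file as XML and look for a <ref> node anywhere in the document.
        $content = [xml](Get-Content $path)
        if (-not $content.SelectSingleNode("//ref")) {
            Copy-Item $path C:\Wiki_Pages\Referenceless -Force
            Remove-Item $path -Force
        }
    }
}
# Tidy up once everything has been queued.
Get-Job | Wait-Job | Remove-Job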

alroc
  • But parsing the file as XML must be a lot slower than a simple "IndexOf". Anyway, you are right that it's necessary to specify what "optimize" means. – stej Jul 01 '12 at 20:13
  • I don't have any large XML files handy to try with, but one would need to test to see if that is the case. Parsing the XML may take longer, but the search for the ref node may be faster than the linear search that IndexOf would require. In some cases, it could be a net gain. – alroc Jul 02 '12 at 10:38
  • Just out of curiosity, I tested with a 48K XML file I have here and the XML method vs. content | select-string came out with very different timings. The first time I ran with the XML method, it took 30 ms. Subsequent executions (which include reading the file from a RAMdisk into an XML object) ran 11-19 ms. Using the select-string method (and again, reading from RAMdisk each time), 24 ms seems to be the floor, with some runs going as high as 46 ms. The node I'm looking for in both cases is at the end of the file. – alroc Jul 06 '12 at 19:43
0

I would experiment with parsing 5 files at once using the Start-Job cmdlet. There are many excellent articles on PowerShell Jobs. If for some reason that doesn't help, and you're experiencing I/O or actual resource bottlenecks, you could even use Start-Job and WinRM to spin up workers on other machines.
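A minimal sketch of that idea, assuming the $items collection from the question's script: the files are dealt out across 5 jobs, and each job scans its own share.

$groupCount = 5
$jobs = @()
for ($i = 0; $i -lt $groupCount; $i++) {
    # Job $i gets files $i, $i+5, $i+10, ... so the work is spread roughly evenly.
    $batch = @()
    for ($j = $i; $j -lt $items.Count; $j += $groupCount) { $batch += $items[$j].FullName }
    $jobs += Start-Job -ArgumentList (,$batch) -ScriptBlock {
        param($paths)
        foreach ($path in $paths) {
            # Move any file that does not contain the literal string <ref>.
            if (-not (Select-String -Path $path -Pattern '<ref>' -SimpleMatch -Quiet)) {
                Copy-Item $path C:\Wiki_Pages\Referenceless -Force
                Remove-Item $path -Force
            }
        }
    }
}
$jobs | Wait-Job | Remove-Job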

Chris N
0

If you load the XML into a variable with the XmlDocument.Load() method, it is also significantly faster than Get-Content.

Measure-Command {
    $xml = [xml]''
    $xml.Load($xmlFilePath)
}

Measure-Command {
    [xml]$xml = Get-Content $xmlFilePath -ReadCount 0
}

In my measurements it's at least 4 times faster.
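Combined with the XPath check suggested earlier, the per-file test in the question's loop might then look like this sketch ($page comes from the question's script):

$xml = New-Object System.Xml.XmlDocument
# Load() reads the file straight from disk, skipping Get-Content entirely.
$xml.Load($page.FullName)
if ($xml.SelectSingleNode("//ref")) {
    echo "Referenced!"
} else {
    Copy-Item $page.FullName C:\Wiki_Pages\Referenceless -Force
    Remove-Item $page.FullName -Force
}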

Bernard Moeskops