
I'm currently working on a PowerShell script that is going to be used in TeamCity as part of a build step. The script has to:

  • recursively check all files with a certain extension (.item) within a folder,
  • read the third line of each file (which contains a GUID) and check if there are any duplicates in these lines,
  • log the path of the file that contains the duplicate GUID and log the GUID itself,
  • make the TeamCity build fail if one or more duplicates are found

I am completely new to PowerShell, but so far I've put together something that does what I expect it to do:

Write-Host "Start checking for Unicorn serialization errors."

$files = get-childitem "%system.teamcity.build.workingDir%\Sitecore\serialization" -recurse -include *.item | where {! $_.PSIsContainer} | % { $_.FullName }
$arrayOfItemIds = @()
$NrOfFiles = $files.Length
[bool] $FoundDuplicates = 0

Write-Host "There are $NrOfFiles Unicorn item files to check."

foreach ($file in $files)
{
    $thirdLineOfFile = (Get-Content $file)[2 .. 2]

    if ($arrayOfItemIds -contains $thirdLineOfFile)
    {
        $FoundDuplicates = 1
        $itemId = $thirdLineOfFile.Split(":")[1].Trim()

        Write-Host "Duplicate item ID found!"
        Write-Host "Item file path: $file"
        Write-Host "Detected duplicate ID: $itemId"
        Write-Host "-------------"
        Write-Host ""
    }
    else
    {
        $arrayOfItemIds += $thirdLineOfFile
    }
}

if ($foundDuplicates)
{
    "##teamcity[buildStatus status='FAILURE' text='One or more duplicate ID's were detected in Sitecore serialised items. Check the build log to see which files and ID's are involved.']"
    exit 1
}

Write-Host "End script checking for Unicorn serialization errors."

The problem is: it's very slow! The folder this script has to check currently contains over 14,000 .item files, and that number will very likely keep growing. I understand that opening and reading so many files is an expensive operation, but I didn't expect it to take roughly half an hour to complete. That is far too long, because it would lengthen every (snapshot) build by half an hour, which is unacceptable. I had hoped the script would complete in a couple of minutes at most.

I can't believe there isn't a faster approach to this, so any help in this area is greatly appreciated!

Solution

Well, I have to say that all three answers I received helped me out here. I started by using the .NET Framework classes directly, and then also used a dictionary to solve the growing-array problem. My own script took about 30 minutes to run; switching to the .NET Framework classes brought that down to just 2 minutes, and adding the Dictionary brought it down to just 6 or 7 seconds! The final script I use:

Write-Host "Start checking for Unicorn serialization errors."

[String[]] $allFilePaths = [System.IO.Directory]::GetFiles("%system.teamcity.build.workingDir%\Sitecore\serialization", "*.item", "AllDirectories")
$IdsProcessed = New-Object 'system.collections.generic.dictionary[string,string]'
[bool] $FoundDuplicates = 0
$NrOfFiles = $allFilePaths.Length

Write-Host "There are $NrOfFiles Unicorn item files to check."
Write-Host ""

foreach ($filePath in $allFilePaths)
{
    [System.IO.StreamReader] $sr = [System.IO.File]::OpenText($filePath)
    $unused1 = $sr.ReadLine() #read the first unused line
    $unused2 = $sr.ReadLine() #read the second unused line
    [string]$thirdLineOfFile = $sr.ReadLine()
    $sr.Close()

    if ($IdsProcessed.ContainsKey($thirdLineOfFile))
    {
        $FoundDuplicates = 1
        $itemId = $thirdLineOfFile.Split(":")[1].Trim()
        $otherFileWithSameId = $IdsProcessed[$thirdLineOfFile]

        Write-Host "---------------"
        Write-Host "Duplicate item ID found!"
        Write-Host "Detected duplicate ID: $itemId"
        Write-Host "Item file path 1: $filePath"
        Write-Host "Item file path 2: $otherFileWithSameId"
        Write-Host "---------------"
        Write-Host ""
    }
    else
    {
        $IdsProcessed.Add($thirdLineOfFile, $filePath)
    }
}

if ($foundDuplicates)
{
    "##teamcity[buildStatus status='FAILURE' text='One or more duplicate ID|'s were detected in Sitecore serialised items. Check the build log to see which files and ID|'s are involved.']"
    exit 1
}

Write-Host "End script checking for Unicorn serialization errors. No duplicate ID's were found."

So thanks to all!

Niles11
  • You are probably file IO-bandwidth limited here, so most of the time is (probably) being spent dragging the files in off the disk. If that is the case (and a back-of-the-envelope calculation should be able to confirm it), the easiest way to speed it up would be to move to faster storage, like an SSD. – Mike Wise Mar 26 '16 at 14:56
  • @Mike Wise: It could have been, although half an hour for even copying 14,000 files would be excessive on any modern machine. I think file IO is not an issue at all in this case. Since his code is "part of a build step", all files will have been dealt with or created just before his code kicks in and will therefore very likely still be in the disk cache. – Martin Maat Mar 26 '16 at 19:01

3 Answers


Try replacing Get-Content with [System.IO.File]::ReadLines. If that is still too slow, consider using System.IO.StreamReader - it requires a bit more code but lets you stop after reading just the first three lines.
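
For example, a minimal sketch of the ReadLines variant, reusing the $file loop variable from the script in the question (untested, so treat it as a starting point rather than a drop-in):

# ReadLines enumerates the file lazily, so -First 1 stops reading right after the third line.
$thirdLineOfFile = [System.IO.File]::ReadLines($file) | Select-Object -Skip 2 -First 1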

DAXaholic

It isn't clear exactly what PowerShell does under the hood when you use high-level commands like Get-ChildItem and Get-Content, so I would be more explicit about it and use the .NET Framework classes directly.

Get the paths of the files in your folder using

[String[]] $files = [System.IO.Directory]::GetFiles($folderPath, "*.yourext")

Then, rather than using Get-Content, open each file and read the first three lines. Like so:

[System.IO.StreamReader] $sr = [System.IO.File]::OpenText($path)
[String]$line = $sr.ReadLine()
while ($line -ne $null)
{
  # do your thing, break when you know enough
  # ...
  [String]$line = $sr.ReadLine()
}
$sr.Close()

I may have made a mistake or two; I am too lazy to get up and test this on a PC.

And you may want to consider redesigning your build system to use fewer files. 14,000 files and growing seems unnecessary. If you can consolidate some of the data into fewer files, that may also help performance a lot.

For the duplicate GUID check, use a Dictionary<Guid, String>, with the string being your file name. Then you can report where the duplicates are if you find any.
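
A rough sketch of that bookkeeping in PowerShell, assuming $thirdLineOfFile and $filePath are the values read in the loop (illustrative and untested; string keys are used here because the script compares the whole third line rather than a parsed Guid):

# Keys are the GUID lines, values the first file in which each GUID was seen.
$idToFile = New-Object 'System.Collections.Generic.Dictionary[string,string]'

# Inside the loop over the files, after reading the third line into $thirdLineOfFile:
if ($idToFile.ContainsKey($thirdLineOfFile))
{
    Write-Host "Duplicate ID in '$filePath' (first seen in '$($idToFile[$thirdLineOfFile])')"
}
else
{
    $idToFile.Add($thirdLineOfFile, $filePath)
}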

Martin Maat
  • We are using this script on a Sitecore project. When you develop for Sitecore you mainly write C# code, but some parts of the 'development' have to be done in the CMS (i.e. in a database). Because of this, our database is automatically serialized by a tool named Unicorn. It stores each 'item' in the CMS as a serialized file on disk. That way, all database development is stored in Git as well and is easy to distribute to other developers. It works pretty well, but the downside is the huge number of files... there is no other way of doing this, so redesigning is not an option. – Niles11 Mar 26 '16 at 19:46

I think your problem might be caused by your array, and is probably not a file-read problem.

  1. The size of an array in PowerShell is immutable, so every time you add an item to the array, it creates a new array and copies all the items.

  2. Your array will usually NOT contain the value you are looking up, so PowerShell will have to compare $thirdLineOfFile to every item in a growing array.

I have been using .NET Dictionaries to solve this problem (or ArrayLists when I am not doing a lot of lookups). See the MSDN Dictionary reference.

Note: PowerShell provides a cmdlet called Measure-Command that you can use to determine which part of your script is actually running slowly. I would time both the file reads and the array growth and lookups; depending on the size of the files, you may actually have performance issues there too.
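
Something along these lines could split the measurement (illustrative only; the sample size and variable names are made up for the example):

# Time the file reads and the array growth/lookups separately on a sample of files.
$sample = $files | Select-Object -First 1000

$readTime = Measure-Command {
    foreach ($file in $sample) { $null = (Get-Content -Path $file -TotalCount 3)[2] }
}

$arrayTime = Measure-Command {
    $a = @()
    foreach ($file in $sample) { if ($a -notcontains $file) { $a += $file } }
}

Write-Host "Reading $($sample.Count) files took $($readTime.TotalSeconds) seconds."
Write-Host "Growing and searching the array took $($arrayTime.TotalSeconds) seconds."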

Here is your code adapted to use a .NET Dictionary. I renamed your variable, since it is not an array anymore.

Write-Host "Start checking for Unicorn serialization errors."

$files = get-childitem "%system.teamcity.build.workingDir%\Sitecore\serialization" -recurse -include *.item | where {! $_.PSIsContainer} | % { $_.FullName }
#$arrayOfItemIds = @()
$IdsProcessed = New-Object 'system.collections.generic.dictionary[string,string]' # A .Net Dictionary will be faster for inserts and lookups.
$NrOfFiles = $files.Length
[bool] $FoundDuplicates = 0

Write-Host "There are $NrOfFiles Unicorn item files to check."

foreach ($file in $files)
{
    $thirdLineOfFile = (Get-Content -path $file -TotalCount 3)[2] # TotalCount param will let us pull in just the beginning of the file.

    #if ($arrayOfItemIds -contains $thirdLineOfFile)
    if($IdsProcessed.ContainsKey($thirdLineOfFile))
    {
        $FoundDuplicates = 1
        $itemId = $thirdLineOfFile.Split(":")[1].Trim()

        Write-Host "Duplicate item ID found!"
        Write-Host "Item file path: $file"
        Write-Host "Detected duplicate ID: $itemId"
        Write-Host "-------------"
        Write-Host ""
    }
    else
    {
        #$arrayOfItemIds += $thirdLineOfFile
        $IdsProcessed.Add($thirdLineOfFile,$null) 
    }
}

if ($foundDuplicates)
{
    "##teamcity[buildStatus status='FAILURE' text='One or more duplicate ID's were detected in Sitecore serialised items. Check the build log to see which files and ID's are involved.']"
    exit 1
}

Write-Host "End script checking for Unicorn serialization errors."
John Hubert