
I have the code below to convert an LDIF file (over 100,000 lines) to a CSV file (over 4,000 lines), but I'm not sure I'm happy with the time it takes; then again, I don't know how long it should take, so maybe that's a normal time on my laptop (Core i5 7th gen, 16 GB RAM, SSD)?

Is there any room for improvement? (Especially in the parsing, if possible, which takes about 30 seconds.)

# Reducing & editing data to process:
# -----------------------------------
$original = Get-Content $IN_ldif_file
$reduced = (($original | Select-String -Pattern '^cust[A-Z]', '^$' -CaseSensitive).Line) -replace ':: ', ': ' -replace '^cust', ''
"Writing reduced LDIF file..." # < 1 sec
(Measure-Command { Set-Content $reducedLDIF -Value $reduced -Encoding UTF8 }).TotalSeconds
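
Side note on the read itself, in case it ever becomes a bottleneck: Get-Content emits lines one at a time by default, which carries per-object overhead. Its -ReadCount parameter can avoid that; a sketch (untested on this data), where -ReadCount 0 returns all lines as a single array in one operation:

$original = Get-Content $IN_ldif_file -ReadCount 0   # whole file as one array, one read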

# Parsing the relevant data:
# --------------------------
$inData = New-Object -TypeName System.IO.StreamReader -ArgumentList $reducedLDIF
$a = @{}                # initialize the temporary hash
$lineNum = $rcdNum = 0  # initialize the counters
"Parsing reduced LDIF file..." # 27-36 sec
(Measure-Command {
    # Begin reading and processing the input file:
    $results = while (-not $inData.EndOfStream)
    {
        $line = $inData.ReadLine()
        Write-Verbose "$("{0:D4}" -f ++$lineNum)|$("{0:D4}|" -f $rcdNum)$line"

        if (($line -match "^\s*$") -or $inData.EndOfStream)
        {
            # blank line or end of stream - dump the hash as an object and reinit the hash
            [PSCustomObject]$a
            $a = @{}
            $rcdNum++
        } else {
            # build up hash table for the object
            # (split on the first ': ' only, so values that contain ': ' stay intact)
            $key, $value = $line -split ': ', 2
            $a[$key] = $value
        }
    }
    $inData.Close()
}).TotalSeconds

# Populating & writing the CSV file:
# ----------------------------------
"Populating the CSV data..." # 7-11 sec
(Measure-Command { 
    $out = $results |
        select  "Attribute01",
                "Attribute02",
                "Attribute03",
                <# etc... #>
                @{n="Attribute39"; E={$_."Attribute20"}}, # Attribute39 (not in LDIF) takes value of Attribute20
                "Attribute40"
}).TotalSeconds

"Writing CSV file..." # < 1 sec
(Measure-Command { $out | Export-CSV $OUT_csv_file -NoTypeInformation }).TotalSeconds

Note: I don't actually need to export the $reduced data to a file (e.g. $reducedLDIF), but the parsing code I found seems to require a file.
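
A sketch of one way around that (untested): System.IO.StringReader exposes the same ReadLine() method as StreamReader but reads from an in-memory string, so $reduced could be joined and wrapped directly. StringReader has no EndOfStream property, so the $null returned by ReadLine() at the end of the data signals completion instead:

# Wrap the in-memory array in a reader with the same per-line API as StreamReader:
$inData = [System.IO.StringReader]::new($reduced -join "`n")
while ($null -ne ($line = $inData.ReadLine())) {
    # ... same per-line processing as in the parsing loop above ...
}
$inData.Dispose()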

Thanks!

Chris
  • `select-string` can read a file without the need to do `get-content` first. Just use its `-Path` parameter and pass in the file name (see the sketch after these comments). – AdminOfThings Mar 24 '20 at 13:07
  • why match on `^$` in your `select-string` when you want to remove empty lines later? – AdminOfThings Mar 24 '20 at 13:18
  • Curious: Is the LDIF file the actual target of the underlying work or are you trying to ultimately have a report of particular bits of data out of AD or a different LDAP database? If so, there are almost certainly easier ways to get this information natively out of the database than parsing and converting an LDIF file. – thepip3r Mar 24 '20 at 13:57
  • Thanks AdminOfThings. The direct `select-string` simplifies the code indeed but I see no change to performance. I need a match on `^$` because I want the LDIF file contents to keep at least one blank line between each entry (otherwise the parsing gets quite complicated). – Chris Mar 24 '20 at 14:03
  • @thepip3r: My goal is to extract the data from an OpenLDAP server on Linux and format it into a specific CSV format. This is done fine with the current code... I just feel it's still slow - or I may have unrealistic expectations given the amount of data and my hardware doing the job. – Chris Mar 24 '20 at 14:06
  • You're using a StreamReader, not emitting erroneous messages to the console, and not loading large quantities of info into memory -- all good things you're not doing that have big impacts on performance. You can use the DirectoryEntry and DirectorySearcher classes in .NET via PowerShell to pull this information directly though. If end-to-end performance of the final report is what you're after, it might be worth looking at. Don't forget to include the additional time of running LDIFDE or CSVDE or whatever tool you're using to generate the LDIF in the first place in the final comparison. ;) – thepip3r Mar 24 '20 at 14:12
  • @thepip3r: That's what I was thinking... I was wondering if an equivalent of 'StreamReader' exists that could read a variable instead of a file (so I could work on the `$reduced` variable directly instead of the `$reducedLDIF` file). The DirectoryServices classes would require that I have an LDAP netflow from the machine running this script, which is not the case. The LDIF is generated on the server (using slapcat) then the file is copied to the machine running the script. – Chris Mar 24 '20 at 14:38
  • @Chris, the DirectorySearcher class returns an IEnumerable collection which has special meaning. This means that you can defer execution until a specified time for optimization. Using LINQ, you can pare down the collection before converting it to a structure where you will incur the load of data in RAM. Even so, because it's all in RAM and not reading from disk (even considering your SSD), your performance gains could be significantly higher. I don't understand 'LDAP netflow'. Are you saying you can't query TCP 389 from your machine to the target LDAP server? – thepip3r Mar 24 '20 at 15:17
  • @thepip3r, that's right, I cannot query TCP 389 from my machine to the target LDAP server. I feel your suggestions would require significant work/changes in the code... so I'm not sure I'll look into it yet, but good to know; thanks. – Chris Mar 24 '20 at 15:46
  • @Chris, yes they would... it would be a completely different direction, but you seemed most concerned about performance so I wanted to offer a different route--which turned out to be irrelevant if you can't query LDAP directly from your machine, so disregard. ;) – thepip3r Mar 24 '20 at 15:50
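
For reference, a sketch of the Select-String -Path simplification from the first comment (same patterns as in the question; per Chris's follow-up it shortens the code but doesn't change the timing):

# -Path lets Select-String open the file itself, so no separate Get-Content pass is needed.
$reduced = ((Select-String -Path $IN_ldif_file -Pattern '^cust[A-Z]', '^$' -CaseSensitive).Line) -replace ':: ', ': ' -replace '^cust', ''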

1 Answer


So I found a way to cut the parsing time by almost half, by re-using the data in the $reduced variable that's already in memory:

$a = @{}                # initialize the temporary hash
$lineNum = $rcdNum = 0  # initialize the counters
"Parsing reduced LDIF file..."
(Measure-Command {
    $results = ForEach ($line in $reduced) {
        Write-Verbose "$("{0:D6}" -f ++$lineNum)|$("{0:D4}|" -f $rcdNum)$line"
        if ($line -match "^\s*$")
        {   # blank line - dump the hash as an object and reinit the hash
            [PSCustomObject]$a
            $a = @{}
            $rcdNum++
        }
        else {
            # build up hash table for the object
            # (split on the first ': ' only, so values that contain ': ' stay intact)
            $key, $value = $line -split ': ', 2
            $a[$key] = $value
        }
    }
    # flush the last record in case the input doesn't end with a blank line
    if ($a.Count) { $results = @($results) + [PSCustomObject]$a; $rcdNum++ }
}).TotalSeconds

This is already more acceptable (about 16 sec instead of 30).
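
If more speed were ever needed, one further idea (untested here) is PowerShell's switch statement with -Regex, which dispatches each line against the case patterns without an explicit -match per iteration and is often faster than a plain ForEach for this kind of line handling. A sketch using the same record-building logic:

$a = @{}
$results = switch -Regex ($reduced) {
    '^\s*$' {
        # blank line: emit the finished record and start a new one
        [PSCustomObject]$a
        $a = @{}
    }
    default {
        # attribute line: split on the first ': ' and store it in the record
        $key, $value = $_ -split ': ', 2
        $a[$key] = $value
    }
}

The same end-of-input caveat applies here: flush $a after the switch if the data doesn't end with a blank line.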

Chris