This is an alternative solution that uses a streaming approach based solely on XmlReader and XmlWriter. In contrast to my first solution, the size of the input file is not limited by the amount of available RAM: while my first solution reads the whole input file into an XmlDocument in memory, this one only keeps as many log entries in memory as are needed for the output file.
It is also much faster than the first solution, because it doesn't incur the overhead of creating a DOM. A 63 MB log file with 100k entries took about 1.5 seconds to process using the current solution, while it took more than 6 minutes(!) using my first solution.
A disadvantage is that the code is considerably longer than in my first solution.
$inputPath = "$PWD\log.xml"
$outputPath = "$PWD\log_new.xml"
# Maximum size of the output file (the actual file can be slightly larger,
# because we only count the size of the log entries).
$maxByteCount = 4KB
$writerSettings = [Xml.XmlWriterSettings] @{
    Encoding = [Text.Encoding]::Unicode   # UTF-16, as in the input document
    # Replace with this line to encode in UTF-8 instead:
    # Encoding = [Text.Encoding]::UTF8
    Indent = $true
    IndentChars = ' ' * 4   # should match the indentation of the input document
    ConformanceLevel = [Xml.ConformanceLevel]::Auto
}
$entrySeparator = "`n" + $writerSettings.IndentChars
$totalByteCount = 0
$queue = [Collections.Generic.Queue[object]]::new()
$reader = $writer = $null
try {
    # Open the input file.
    $reader = [Xml.XmlReader]::Create( $inputPath )

    # Create or overwrite the output file.
    $writer = [Xml.XmlWriter]::Create( $outputPath, $writerSettings )
    $writer.WriteStartDocument()  # write the XML declaration

    # Copy the document root element and its attributes without recursing into child elements.
    $null = $reader.MoveToContent()
    $writer.WriteStartElement( $reader.Name )
    $writer.WriteAttributes( $reader, $false )

    # Loop over the nodes of the input file.
    while( $reader.Read() ) {

        # Skip everything that is not an XML element.
        if( $reader.NodeType -ne [Xml.XmlNodeType]::Element ) {
            continue
        }

        # Read the XML of the current element and its children. Note that
        # ReadOuterXml() also advances the reader past the current element.
        $xmlStr = $reader.ReadOuterXml()

        # Calculate how many bytes the current element takes when written to file.
        $byteCount = $writerSettings.Encoding.GetByteCount( $xmlStr + $entrySeparator )

        # Append the XML string and byte count to the end of the queue.
        $queue.Enqueue( [PSCustomObject]@{
            xmlStr    = $xmlStr
            byteCount = $byteCount
        })
        $totalByteCount += $byteCount

        # Remove entries from the beginning of the queue to ensure the maximum size is not exceeded.
        while( $totalByteCount -ge $maxByteCount ) {
            $totalByteCount -= $queue.Dequeue().byteCount
        }
    }

    # Write the remaining log entries, whose total size is below the maximum, to the output file.
    foreach( $entry in $queue ) {
        $writer.WriteString( $entrySeparator )
        $writer.WriteRaw( $entry.xmlStr )
    }

    # Finish the document.
    $writer.WriteString("`n")
    $writer.WriteEndElement()
    $writer.WriteEndDocument()
}
finally {
    # Close the input and output files.
    if( $writer ) { $writer.Dispose() }
    if( $reader ) { $reader.Dispose() }
}
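If you want to reproduce the timing comparison, a simple way is to wrap the script in Measure-Command. This is just a sketch; it assumes the code above has been saved as Trim-XmlLog.ps1 next to log.xml (the file name is only an example):

# Time the script and check the size of the output file afterwards.
Measure-Command { .\Trim-XmlLog.ps1 } | Select-Object TotalSeconds

# The output file should be close to $maxByteCount, possibly slightly larger
# due to the XML declaration and the root element.
(Get-Item .\log_new.xml).Length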
The algorithm basically works like this:
- Create a queue of custom objects that store, for each log entry, its XML string and its size in bytes.
- For each log entry of the input file:
    - Read the XML of the log entry and calculate its size in bytes as it will appear on disk, i.e. applying the output encoding. Append this data to the end of the queue.
    - If necessary, remove log entries from the beginning of the queue to ensure the desired maximum size in bytes is not exceeded (a standalone sketch of this logic follows after this list).
- Write the log entries remaining in the queue to the output file.
- For simplicity, only the size of the log entries is considered, so the actual output file could be slightly larger, due to the XML declaration and the document root element.
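To make the sliding-window behaviour of the queue easier to follow, here is a minimal, self-contained sketch of just the trimming logic. The byte counts are made up for illustration and are not taken from a real log file:

# Minimal sketch of the sliding-window queue logic, using made-up byte counts.
$maxByteCount   = 10
$totalByteCount = 0
$queue = [Collections.Generic.Queue[object]]::new()

foreach( $byteCount in 4, 3, 5, 2 ) {
    $queue.Enqueue( [PSCustomObject]@{ byteCount = $byteCount } )
    $totalByteCount += $byteCount

    # Trim entries from the front until the total size is below the maximum again.
    while( $totalByteCount -ge $maxByteCount ) {
        $totalByteCount -= $queue.Dequeue().byteCount
    }
}

$queue.byteCount   # -> 5, 2 (only the most recent entries that still fit)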