I am trying to export files from a data dump and I am in dire need of some assistance. All of the files that I am trying to export are in pdf, doc, xlsx, jpg, or png format. Because of how the data dump was assembled, the files were renamed to f0.pdf, f0.doc, etc. In addition, the files are spread across different subfolders (e.g. Data\000\004\0000001212). Within a subfolder, any file is accompanied by an m.xml file (for reference please see pic here). The m.xml file is important because it contains the original filename in the "LDDOCUMENTNAME" field:

ex: <TextVar length="255" field="LDDOCUMENTNAME">ABC.pdf</TextVar>

I attempted to rename and export the files using PowerShell; however, some of the pdf files did not go through (I counted all the pdf files in the subfolders and compared that to the number of exported pdf files).

This is what my script looks like:

$fsoFiles = Get-ChildItem -Path C:\Files -Filter *m.xml* -Recurse
ForEach($fsoFile in $fsoFiles)
{
    $docM = Select-String $fsoFile -Pattern "LDDOCUMENTNAME"
    $txtNewFile = $docM.Line.Substring(0,($docM.Line.Length-10))
    $txtNewFile = $txtNewFile.Split(">")[-1]
    $txtExtension = $txtNewFile.Split(".")[-1]
    $txtOldFile = ([string]$fsoFile.Directory+"\"+"f0."+$txtExtension)
    Copy-Item $txtOldFile C:\Extracted\$txtNewFile
}

Essentially, I asked PowerShell to search through all the subfolders and pick out only the folders that contain an m.xml file. PowerShell is then supposed to rename the corresponding file back to its original filename using the value found in the "LDDOCUMENTNAME" field.

When I run my script I am presented with a bunch of these error messages:

You cannot call a method on a null-valued expression.
    At line:6 char:5
    +     $txtNewFile = $docM.Line.Substring(0,($docM.Line.Length-10))
    +     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
        + FullyQualifiedErrorId : InvokeMethodOnNull

I'm assuming this is the reason why PowerShell could not export some of the pdf files? Maybe the "LDDOCUMENTNAME" field in the corresponding m.xml files is blank?

I tried adding an If statement inside my ForEach loop to get the location of the files that could not be exported, but I was met with the same error messages:

    If ($docM = $null)

     {
        Get-ChildItem -Path C:\Files -include !$docM -Recurse -Force -Name C:\Extracted\listofPaths.txt

        }

    else

Does anybody here know of a way to accomplish this? I am literally pulling my hair out. Any help would be much appreciated. Thanks!

V Y
  • Why don't you just parse the XML as an XML document? No need to split-substring modifications? – vonPryz Mar 29 '17 at 05:03
  • Thanks for replying! This is actually my first time working with PowerShell. Could you point me in the right direction as to how I can accomplish that? Thanks! – V Y Mar 29 '17 at 05:15
  • Right direction would be googling... Anyway, [SO](http://stackoverflow.com/a/11344234) already has a nice answer. If you have problems with the implementation, please provide a valid XML document instead of a fragment. – vonPryz Mar 29 '17 at 05:34
  • Here is one of the xml files. https://pastebin.com/VcbVu4rg Thanks! – V Y Mar 29 '17 at 05:45

1 Answer

As the XML file is not trivial, it should not be processed as text. Load it as an XML document and use XPath to pick the relevant nodes. Like so,

# XML is a first-class citizen in PowerShell
[xml]$doc = Get-Content c:\path\to\doc.xml
# Select all the TextVar nodes that have attribute field='LDDOCUMENTNAME'
$nl = $doc.SelectNodes("//TextVar[@field='LDDOCUMENTNAME']")
# Did we find exactly one?
if ($nl.Count -eq 1) {
    # Do something with the element's text data.
    # Renaming the data file would happen here; for now,
    # print the result for further review
    Write-Host $nl[0].InnerText
}
# Todo: handle no elements found case
# Todo: handle multiple elements found case
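
To tie this back to the loop in the question, here is a rough sketch of how the XML lookup could replace the Substring/Split logic and collect the folders that cannot be exported. It is an illustration under assumptions, not a tested solution for the actual dump: the function names Get-OriginalName and Export-RenamedFiles are made up for this example, the XPath assumes the TextVar element is not inside an XML namespace, and the data file is assumed to always be named f0 plus the extension of the original name, as described in the question.

```powershell
# Sketch only. Get-OriginalName and Export-RenamedFiles are hypothetical names;
# the XPath assumes the TextVar element is not in an XML namespace.
function Get-OriginalName {
    param([Parameter(Mandatory)][string]$XmlPath)

    $doc = New-Object System.Xml.XmlDocument
    $doc.Load($XmlPath)   # throws if the file is not well-formed XML
    $node = $doc.SelectSingleNode("//TextVar[@field='LDDOCUMENTNAME']")

    # No matching element, or an empty one: report "no name" as $null
    if ($null -eq $node -or [string]::IsNullOrWhiteSpace($node.InnerText)) {
        return $null
    }
    return $node.InnerText.Trim()
}

function Export-RenamedFiles {
    param(
        [Parameter(Mandatory)][string]$Source,
        [Parameter(Mandatory)][string]$Target
    )

    foreach ($xmlFile in Get-ChildItem -Path $Source -Filter 'm.xml' -Recurse -File) {
        $name = Get-OriginalName -XmlPath $xmlFile.FullName
        if ($null -eq $name) {
            $xmlFile.FullName   # emit the path so the caller can log it
            continue
        }

        # The data file is assumed to be f0 plus the original extension, e.g. f0.pdf
        $dataFile = Join-Path $xmlFile.DirectoryName ('f0' + [System.IO.Path]::GetExtension($name))
        if (Test-Path -LiteralPath $dataFile) {
            Copy-Item -LiteralPath $dataFile -Destination (Join-Path $Target $name)
        }
        else {
            $xmlFile.FullName   # m.xml names a file that is not in this folder
        }
    }
}
```

With the paths from the question, it could be called as `Export-RenamedFiles -Source C:\Files -Target C:\Extracted | Set-Content C:\Extracted\listofPaths.txt`, which writes the path of every m.xml that was skipped. Two remarks on the original script: Select-String returns nothing when no line matches, so `$docM.Line` is null and the Substring call fails with the error shown; and `If ($docM = $null)` assigns instead of comparing, so the test should be written as `$null -eq $docM`. Also keep in mind that if two documents share the same original name, the later copy may overwrite the earlier one, which would lower the exported count as well.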
vonPryz