I am trying to pull an HTML table from a webpage using PowerShell, but I'm having trouble getting at the table itself. There are two tables on the page, one for input and another for output. Ideally I would like to check whether the output table contains anything (apart from a specific string that indicates no results) and, if it does, put the information from that table into a file.

I've tried using the ParsedHtml property of Invoke-WebRequest, but the tables don't have specific element names or IDs, nor do they have 'class' or 'title' attributes to differentiate the two. Using the .IHTMLDocument2_all property did show several COM objects (in the format TypeName: System.__ComObject#{3050f539-98b5-11cf-bb82-00aa00bdce0b}) that I feel I need to somehow call in order to get what I need, but I can't figure out how to do so.

Is there a way to call those COMObjects, so I can pull the information from inside of them?

Here is the HTML for the table I am trying to pull results from (when there are no results):

<Center>
<TABLE CELLSPACING=0 CELLPADDING=0 BORDER=2><TR><TD>
<TABLE  CELLSPACING=0 CELLPADDING=2 BORDER=0>
<TR><TD BGCOLOR=3399FF ALIGN=CENTER><NOBR><FONT FACE="Arial" SIZE=+1><B>&nbsp;&nbsp; Search Results &nbsp;&nbsp;</B></FONT></NOBR></TD></TR>
<TR><TD><TABLE WIDTH=100% CELLSPACING=0 CELLPADDING=2 BORDER=0>
    <Center>
    <table width="100%" cellpadding="5" cellspacing="0">

        <tr>
            <td>No assets were found for the search</td>
        </tr>
</TABLE></TD></TR>
</TABLE></TD></TR>
</TABLE>
</Center>

When there are results, there are several headers under which the results are displayed, in this code:

<Center>
<TABLE CELLSPACING=0 CELLPADDING=0 BORDER=2><TR><TD>
<TABLE  CELLSPACING=0 CELLPADDING=2 BORDER=0>
<TR><TD BGCOLOR=3399FF ALIGN=CENTER><NOBR><FONT FACE="Arial" SIZE=+1><B>&nbsp;&nbsp; Search Results &nbsp;&nbsp;</B></FONT></NOBR></TD></TR>
<TR><TD><TABLE WIDTH=100% CELLSPACING=0 CELLPADDING=2 BORDER=0>
    <Center>
    <table width="100%" cellpadding="5" cellspacing="0">

        <tr bgcolor=A9A9A9>

        <th>HEADER1</th>
        <th>HEADER2</th>
        <th>HEADER3</th>
        <th>HEADER4</th>
        <th>HEADER5</th>
        <th>HEADER6</th>
        <th>HEADER7</th>
        <th>HEADER8</th>
        <th>HEADER9</th>
        <th>HEADER10</th>
        <th>HEADER11</th>
        <th>HEADER12</th>
        <th>HEADER13</th>

        </tr>

            <tr >

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS</td>

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000> </td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000> </td>

            <tr>

            <tr bgcolor=C0C0C0>

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000> </td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000> </td>

            <tr>

            <tr >

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>

                <td nowrap><font size= "-1" color=000000> </td>
                <td nowrap><font size= "-1" color=000000>**RESULTS**</td>
                <td nowrap><font size= "-1" color=000000> </td>

            <tr>
</TABLE></TD></TR>
</TABLE></TD></TR>
</TABLE>
</Center>

Ideally, I would like to check if assets were found, and if they were, pull the results from under headers 1, 2, 3, 6, and 7 into a usable form (most likely a table or a .csv file). Any help is greatly appreciated.

Cameron
  • Can you get the HTML for the page? You may be able to use my answer from [this other question](http://stackoverflow.com/questions/25940510/how-to-extract-specific-tables-from-html-file-using-native-powershell-commands/25942395#25942395) to get the info you're looking for. – TheMadTechnician Jan 18 '17 at 23:14
  • Can you give access to the URL or give an example? – JPBlanc Jan 19 '17 at 04:56
  • I'm afraid it is a site designed by and used exclusively for the company I work for, hosted on our intranet, I cannot provide the full site. I will however edit my question with a snippet of the html – Cameron Jan 19 '17 at 14:52
  • @TheMadTechnician I did actually look at that question before posting this one, unfortunately I could not find a table id in anything other than the unique comobjects I got through using parsedhtml, which I could not for the life of me access – Cameron Jan 19 '17 at 15:13
  • You said there's two tables. What does the HTML look like for the input table, and is it always before the results table? Is there always an input table, and a results table (even if the results are that nothing was found)? – TheMadTechnician Jan 19 '17 at 16:13
  • The HTML for the input table is almost identical to that of the output table (including the formatting) with the exception that the input table is static and contains four input forms whereas the output table is dynamic. The input table remains the same regardless of what is input, whereas the results table may say 'no assets found' or contain a list of assets – Cameron Jan 19 '17 at 18:35

1 Answer

Ok, so if you ask around most people will strongly discourage parsing HTML with RegEx. They're probably right, but I'm stubborn and feel that RegEx is flexible enough to handle certain tasks, even in HTML. So I've adapted my answer in the linked question to what I think will work for you.

This relies on the fact that your innermost table, which contains the data you are looking for, starts with the line:

<table width="100%" cellpadding="5"

...and does not have another table embedded within it. So it's fairly specific, but it works with the examples that you've provided.

I created a here-string from your example as such:

$Sample = @"
<Center>
<TABLE CELLSPACING=0 CELLPADDING=0 BORDER=2><TR><TD>
<TABLE  CELLSPACING=0 CELLPADDING=2 BORDER=0>
<TR><TD BGCOLOR=3399FF ALIGN=CENTER><NOBR><FONT FACE="Arial" SIZE=+1><B>&nbsp;&nbsp; Search Results &nbsp;&nbsp;</B></FONT></NOBR></TD></TR>
<TR><TD><TABLE WIDTH=100% CELLSPACING=0 CELLPADDING=2 BORDER=0>
    <Center>
    <table width="100%" cellpadding="5" cellspacing="0">

        <tr bgcolor=A9A9A9>

        <th>HEADER1</th>
        <th>HEADER2</th>
        <th>HEADER3</th>
        <th>HEADER4</th>
        <th>HEADER5</th>
        <th>HEADER6</th>
        <th>HEADER7</th>
        <th>HEADER8</th>
        <th>HEADER9</th>
        <th>HEADER10</th>
        <th>HEADER11</th>
        <th>HEADER12</th>
        <th>HEADER13</th>

        </tr>

            <tr >

                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA</td>

                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>

                <td nowrap><font size= "-1" color=000000> </td>
                <td nowrap><font size= "-1" color=000000>**RESULTSA**</td>
                <td nowrap><font size= "-1" color=000000> </td>

            <tr>

            <tr bgcolor=C0C0C0>

                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>

                <td nowrap><font size= "-1" color=000000> </td>
                <td nowrap><font size= "-1" color=000000>**RESULTSB**</td>
                <td nowrap><font size= "-1" color=000000> </td>

            <tr>

            <tr >

                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>

                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>

                <td nowrap><font size= "-1" color=000000> </td>
                <td nowrap><font size= "-1" color=000000>**RESULTSC**</td>
                <td nowrap><font size= "-1" color=000000> </td>

            <tr>
</TABLE></TD></TR>
</TABLE></TD></TR>
</TABLE>
</Center>
"@

Then I looked for the specific string that I mentioned above using RegEx, and grabbed everything up to the next </table> tag.

[regex]$regex = '(?s)<table width="100%" cellpadding="5" .*?</TABLE>'
$tables = $regex.matches($Sample).groups.value
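
To see why the match stops where it does, here is a minimal sketch against a cut-down snippet. The `$Html`, `$Pattern`, and `$Found` names are made up for illustration, and `(?i)` is an addition here so that tag case does not matter, which the pattern above does not do. `(?s)` lets `.` span line breaks, and the lazy `.*?` ends the match at the first closing table tag, so the outer wrapper tables are left out.

```powershell
# Hypothetical two-table snippet: an outer wrapper and the inner data table.
$Html = @'
<TABLE BORDER=2><TR><TD>
<table width="100%" cellpadding="5" cellspacing="0">
<tr><td>No assets were found for the search</td></tr>
</TABLE></TD></TR>
</TABLE>
'@

# (?s): . also matches newlines. (?i): ignore case (an assumption added here).
# .*? is lazy, so the match ends at the first closing table tag it meets.
[regex]$Pattern = '(?si)<table width="100%" cellpadding="5" .*?</table>'
$Found = $Pattern.Matches($Html) | ForEach-Object { $_.Value }

@($Found).Count   # how many tables matched
$Found            # the matched inner table only
```

Note that this only isolates the right table when the opening tag is written exactly as in the pattern, including the spacing between attributes.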

After that I split that on the <tr> tags to get individual rows.

ForEach($String in $tables){
    $TableRows = $string -split '<tr.*?>'

The next three bits are all one line that I capture in a variable.

First on each row I looked for columns or headers, and I joined them with a comma.

$CurTable = $TableRows | ForEach-Object{$_ -split "(?s)</T(?:D|H)>.*?<T(?:D|H).*?>" -join ","

Then I stripped any remaining <TD>, </TD>, <TH>, and </TH> tags (along with any <TR> and <TABLE> tags) so that no leading or trailing tags are left behind. I also removed the <font> tags to keep things cleaner, as well as any line breaks, because any given row should only be one line.

-replace "<(/?T(D|H|R|ABLE)|font).*?>" -replace "[\r\n]"

Then trim any spaces or commas from the beginning or end of the lines, and only output lines that have text on them, and we actually end up with a pretty standard CSV.

| ForEach-Object{$_.Trim(' ,')} | ?{![string]::IsNullOrWhiteSpace($_)}

Once you have the CSV you can convert it to objects easily enough, select only the properties that you want, and export to a CSV, or use Out-GridView, or even simply Format-Table if you just want to see the text. Or filter the results... it gets pretty easy to work with the data from there.
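
As a rough illustration of that last step, the sketch below feeds a few made-up CSV lines (the `$CsvLines` values and the output path are placeholders, not data from the question) through `ConvertFrom-Csv` and keeps only some columns.

```powershell
# Made-up CSV lines standing in for the cleaned-up table rows.
$CsvLines = @(
    'HEADER1,HEADER2,HEADER3',
    'a1,a2,a3',
    'b1,b2,b3'
)

# The first line supplies the property names; each later line becomes an object.
$Rows = $CsvLines | ConvertFrom-Csv | Select-Object HEADER1, HEADER3
$Rows

# To write a file instead (placeholder path):
# $Rows | Export-Csv -Path .\Results.csv -NoTypeInformation
```

One caveat with this approach: a cell value that itself contains a comma would be split into two fields, because the lines are joined with bare commas and nothing is quoted.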

Now, there is the possibility that there are no results, in which case all you end up with is a single string, and not a CSV. What I did to accommodate that is to check whether the results were an array or not. If it is an array, you have data to work with. If it's not an array, then the results table had nothing in it, and I chose to simply output that to screen. Here's how I handled that:

    If($CurTable -is [array]){
        $CurTable |ConvertFrom-Csv|Select 'HEADER1','HEADER2','HEADER3','HEADER6','HEADER7' #|Export-Csv "C:\Path\To\Output\Results.csv" -NoTypeInformation
    }Else{
        $CurTable
    }
}
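
The array check works because PowerShell hands back a bare string, not a one-element array, when a pipeline emits a single item. A small sketch of that behaviour, with made-up variable names and sample lines:

```powershell
# One surviving line: the pipeline result is a plain string.
$One  = 'No assets were found for the search' | Where-Object { $_ }
# Two or more surviving lines: the result is an array.
$Many = 'HEADER1,HEADER2', 'a,b' | Where-Object { $_ }

$One  -is [array]   # False
$Many -is [array]   # True

# Alternative checks that do not depend on the scalar/array distinction:
$HasData  = @($Many).Count -gt 1                 # header plus at least one row
$NoAssets = $One -match 'No assets were found'   # the page's "no results" text
```

Matching on the "no results" text is closer to what the question asked for, at the cost of depending on the exact wording the page uses.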

My answer got pretty long, but the actual functional code boils down to just this:

[regex]$regex = '(?s)<table width="100%" cellpadding="5" .*?</TABLE>'
$tables = $regex.matches($Sample).groups.value
ForEach($String in $tables){
    $TableRows = $string -split '<tr.*?>'
    $CurTable = $tablerows|%{$_ -split "(?s)</T(?:D|H)>.*?<T(?:D|H).*?>" -join "," -replace "<(/?T(D|H|R|ABLE)|font).*?>" -replace "[\r\n]"} | ForEach-Object{$_.Trim(' ,')} | ?{![string]::IsNullOrWhiteSpace($_)}
    If($CurTable -is [array]){
        $CurTable |ConvertFrom-Csv|Select 'HEADER1','HEADER2','HEADER3','HEADER6','HEADER7' #|Export-Csv "C:\Path\To\Output\Results.csv" -NoTypeInformation
    }Else{
        $CurTable
    }
}

That will result in:

HEADER1 : **RESULTSA**
HEADER2 : **RESULTSA**
HEADER3 : **RESULTSA
HEADER6 : **RESULTSA**
HEADER7 : **RESULTSA**

HEADER1 : **RESULTSB**
HEADER2 : **RESULTSB**
HEADER3 : **RESULTSB**
HEADER6 : **RESULTSB**
HEADER7 : **RESULTSB**

HEADER1 : **RESULTSC**
HEADER2 : **RESULTSC**
HEADER3 : **RESULTSC**
HEADER6 : **RESULTSC**
HEADER7 : **RESULTSC**
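
To run the same steps against a live page instead of a here-string, one option is to wrap them in a function that takes the raw HTML as a string. The sketch below is only that: `ConvertTo-CsvLines` is a hypothetical name, the patterns are lower-cased with `(?i)` added to the table pattern, and the URL in the usage comment is a placeholder. It assumes the raw markup comes from the response's `Content` property, since the regex needs a string.

```powershell
function ConvertTo-CsvLines {
    # Hypothetical helper: turns the innermost table(s) into CSV-style lines.
    param([Parameter(Mandatory)][string]$Html)

    [regex]$Pattern = '(?si)<table width="100%" cellpadding="5" .*?</table>'
    foreach ($Match in $Pattern.Matches($Html)) {
        $Match.Value -split '<tr.*?>' |
            ForEach-Object {
                # Join the cells of one row with commas, then strip leftover tags.
                $Joined = $_ -split '(?s)</t(?:d|h)>.*?<t(?:d|h).*?>' -join ','
                ($Joined -replace '<(/?t(d|h|r|able)|font).*?>' -replace '[\r\n]').Trim(' ,')
            } |
            Where-Object { -not [string]::IsNullOrWhiteSpace($_) }
    }
}

# Local demo with a tiny made-up table.
$Demo = @'
<table width="100%" cellpadding="5" cellspacing="0">
<tr><th>H1</th>
<th>H2</th></tr>
<tr><td nowrap><font size="-1">a</td>
<td nowrap><font size="-1">b</td>
<tr>
</TABLE>
'@
$Lines = ConvertTo-CsvLines -Html $Demo
$Lines

# Against a real page (placeholder URL):
# $Response = Invoke-WebRequest -Uri 'http://intranet.example/search'
# $Lines = ConvertTo-CsvLines -Html $Response.Content
```

The returned lines can then go through the same array check and `ConvertFrom-Csv` step as above.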

Hopefully that's enough for you to work with in order to get what you need.

TheMadTechnician
  • So, the other table on the form (the input table) does have the same opener you used for your regex statement, however there is a heading (the search results heading) which I believe I should be able to use in the same way you did here. I'm going to play with it for a while and see how it goes. – Cameron Jan 19 '17 at 18:31
  • When you are using this method, do you first use invoke-webrequest to access the site and then use regex, setting the webrequest html equal to $sample in your example? I'm fairly new to powershell and I don't completely understand how you are getting your results – Cameron Jan 19 '17 at 19:42
  • Yeah, I didn't have a web page to run against, but you would do your `Invoke-WebRequest` and use the results of that to run the code against, probably using the `ParsedHTML` property. – TheMadTechnician Jan 19 '17 at 19:47
  • Unfortunately I still seem to be getting an odd error. 'Method invocation failed because [Microsoft.PowerShell.Commands.HtmlWebResponseObject] does not contain a method named 'op_Subtraction'. At line:2 char:1 + $html- $site.content + ~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (op_Subtraction:String) [], RuntimeException + FullyQualifiedErrorId : MethodNotFound' This is especially strange since I did not try to use any method like 'op_Subtraction'. – Cameron Jan 19 '17 at 20:14
  • That's nothing to do with my code, it looks like you are running `$html - $site.content` at some point. Was that supposed to be `=` instead of `-` perhaps? – TheMadTechnician Jan 19 '17 at 20:45
  • You are correct, sir. It still isn't giving me the correct output to the csv unfortunately, but it at least is no longer giving me errors, and I feel you have spent more than enough of your time helping me. If I had higher rep I would give you +1, generous sir – Cameron Jan 20 '17 at 15:35