Sorry for the long post. Please read to the end to see what I am trying to accomplish and where I am stuck.

I have a table like this

<html>
   <body>
      <table class="searchTable" cellspacing="0" cellpadding="5" style="width: 100%;">
         <tbody>
            <tr>
               <th>Book</th>
               <th>Model Name</th>
               <th>Description</th>
               <th>Category</th>
            </tr>
            <tr>
               <td>
                  <a onclick="getFullData( '', 'K0072', 'B20' );" href="">K0072</a>
               </td>
               <td>B20</td>
               <td>K0072 Description</td>
               <td>K0072 Category</td>
            </tr>
            <tr>
               <td>
                  <a onclick="getFullData( '', 'K0074', 'B2004' );" href="">K0074</a>
               </td>
               <td>B2004</td>
               <td>K0074 Description</td>
               <td>K0074 Category</td>
            </tr>
            <tr>
               <td>
                  <a onclick="getFullData( '', 'K0081', 'B2005' );" href="">K0081</a>
               </td>
               <td>B2005</td>
               <td>K0081 Description</td>
               <td>K0081 Category</td>
            </tr>
         </tbody>
      </table>
   </body>
</html>

Please note that I am fetching the data from another website with a cURL POST request, which means I have no control over the HTML.

I am able to generate the following array from the above HTML using DOMDocument.

array (size=3)
  0 => 
    array (size=4)
      0 => string 'K0072' (length=5)
      1 => string 'B20' (length=3)
      2 => string 'K0072 Description' (length=17)
      3 => string 'K0072 Category' (length=14)
  1 => 
    array (size=4)
      0 => string 'K0074' (length=5)
      1 => string 'B2004' (length=5)
      2 => string 'K0074 Description' (length=17)
      3 => string 'K0074 Category' (length=14)
  2 => 
    array (size=4)
      0 => string 'K0081' (length=5)
      1 => string 'B2005' (length=5)
      2 => string 'K0081 Description' (length=17)
      3 => string 'K0081 Category' (length=14)

This is my code:

$doc = new DOMDocument();
$doc->loadHTML( getHtml() );
$doc->preserveWhiteSpace = false;
$doc->encoding = 'UTF-8';

$tables  = $doc->getElementsByTagName( 'table' );
$dataArray = array();
$count = 0;

foreach( $tables as $table ) {

   if ( $table->getAttribute( 'class' ) !== 'searchTable' ) {
        continue;
   }
            
   // $header = $doc->getElementsByTagName( 'th' );
   // $rows   = $doc->getElementsByTagName( 'tr' );
   // Scope the lookup to the current table rather than the whole document
   $data   = $table->getElementsByTagName( 'td' );

   $tempArray = array();

   foreach ( $data as $td ) {
      $value = trim( $td->textContent );
      $tempArray[$count] = $value;
      $count++;
                
      if( 0 !== $count && $count % 4 === 0 ) {
         array_push( $dataArray, $tempArray  );
         $tempArray = array();
         $count = 0;
      }
   }
}

var_dump( $dataArray );
die();

The problem is that I cannot extract the argument values of the getFullData call for each record. I need them to build a URL per record, for example: https://pi.php?part=K0072&m=B20.

<tr>
    <td>
       <a onclick="getFullData( '', 'K0072', 'B20' );" href="">K0072</a>
    </td>
    ...
</tr>

I saw somewhere (can't remember where now! :( ) that DOMXPath can be used to find DOM elements by attribute, so here I could probably use the onclick attribute.

But the problem is that the source document has other anchors as well, which call different methods. This means there would be unnecessary records to filter out in PHP.

Is there a way that allows me to extract only those anchors which are calling the getFullData method? Also how would I extract the argument values?

In the end, the final array has to look like this:

array (size=3)
  0 => 
    array (size=5)
      0 => string 'K0072' (length=5)
      1 => string 'B20' (length=3)
      2 => string 'K0072 Description' (length=17)
      3 => string 'K0072 Category' (length=14)
      4 => string 'https://pi.php?part=K0072&m=B20'
  1 => 
    array (size=5)
      0 => string 'K0074' (length=5)
      1 => string 'B2004' (length=5)
      2 => string 'K0074 Description' (length=17)
      3 => string 'K0074 Category' (length=14)
      4 => string 'https://pi.php?part=K0074&m=B2004'
  2 => 
    array (size=5)
      0 => string 'K0081' (length=5)
      1 => string 'B2005' (length=5)
      2 => string 'K0081 Description' (length=17)
      3 => string 'K0081 Category' (length=14)
      4 => string 'https://pi.php?part=K0081&m=B2005'

Any suggestion?

UPDATE:

Thank you Chris Haas for pointing me in a direction. Taking the idea, I just tried this and got some promising results!

$xPath = new DOMXPath( $doc );
$ancs = $xPath->query( "//a[@onclick]" );
foreach( $ancs as $a ) {
    var_dump ( $a->getAttribute( 'onclick' ) );
}
Subrata Sarkar
  • First, if that document is representative of the whole, you don’t need to think about JS, the arguments are always the first two columns AFAICT. Second, I’d recommend going back to scanning for TR tags, then sub-scanning for TD tags instead of doing the mod 4 test. In that sub-scan, you could also test if the first TD has a child A, and if it has an `onclick` attribute. If my first assumption isn’t correct, you could then parse it with RegEx – Chris Haas Sep 09 '21 at 11:46
  • @ChrisHaas thank you for the quick reply. The document would be longer in reality, but there will only be more `...`. I kind of understand what you are saying, but it would be extremely helpful if you could provide some example code. – Subrata Sarkar Sep 09 '21 at 11:56

2 Answers

I'm not going to rewrite everything for you, but this should hopefully give you the gist. I'm completely skipping XPath. Some people really like it, and its selectors allow more concise rules, but I think simple is often best for starters.

As noted in my comment, I'm grabbing each individual tr first, sanity checking the contents, and skipping any row that doesn't match. For the onclick, I'm using a RegEx to strip the function wrapper and then parsing the remainder as CSV, since that is basically what it is.

Here's an online sample of this, too.

$html = <<<'TAG'
<html>
   <body>
      <table class="searchTable" cellspacing="0" cellpadding="5" style="width: 100%;">
         <tbody>
            <tr>
               <th>Book</th>
               <th>Model Name</th>
               <th>Description</th>
               <th>Category</th>
            </tr>
            <tr>
               <td>
                  <a onclick="getFullData( '', 'K0074', 'B2004' );" href="">K0074</a>
               </td>
               <td>B2004</td>
               <td>K0074 Description</td>
               <td>K0074 Category</td>
            </tr>
        </tbody>
        </table>
    </body>
</html>
TAG;

$doc = new DOMDocument();
$doc->loadHTML($html);
$tables = $doc->getElementsByTagName('table');
foreach ($tables as $table) {
    if ($table->getAttribute('class') !== 'searchTable') {
        continue;
    }

    $rows = $table->getElementsByTagName('tr');
    foreach ($rows as $row) {
        $cells = $row->getElementsByTagName('td');

        // Sanity check that we have four cells or skip the row
        if (4 !== count($cells)) {
            continue;
        }

        // Sanity check that the first cell has a link inside it
        $firstCellLinks = $cells[0]->getElementsByTagName('a');
        if (1 !== count($firstCellLinks)) {
            continue;
        }

        // Make sure the first link has an onclick attribute
        if (!$firstCellLinks[0]->hasAttribute('onclick')) {
            continue;
        }

        // Finally, get the contents of the cell. This can be simplified to a one-liner but
        // I've expanded it to be more obvious. getAttribute() looks the attribute up by name.
        $firstCellLinkOnClick = $firstCellLinks[0]->getAttribute('onclick');
        $firstCellLinkOnClickParamsAsString = preg_replace('/getFullData\(([^)]+)\);/', '$1', $firstCellLinkOnClick);
        $firstCellLinkOnClickParamsAsArray = str_getcsv($firstCellLinkOnClickParamsAsString, ',', "'");

        print_r($firstCellLinkOnClickParamsAsArray);
        /*
        Array
        (
            [0] =>
            [1] => K0074
            [2] => B2004
        )
         */
    }
}
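
To get from the parsed arguments to the URL the question asks for, something along these lines might work. This is only a sketch: `parseGetFullDataArgs` and `buildPartUrl` are hypothetical helper names, the base URL is the placeholder from the question, and it assumes every argument is a single-quoted literal with no embedded quotes or parentheses. It uses `preg_match_all` rather than the CSV approach above, purely to keep the helper small.

```php
<?php
// Hypothetical helper: pull the quoted arguments out of an onclick string
// such as "getFullData( '', 'K0072', 'B20' );". Returns null for other calls.
function parseGetFullDataArgs(string $onclick): ?array
{
    // Only accept handlers that call getFullData(...)
    if (!preg_match('/getFullData\s*\((.*?)\)/', $onclick, $call)) {
        return null;
    }
    // Collect every single-quoted argument, including empty ones
    preg_match_all("/'([^']*)'/", $call[1], $args);
    return $args[1];
}

// Hypothetical helper: build the detail URL from the 2nd and 3rd arguments.
// The base URL is the placeholder used in the question.
function buildPartUrl(array $args, string $base = 'https://pi.php'): string
{
    return $base . '?' . http_build_query(['part' => $args[1], 'm' => $args[2]]);
}

$args = parseGetFullDataArgs("getFullData( '', 'K0072', 'B20' );");
echo buildPartUrl($args), PHP_EOL;
```

Inside the row loop you could call `parseGetFullDataArgs()` on the onclick value and append the result of `buildPartUrl()` as the fifth element of each record.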
Chris Haas

Try something like this, using xpath:

$dom = new DOMDocument();
// loadHTML() tolerates real-world markup; loadXML() would require well-formed XML
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$dataArray = array();
$baseUrl = 'https://pi.php';
// Only rows that contain a link
$targets = $xpath->query("//tr[.//a]");
foreach ($targets as $tr) {
    $tempArray = array();
    foreach ($xpath->query('.//td', $tr) as $target) {
        $tempArray[] = trim($target->textContent);
    }
    // Assumes the first two columns mirror the getFullData() arguments
    $query = http_build_query(array('part' => $tempArray[0], 'm' => $tempArray[1]));
    $tempArray[] = $baseUrl . '?' . $query;
    $dataArray[] = $tempArray;
}
var_dump($dataArray);
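
If the page also contains anchors that call other functions, the XPath expression itself can probably do the filtering. The sketch below assumes the arguments are always three single-quoted literals. The `someOtherCall` row is made up purely to show a row being skipped, and the trimmed `$html` stands in for the cURL response.

```php
<?php
// Sample markup trimmed from the question; the second data row is hypothetical.
$html = <<<'HTML'
<table class="searchTable">
  <tr><th>Book</th><th>Model Name</th><th>Description</th><th>Category</th></tr>
  <tr>
    <td><a onclick="getFullData( '', 'K0072', 'B20' );" href="">K0072</a></td>
    <td>B20</td><td>K0072 Description</td><td>K0072 Category</td>
  </tr>
  <tr>
    <td><a onclick="someOtherCall( 'x' );" href="">Other</a></td>
    <td>-</td><td>-</td><td>-</td>
  </tr>
</table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
// Select only anchors whose onclick mentions getFullData, inside the target table
$query = "//table[@class='searchTable']//a[contains(@onclick, 'getFullData')]";
$pattern = "/getFullData\(\s*'([^']*)'\s*,\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/";
foreach ($xpath->query($query) as $a) {
    // Capture the three quoted arguments; skip anything that does not fit
    if (!preg_match($pattern, $a->getAttribute('onclick'), $m)) {
        continue;
    }
    // Walk up to the enclosing row and collect its cell texts
    $tr = $xpath->query('ancestor::tr[1]', $a)->item(0);
    $row = [];
    foreach ($xpath->query('./td', $tr) as $td) {
        $row[] = trim($td->textContent);
    }
    // Build the URL from the 2nd and 3rd arguments (placeholder base URL)
    $row[] = 'https://pi.php?' . http_build_query(['part' => $m[2], 'm' => $m[3]]);
    $rows[] = $row;
}
print_r($rows);
```

Taking the URL parts from the onclick arguments rather than from the cell text keeps the result correct even if the visible text and the arguments ever differ.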
Jack Fleeting