I want to extract elements from a webpage.

$html = file_get_contents($link);

That function returns the complete HTML document, but I only want the internal and external links so that I can save them in the database.

$sql = "INSERT INTO prueba (link, title, description) VALUES (?, ?, ?)";

// prepare the statement
$query = $pdo->prepare($sql);

// execute the statement
$result = $query->execute([
  $link,
  $title_out,
  $description
]);

I can already extract the title and the description and store them in the database, but I also want to extract all the links: the internal links in one column and the external links in another. Both columns already exist in the table.

Shadow
2 Answers


I suggest using a DOM-Parser library like:

Parse the HTML and simply query for all anchors (a tags).

This is much less error-prone than trying to extract them yourself with regexes, for example.
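
As a rough illustration of the "parse and query for anchors" idea without any third-party library, here is a minimal sketch using PHP's built-in DOM extension. The function name extractHrefs is hypothetical and not from the original post; it assumes the HTML string has already been fetched, for example with file_get_contents($link).

```php
<?php
// Hypothetical helper (not from the original post):
// returns the raw href value of every <a> tag found in $html.
function extractHrefs(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // Anchors without an href (named anchors, for instance) are skipped.
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = trim($anchor->getAttribute('href'));
        }
    }

    return $hrefs;
}
```

Calling extractHrefs($html) on the fetched page would give a flat array of href strings, relative and absolute mixed together, which still has to be split into internal and external links.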

Tobias K.
  • I don't think I need a complete library for such a short job. What I am more interested in is how to separate the two kinds of links, internal and external, because the web crawler will have to follow the external links to reach other pages and crawl those pages as well. – Diesan Romero Jul 01 '18 at 20:25

HTML scraping

For that I advise you to use open-source libraries that provide helper functions for navigating the DOM. Without one you'll have to maintain much more code. If you scrape multiple pages with regexes, you'll have to update your regex queries every time one of those pages changes.

You don't want that ^^'

Here is an example using the "Goutte" library (I hope you are on PHP 5.5 or later):

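// Assumes Goutte is installed via Composer and $link holds the page URL.
$client = new \Goutte\Client();
$crawler = $client->request('GET', $link);

$links = [];
// Capture $links by reference so the closure can append to the outer array.
$crawler->filter('a')->each(function ($node) use (&$links) {
    $links[] = $node->attr('href');
});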

$links now contains the href attribute of every a tag on the page.

For more examples of node traversal, please see this link

Use your database logic to persist this data.

Sorry if there is an error in the Goutte code; I don't use it often.
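
Before persisting, the hrefs still need to be separated into internal and external links, which is what the question actually asks about. One possible approach, sketched below under some assumptions, is to compare each link's host with the host of the page being crawled. The function name classifyLinks is hypothetical, and the sketch treats www.example.com and example.com as different hosts, which may or may not be what you want.

```php
<?php
// Hypothetical helper (not from the original post):
// splits hrefs into internal and external links relative to $pageUrl.
function classifyLinks(array $hrefs, string $pageUrl): array
{
    $pageHost = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));
    $result = ['internal' => [], 'external' => []];

    foreach ($hrefs as $href) {
        $href = trim((string) $href);

        // Skip empty links, in-page fragments, and non-navigable schemes.
        if ($href === '' || $href[0] === '#'
            || preg_match('/^(mailto|tel|javascript):/i', $href)) {
            continue;
        }

        $host = parse_url($href, PHP_URL_HOST);

        // No host means a relative URL, which points at the same site.
        if ($host === null || $host === false
            || strtolower($host) === $pageHost) {
            $result['internal'][] = $href;
        } else {
            $result['external'][] = $href;
        }
    }

    return $result;
}
```

For example, classifyLinks($links, $link) returns an array with an 'internal' list and an 'external' list. How those lists are stored (one row per link, or a serialized list in each of the two columns) depends on the table schema, which the question does not show.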

Mcsky