I need to migrate a website to a new CMS. We do not have access to the original site except via http://mysite.com. We currently have a variety of scripts that (i) index the site, (ii) create a hierarchy, and (iii) scrape the unique content (i.e. ignore the header/footer/template etc.). The scripts actually work quite well, except for indexing the site. Is there a good utility that can index all the unique URLs of a site?
Currently we use a mixture of
$oHTML = new simple_html_dom();
$oHTML->load(file_get_contents('http://mysite.com')); // simple_html_dom parses a string via load(), not setBody()
foreach ($oHTML->find('a') as $oLink) { /* collect $oLink->href */ }
plus a recursive function to follow all the unique links...
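For what it's worth, the recursion is often what blows the memory limit, not PHP itself. A minimal sketch of the same crawl done iteratively with an explicit queue and a visited set (the function names `crawlSite`/`resolveUrl` and the injectable `$fetch` callable are my own assumptions, not anything from your scripts), using the bundled DOMDocument instead of simple_html_dom so each page's parse tree is freed per iteration:

```php
<?php
// Iterative breadth-first crawl: an SplQueue frontier plus a visited map,
// so memory is bounded by the frontier size rather than recursion depth.
// $fetch is a callable (url => html string, or false) so the crawl can be
// tested without touching the network -- a hypothetical seam, not a real API.
function crawlSite(string $startUrl, callable $fetch): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = new SplQueue();
    $visited = [];

    $queue->enqueue($startUrl);
    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if ($html === false || $html === '') {
            continue;
        }

        // DOMDocument ships with PHP; its memory is released when $dom
        // goes out of scope at the end of each loop iteration.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('a') as $a) {
            $abs = resolveUrl($url, $a->getAttribute('href'));
            if ($abs !== null
                && parse_url($abs, PHP_URL_HOST) === $host
                && !isset($visited[$abs])) {
                $queue->enqueue($abs);
            }
        }
    }
    return array_keys($visited);
}

// Minimal resolver for this sketch: absolute http(s) URLs pass through,
// root-relative paths are joined to scheme+host, everything else
// (fragments, mailto:, dot-relative paths) is skipped.
function resolveUrl(string $base, string $href): ?string
{
    if ($href === '' || $href[0] === '#' || strpos($href, 'mailto:') === 0) {
        return null;
    }
    if (preg_match('#^https?://#', $href)) {
        return $href;
    }
    if ($href[0] === '/') {
        $p = parse_url($base);
        return $p['scheme'] . '://' . $p['host'] . $href;
    }
    return null;
}
```

In real use you'd pass something like `function ($u) { return @file_get_contents($u); }` as `$fetch`, and you'd still want rate limiting and a proper relative-URL resolver; this only shows the queue-instead-of-recursion shape.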
The question is: PHP is slow for this and hits memory limits fast. Is this the right approach? Can I use Sphinx or some open-source search engine / crawler to do the indexing for me?