I need to migrate a website to a new CMS. We do not have access to the original site except via http://mysite.com. We currently have a variety of scripts that (i) index the site, (ii) create some hierarchy, and (iii) scrape the unique content (i.e. ignore the header, footer, template, etc.). The scripts work quite well, except for indexing the site. Is there a good utility that can index all the unique URLs of a site?

Currently we use a mixture of

$oHTML = new simple_html_dom();
$oHTML->setBody(file_get_contents('http://mysite.com'));
foreach ($oHTML->find('a') as $oLink) {
    // queue $oLink->href here if it has not been visited yet
}

and a recursive function to hit all the unique links...

The question is: PHP is slow and hits memory limits fast. Is this the right approach? Could I use Sphinx or another open-source search engine to do it for me?
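For reference, the recursive approach can be replaced by an explicit queue, which keeps only one parsed page in memory at a time. The following is a minimal sketch, not the scripts described above: `crawlSite()`, `resolveLink()`, `$fetch` and `$maxPages` are hypothetical names, it uses PHP's built-in `DOMDocument` rather than simple_html_dom, and it assumes only absolute and root-relative links on the same host matter (path-relative links and non-default ports are skipped for brevity).

```php
<?php
// Hypothetical sketch: breadth-first crawl with an explicit queue instead of recursion.
// crawlSite(), resolveLink(), $fetch and $maxPages are illustrative names only.

// Turn an href into an absolute URL, or null if this sketch does not handle it.
function resolveLink(string $pageUrl, string $href): ?string
{
    $href = explode('#', trim($href), 2)[0];      // drop any #fragment
    if ($href === '') {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {     // already absolute
        return $href;
    }
    if ($href[0] === '/' && substr($href, 0, 2) !== '//') {   // root-relative
        $parts = parse_url($pageUrl);
        return $parts['scheme'] . '://' . $parts['host'] . $href;
    }
    return null;   // path-relative links, mailto:, javascript:, etc. are skipped
}

// Return every unique same-host URL reachable from $startUrl.
// $fetch is any callable that takes a URL and returns its HTML (or false/'' on failure).
function crawlSite(string $startUrl, callable $fetch, int $maxPages = 5000): array
{
    libxml_use_internal_errors(true);             // tolerate sloppy real-world markup
    $host = parse_url($startUrl, PHP_URL_HOST);
    $seen = [$startUrl => true];                  // URLs already queued
    $queue = new SplQueue();
    $queue->enqueue($startUrl);
    $fetched = 0;

    while (!$queue->isEmpty() && $fetched < $maxPages) {
        $url = $queue->dequeue();
        $html = $fetch($url);
        $fetched++;
        if (!is_string($html) || $html === '') {
            continue;
        }
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $link = resolveLink($url, $anchor->getAttribute('href'));
            if ($link === null || parse_url($link, PHP_URL_HOST) !== $host) {
                continue;                          // unsupported or off-site link
            }
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue($link);
            }
        }
        unset($dom);                               // release the parsed page before the next one
        libxml_clear_errors();
    }
    return array_keys($seen);
}
```

For a live crawl, `$fetch` could simply be `'file_get_contents'`; passing it in as a callable also makes the function easy to try against canned HTML first.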

Simon
  • Try looking at online sitemap generators to generate an XML file of all the site's URLs – Scott Nov 09 '10 at 15:57
  • I have looked at them previously. The site has about 3k pages of varying depths. None I have found have been satisfactory. – Simon Nov 09 '10 at 16:04
  • What is your plan for deploying Sphinx? For 3k URLs, Sphinx probably needs just a few seconds for indexing – ajreal Nov 09 '10 at 16:13
  • A few seconds? Surely the latency of requesting 3k URLs will be more than that. I have not used Sphinx before; I was just suggesting an alternative to writing my own indexer. – Simon Nov 09 '10 at 17:20
  • That is the time for Sphinx to re-index. The crawling part depends on your crawler script and the speed of the site you are crawling. Sphinx has great user support; feel free to post your question at http://sphinxsearch.com/ – ajreal Nov 09 '10 at 17:25
  • By the way, what does your "indexer" refer to? – ajreal Nov 09 '10 at 17:27

1 Answer

  1. Use wget to crawl the site and archive it to local disk.
  2. Once the crawl has completed, find all the files (assuming *.htm), strip the HTML tags from each one (for example with PHP's strip_tags), and insert the text into a database.
  3. Then use the Sphinx PECL library to do the indexing (sphinx::buildExcerpts).

Or, after step 2, just run the Sphinx indexer to re-index.
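Step 2 could look roughly like the sketch below. It assumes step 1 left a mirror of the site in a local directory whose layout matches the URL paths; `collectPages()`, `$mirrorDir` and `$baseUrl` are made-up names for illustration, and the database insert is left to the caller because the schema is not specified here. The function is a generator, so only one page is held in memory at a time.

```php
<?php
// Hypothetical sketch of step 2: walk a local mirror and yield URL => plain text.
// collectPages(), $mirrorDir and $baseUrl are illustrative names, not an existing API.

function collectPages(string $mirrorDir, string $baseUrl): Generator
{
    $root = rtrim($mirrorDir, '/');
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($files as $file) {
        // Assumption: pages were saved as *.htm or *.html; everything else is ignored.
        if (!$file->isFile() || !preg_match('/\.html?$/i', $file->getFilename())) {
            continue;
        }
        $html = file_get_contents($file->getPathname());
        if ($html === false) {
            continue;
        }
        // Remove script/style bodies first; strip_tags() alone would keep their text.
        $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
        $text = html_entity_decode(strip_tags($html));
        $text = trim(preg_replace('/\s+/', ' ', $text));

        // Rebuild the page URL from the file's path relative to the mirror root.
        $relative = substr($file->getPathname(), strlen($root));
        $relative = ltrim(str_replace('\\', '/', $relative), '/');

        yield rtrim($baseUrl, '/') . '/' . $relative => $text;
    }
}
```

Each yielded pair can be written to the database inside a `foreach (collectPages('/path/to/mirror', 'http://mysite.com') as $url => $text)` loop, and the keys alone give the list of unique URLs that the question asks for.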

ajreal