I really want to make a website crawler that goes to a website, scans it for links, puts the links in a database, and moves on to another website. I found one website with example code, but the code was really buggy. Have you seen anything like this, or have you written one yourself?
- How many sites are you wanting to crawl? Unless you are spawning multiple PHP processes on the server, you are going to have trouble. PHP is single-threaded, and you won't be efficiently crawling pages. – Brad Jan 19 '11 at 15:14
- `please post the code, not the website!` I highly discourage/disagree with that; the website will be of much greater use than pre-cooked code, also for future reference. – orlp Jan 19 '11 at 15:14
- Is there any other language that is more efficient? I just want a web crawler. – Jan 19 '11 at 15:16
- You'll find more ready-made crawlers in the Perl area. WWW::Mechanize comes to mind. – mario Jan 19 '11 at 15:19
- I don't really know Perl, so if possible make it in PHP/Python/JS. – Jan 19 '11 at 15:21
- A GUI for Mac would also work. – Jan 19 '11 at 15:21
- Begging won't get you anywhere, have some dignity. – RobertPitt Jan 19 '11 at 15:32
- You should read this similar item: http://stackoverflow.com/questions/1733599/is-there-a-list-of-known-web-crawlers – Christa Jan 19 '11 at 15:56
2 Answers
You probably won't find anything suitable in PHP, as it is generally used for short-running pages. Many servers, for example, are set to time out after 30 seconds. You can write PHP for command-line scripts, but I suspect that's not what you want.

Anyway, if you want a pre-packaged solution, why care about the language?

I would recommend something like wget to crawl the sites and save them to disk. Then you can iterate over the files and directories and pull out links. The hard bit is crawling the sites (that part is not simple); writing the code to pull out the links is not too difficult.
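Since the asker mentioned PHP/Python/JS, here is a minimal sketch in Python of the "scan for links, put them in a database" step described above, using only the standard library. It assumes the page HTML has already been fetched (e.g. via wget, as suggested), and the `links(source, target)` table layout is just an illustrative assumption, not a prescribed schema.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links like "/about" against the page URL.
                    self.links.append(urljoin(self.base_url, value))

def store_links(db_path, page_url, html):
    """Extract links from one page's HTML and insert them into SQLite."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS links (source TEXT, target TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)",
                     [(page_url, link) for link in parser.links])
    conn.commit()
    conn.close()
    return parser.links
```

To turn this into an actual crawler you would wrap it in a loop: fetch a URL, call `store_links`, then pick unvisited targets from the table and repeat, which is where the hard parts (politeness, deduplication, error handling) come in.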

Joe