4

Say we have a project that requires web scraping: parsing strings (< 40) and scraping web pages (getting metadata and such). I am aware that Perl has great, well-suited CPAN modules for this job, so I could go that way and not bother myself much. But I don't have a clue about speed and memory usage.

So, which would you choose? (Maybe Python?) And in terms of speed, which one is better for this job? Please explain.

Thanks in advance.

wonnie
  • 1
    AFAIK both Python and Perl are better than PHP in terms of performance. – fabrik Apr 04 '11 at 12:25
  • Assembler: best performance guaranteed. You need to analyze your task and check whether the speed of a given implementation is enough. The question is too abstract. – Silver Light Apr 04 '11 at 12:27
  • 1
    @Silver Light, I have never been able to optimise assembler more than my C compiler. – tster Apr 04 '11 at 12:33
  • 1
    The only reason Python is (initially) faster than PHP is that Python is byte-compiled to .pyc files. The same thing happens with PHP if you use APC, where the interpreter isn't invoked every time a PHP script is requested. Using APC increases PHP performance many times over, to the point that it is actually faster in some tasks (I say some because I haven't tested every single feature of both languages and compared them). It's naive to claim Python is faster when not all the facts are mentioned. Now, what to use: which language do you know best? Use that one. – Michael J.V. Apr 04 '11 at 12:47
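
Several comments above come down to "measure it for your own task". A minimal sketch of such a measurement in Python follows. The helper `time_call`, the regex-based `parse_titles`, and the generated sample pages are all hypothetical stand-ins, not anything from the thread. In a real scraper the network fetch often takes longer than the parsing, so it is worth timing both separately.

```python
import re
import time


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time (seconds) over `repeat` runs of func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical workload: pull the <title> text out of many small pages.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def parse_titles(pages):
    """Return the stripped title of every page that has one."""
    matches = (TITLE_RE.search(page) for page in pages)
    return [m.group(1).strip() for m in matches if m]


# Synthetic input, generated locally so the sketch needs no network access.
pages = [
    "<html><head><title> Page %d </title></head></html>" % i
    for i in range(1000)
]

if __name__ == "__main__":
    seconds = time_call(parse_titles, pages)
    print("best of 5 runs: %.6f s for %d pages" % (seconds, len(pages)))
```

Running the same workload through each candidate language, on pages that resemble the real ones, gives a more useful answer than general claims about which interpreter is faster.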

3 Answers

4

Use Perl or Python. Both have tons of libraries for web scraping.

In Python you could use BeautifulSoup to parse even the crappy kind of HTML lots of pages like to use.
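
To give a feel for the "getting metadata" part of the question, here is a rough sketch that collects the title and `<meta>` tags from a page. Note that it deliberately uses `html.parser` from the Python standard library instead of BeautifulSoup, so that it runs without installing anything; BeautifulSoup would be the more forgiving choice for badly broken markup. The class `MetaExtractor`, the function `extract_metadata`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name/property=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Plain metadata uses name=..., Open Graph style tags use property=...
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return {'title': str, 'meta': dict} for one HTML document."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


# Hypothetical sample page, slightly sloppy on purpose (unclosed <p>, stray <body>).
SAMPLE_HTML = """
<html><head>
<title>Example page</title>
<meta name="description" content="A sample page">
<meta property="og:type" content="article">
<meta charset="utf-8">
</head><body><p>Unclosed paragraph<body></html>
"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HTML))
```

The `<meta charset=...>` tag is skipped because it has neither a `name` nor a `property` attribute. Fetching the page itself is left out here, since that part depends on the site and on whether JavaScript has to be evaluated.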

ThiefMaster
  • HTML::TreeBuilder ( http://search.cpan.org/perldoc?HTML::TreeBuilder ) is a great library for digging into specific bits of a web page. It is built on the amazing HTML::Parser ( http://search.cpan.org/perldoc?HTML::Parser ) library which devours horrible "HTML" with aplomb. – daotoad Apr 06 '11 at 03:46
3

I once successfully used Perl with WWW-Mechanize in such a context. Hopefully you don't need to evaluate .js.

mbx
1

I would go with Perl. I heard a rumor that it was the language Google used initially. Python performs well too.

maazza
fingerman