I need some help comparing different programming languages, such as C++, Java, Python, Ruby and PHP, for a task related to web data mining (developing a web crawler, string manipulation, etc.). I have a bit of experience with PHP, and I think its advantages for this particular task are simple syntax, in-depth string parsing capabilities, networking functions, and portability, but I don't know much about the other languages and their advantages and disadvantages for this task.
Different languages for doing what? Data mining on the web is a complicated task, and it's not clear what you'll be doing. In addition, it depends on your knowledge and experience, how much you're willing to learn, whether this needs to be professional quality, and quite a few other things. – David Thornley Nov 16 '09 at 20:36
I retagged your question to [tag:web-scraping] instead of [tag:data-mining] (which refers to the analysis, not the data extraction). – Has QUIT--Anony-Mousse Aug 09 '12 at 03:31
3 Answers
The specific language will not matter nearly as much as your familiarity with it. These days, all high-level languages come with the basics. Unless you need it to be super-fast (you're probably going to be limited by download speed, not by the speed at which you parse the HTML) or have other constraints not listed, the language won't matter that much.
Just make sure that you use libraries: in particular, an HTML parsing library that copes well with invalid markup (not an XML parser), and regular expressions where appropriate.
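To illustrate the point about lenient HTML parsing plus regular expressions, here is a minimal sketch in Python using only the standard library. The `LinkExtractor` class, the `SAMPLE` markup and the `EMAIL_RE` pattern are made up for this example; the email pattern is deliberately loose and is not a full address validator.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately broken HTML (an assumption for illustration): unclosed <p>,
# <b> and <a> tags, an unquoted attribute value, and no closing </body>.
SAMPLE = """
<html><body>
<p>Contact: <b>sales@example.com
<a href=/about>About<p>
<a href="http://example.com/news">News</a>
"""

# Loose pattern for things that look like email addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

parser = LinkExtractor()
parser.feed(SAMPLE)
parser.close()

links = parser.links                # structure comes from the HTML parser
emails = EMAIL_RE.findall(SAMPLE)   # flat text patterns come from the regex

print(links)
print(emails)
```

The split is the design choice here: the parser handles the tag structure, which a strict XML parser would reject on this input, and the regex is only used for a pattern in plain text.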

As a previous post implies, being familiar with the language makes a big difference. I would also say look at what the language was originally designed to do, as that gives a good idea of what it's best at.
PHP - Designed for server-side scripting; not really ideal for this use.
Perl - Designed to pull text apart (a good start) and has excellent libraries: look at LWP and the modules under HTML, such as HTML::TreeBuilder. A good choice, with an unrivalled selection of modules to plug in.
Python - A good choice; look at BeautifulSoup and urllib.
Ruby - Also a good choice; look at Hpricot. Ruby is a lot less mature than Perl or Python in terms of modules available.
I have written quite a bit of web spider/data mining software and have always used Perl. If I were starting from scratch today, I might choose Python.
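For a feel of what the Python route looks like, here is a rough sketch of a tiny breadth-first spider built on `urllib` and the standard-library HTML parser. The names `crawl`, `fetch` and `max_pages` are hypothetical, and the sketch assumes UTF-8 pages and ignores robots.txt, rate limiting and retries, all of which a real crawler would need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class _Links(HTMLParser):
    """Gather the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fetch(url):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl; returns {url: html} for the pages visited."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://example.com/", max_pages=5)` would fetch up to five pages starting from that URL. Passing a different `fetch_page` function makes it possible to try the logic against canned pages without touching the network.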

Google's first crawler was written in Python 1.5.
I'm no expert on other languages, but I would go with Python and html5lib or BeautifulSoup.
