
I want to build a project that parses wiki pages and extracts the information I need from them. I have looked at some crawlers and DOM parsers, such as the Apache Nutch crawler and Simple DOM Parser. Parsing wiki pages with core PHP is very slow.

But I can't work out:

  • Which tools can I use to get the best-optimised result?

  • How do I integrate a crawler like Nutch with PHP?

  • How do I store the data fetched by the crawler in MySQL?

  • How should I organize the data fetched by the crawler?

  • How much of regular expressions do I need to learn?

I am new to crawler projects.

Thanks in advance for your time. I don't know why people closed my question; please reopen it.

Has QUIT--Anony-Mousse
sandeep
  • This question is much too broad. Some parts of the question are probably off topic, some parts might be possible to re-post as individual questions. – Christofer Eliasson Mar 19 '12 at 11:15
  • What has your research turned up so far? – halfer Mar 19 '12 at 11:15
  • Question seems interesting, my friend. – Paresh Balar Mar 19 '12 at 11:19
  • @Christofer I know this is broad, but I wrote out all the specifics so that answerers can answer properly – sandeep Mar 19 '12 at 11:19
  • @halfer From the wiki pages I have to get the page title and some data-mining-type information – sandeep Mar 19 '12 at 11:20
  • @sandeep - I understand that. But what code have you got so far? Or have you found any PHP libraries on search engines that might be useful? Share with us `:)` – halfer Mar 19 '12 at 11:25
  • I found a simple PHP crawler that is out of date and very slow. I installed the Nutch crawler, which is written in Java, but I can't configure it to work with PHP – sandeep Mar 19 '12 at 11:26
  • Cool, add hyperlinks to your question? – halfer Mar 19 '12 at 11:31
  • @halfer Which kind of hyperlink? – sandeep Mar 19 '12 at 11:34
  • @halfer: What do you know about parsing HTML pages and scraping information? – Paresh Balar Mar 19 '12 at 11:37
  • @sandeep - I meant add hyperlinks to your research, to reassure people that you've studied the problem thoroughly before bringing it here (which you've now done - thanks). In general, a code example is good too - if you keep problems highly specific, they won't be closed. – halfer Mar 19 '12 at 12:19
  • @halfer Thanks, but some people closed my question :( – sandeep Mar 19 '12 at 12:20
  • For example, you could say "I've tried using SimpleXML and XMLReader to access the MediaWiki API, but it didn't let me drill down into actor information". But your question was about scraping, database design, optimisation, and regular expressions! It's too much for a single question. Break it down into parts, do the designing yourself, and ask individual how-to questions here (e.g. "I've got an HTML document scraped using file_get_contents() but it won't parse as XML. What is the best strategy/library to isolate actor name information from the text? I am using PHP 5.2 on Linux"). – halfer Mar 19 '12 at 12:25
  • Yep, the question is closed. But you should now have a better idea of what research/design you need to do, and if you run into a _specific_ technical problem, ask a new question `:)` – halfer Mar 19 '12 at 12:26
  • **This is not data-mining**. I replaced the tag with the more appropriate tag [tag:web-scraping]. Data mining refers to a special type of *statistical analysis* of data, not just information extraction (otherwise it would be called [tag:information-extraction]). – Has QUIT--Anony-Mousse Mar 19 '12 at 20:14

1 Answer


There is a built-in MediaWiki API that is available on Wikipedia, and there are some PHP usage examples.

The web service API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests to the web service.
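To make the "HTTP requests to the web service" part concrete, here is a minimal sketch of building a query URL and pulling the page wikitext out of the JSON reply. It is written in Python for illustration; the same request can be issued from PHP. The helper names `build_query_url` and `extract_wikitext` are hypothetical, the sample response is hand-written rather than real API output, and the layout assumed is the legacy JSON format where the content sits under the `"*"` key, so check the current API documentation before relying on it.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the example URL in this thread (https assumed).
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a URL asking for the latest revision content of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_wikitext(response_text):
    """Return {page title: wikitext} from a JSON reply.

    Assumes the legacy layout: query -> pages -> <id> -> revisions[0]["*"].
    """
    data = json.loads(response_text)
    texts = {}
    for page in data["query"]["pages"].values():
        revisions = page.get("revisions", [])
        if revisions:
            texts[page["title"]] = revisions[0].get("*", "")
    return texts


# Hand-written sample in the assumed layout (not real API output).
SAMPLE_RESPONSE = json.dumps({
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "title": "Example Page",
                "revisions": [{"*": "'''Example Page''' is a sample."}],
            }
        }
    }
})

if __name__ == "__main__":
    print(build_query_url("Example Page"))
    print(extract_wikitext(SAMPLE_RESPONSE))
```

Fetching the URL itself (with any HTTP client) is left out so the sketch runs without network access. Asking the API for one page at a time like this avoids crawling and parsing rendered HTML altogether.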

Manse
  • Yes, I installed MediaWiki, but it is only for building a site like Wikipedia. I need to get useful information from a page. I don't want the whole page, just some of its data – sandeep Mar 19 '12 at 11:39
  • @sandeep Sorry, I don't understand... what "data"? – Manse Mar 19 '12 at 11:42
  • I want data like an actor and his birth date from the page – sandeep Mar 19 '12 at 11:43
  • @sandeep You can get the data you need; it's at the top... [see this example](http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=john_Resig&rvprop=content) – Manse Mar 19 '12 at 11:50
  • Thanks @ManseUK for your effort. My question was closed by many people – sandeep Mar 19 '12 at 11:51
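For the "actor and his birth date" case raised in the comments, a rough sketch of pulling a birth date out of page wikitext might look like the following (Python used for illustration). It assumes the page's infobox uses a `{{birth date|...}}` or `{{birth date and age|...}}` template, which many but not all biography pages do; the function name `extract_birth_date` and the sample wikitext are made up for this example.

```python
import re

# Matches {{birth date|...}} and {{birth date and age|...}}, case-insensitive.
# Assumption: the template contains no nested templates.
BIRTH_DATE_PATTERN = re.compile(
    r"\{\{\s*birth date(?: and age)?\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)


def extract_birth_date(wikitext):
    """Return (year, month, day) from the first birth date template, or None."""
    match = BIRTH_DATE_PATTERN.search(wikitext)
    if match is None:
        return None
    # Keep only the numeric parameters; named ones such as df=yes are skipped.
    numbers = [
        part.strip()
        for part in match.group(1).split("|")
        if part.strip().isdigit()
    ]
    if len(numbers) < 3:
        return None
    year, month, day = (int(n) for n in numbers[:3])
    return (year, month, day)


# Made-up wikitext in the assumed infobox style (not copied from a real page).
SAMPLE_WIKITEXT = (
    "{{Infobox person\n"
    "| name = Example Actor\n"
    "| birth_date = {{birth date and age|1970|1|15}}\n"
    "}}\n"
    "'''Example Actor''' is a fictional person."
)

if __name__ == "__main__":
    print(extract_birth_date(SAMPLE_WIKITEXT))
```

A single narrow pattern like this is about as much regular expression as the task needs; pages that write the date as free text would need a different approach.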