4

I am creating a Spring application that needs to integrate with Wikipedia. In particular, I would like to extract data on a given (large) set of cities, e.g. country, website and coordinates.

I am trying to understand which libraries or frameworks can be useful, but the big issue I am dealing with is that there is no reference structure for the pages I would like to extract information from. The main problem is not that some information is missing, which would be totally acceptable, but rather that the city representation changes from city to city. E.g. the DBpedia ontology (e.g. http://dbpedia.org/ontology/City) does not reflect what I can extract via a SPARQL query from dbpedia.org/sparql. As a result, I don't know how to extract the data I need systematically (i.e. for my entire set).

Is there any technology that can support my task of extracting data on a predefined set of cities?

One solution could be to put some Natural Language Processing in place to search the entire Wikipedia page for the required info, but that requires a lot of effort if I have to write it on my own. Another solution could be to leverage a source of structured data that has already pre-processed Wikipedia and given some structure to the information it contains, but I could not find one. A third solution could be to make different queries to Wikipedia, but I cannot figure out a way to extract the information I need via the Wikipedia APIs.

Manu
  • Your question asks for wikipedia, but then you give an example of dbpedia? Unless they are the same thing (source-wise)? – MxLDevs Jul 24 '14 at 15:45
  • I report on dbpedia because it is the closest technology I found to interact with Wikipedia in order to accomplish my task. I am also looking for different options. – Manu Jul 24 '14 at 15:51
  • Why do you need the ontology? Can't you just use the data themselves? – svick Jul 25 '14 at 01:23

2 Answers

5

Data from Wikipedia is being transferred to Wikidata. Using their API you could get what you want. If you want a shortcut, you could use the Wikidata query tool: http://wdq.wmflabs.org/api_documentation.html
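
As a rough illustration of the API route, here is a minimal sketch that only builds a request URL for the Wikidata `wbgetentities` action. The class name `WikidataCityUrl` and the method `build` are made up for this example, and it assumes you already know the Wikidata item IDs of your cities (e.g. Q64 for Berlin, Q90 for Paris). Fetching and parsing the JSON response is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of any library: builds a wbgetentities URL.
class WikidataCityUrl {
    static final String API = "https://www.wikidata.org/w/api.php";

    // entityIds are Wikidata item IDs such as "Q64"; several IDs can be
    // requested at once by joining them with '|'.
    static String build(List<String> entityIds) {
        String ids = URLEncoder.encode(String.join("|", entityIds), StandardCharsets.UTF_8);
        // props=claims asks only for the statements (property/value pairs).
        return API + "?action=wbgetentities&format=json&props=claims&ids=" + ids;
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("Q64", "Q90")));
    }
}
```

In the returned JSON the claims are keyed by property ID, so for the fields in the question you would look for P17 (country), P856 (official website) and P625 (coordinate location). A city that lacks one of them simply has no entry under that key.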

Ainali
0

I'm not a Java guy, but I did something like this in .NET.

You need some kind of web scraping framework.

In .NET there is HtmlAgilityPack: you fetch the page and then, using for example XPath, walk through its elements. Of course, you need to know where on the page the information is, for instance the tags around the heading, the text and so on.

For Java, the frameworks I just found were

  • Tag Soup
  • HtmlUnit
  • Web-Harvest
  • jARVEST
  • jsoup
  • Jericho HTML Parser
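
The XPath idea described above can be sketched in Java with only the standard library. This is an illustration under strong assumptions: the `SAMPLE` markup is a made-up, well-formed stand-in for an infobox, and the class and method names are hypothetical. Real Wikipedia HTML is generally not well-formed XML, so in practice you would first run it through one of the tolerant parsers listed above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: look up an infobox value by its row label via XPath.
class InfoboxXPathSketch {
    // Made-up, simplified markup; the real structure differs from page to page.
    static final String SAMPLE =
        "<table class=\"infobox\">"
      + "<tr><th>Country</th><td>Germany</td></tr>"
      + "<tr><th>Website</th><td><a href=\"https://www.berlin.de\">berlin.de</a></td></tr>"
      + "</table>";

    // Returns the text of the <td> in the row whose <th> equals label,
    // or an empty string when no such row exists.
    // Assumes label is a trusted string that contains no quote characters.
    static String valueFor(String markup, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String expr = "//table[@class='infobox']//tr[th='" + label + "']/td";
            return xpath.evaluate(expr, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath on markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valueFor(SAMPLE, "Country"));
        System.out.println(valueFor(SAMPLE, "Website"));
    }
}
```

A missing row yields an empty string rather than an error, which fits the case where some information is simply absent for a city.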
Jakobbbb