I am creating a Spring application and I have the need to integrate with Wikipedia. In particular, I would like to extract data on a given (large) set of Cities, e.g. country, website and coordinates.
I am trying to understand which libraries or frameworks can be useful, but the big issue I am dealing with is that there is no reference structure for the pages I would like to extract information from. The main problem is not that some information is missing, which would be totally acceptable, but rather the city representation changes from city to city. E.g. the DBPedia ontology (e.g. http://dbpedia.org/ontology/City) does not reflect what I can extract via SPARQL query from dbpedia.org/sparql. This way, I don't know how to extract the data I need systematically (i.e. for my entire set).
Is there any technology that can support my task of extracting data on a predefined set of cities?
One solution could be to put in place some Natural Language Processing in order to seek for the required info in the entire Wikipedia page, but that requires a lot of effort, if I have to write it on my own. Another solution could be leveraging a source of structured data that pre-processed Wikipedia for me and gave some structure to the contained information, but I could not find one. On third solution could be trying to make different queries to Wikipedia, but I cannot figure out a way to extract the information I need via those Wikipedia APIs.