
I'm looking to obtain Wikipedia article titles and locations (lat/long) over a large area, too large for a single URL query such as:

https://en.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=6000&gscoord=51.967818|-3.290105

OR

http://api.geonames.org/wikipediaBoundingBox?north=44.1&south=-9.9&east=-22.4&west=55.2&username=demo

(The second one is better in that it takes a bounding box, whereas the first takes a radius around a point; on the other hand, the first can return its result in JSON (by adding '&format=json'), whereas the second cannot.)

I wouldn't have a problem if there were no limit on the search area of the query, or on the number of results it returns. Is there a way of getting around this?

So I'm looking for help finding a good way of automating this: making lots of bounding-box queries in a grid-like fashion, parsing the data (perhaps using Python), and storing it in my database.

This is some code I've come up with, but I'm stuck:

import urllib2

url = 'http://api.geonames.org/wikipediaBoundingBox?north=%s&south=%s&east=%s&west=%s&username=demo'

# one dict per bounding box
data_coords = [
    {'north': 51.990, 'south': 51.917, 'east': -3.247, 'west': -3.377},
    {'north': 51.990, 'south': 51.917, 'east': -3.117, 'west': -3.247},
    {'north': 51.990, 'south': 51.917, 'east': -2.987, 'west': -3.117},
]

for i in data_coords:
    response = urllib2.urlopen(url % (i['north'], i['south'], i['east'], i['west']))
    print(response.read())
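Eventually I'd also want to parse each response and store it in the database. Something along these lines is roughly what I have in mind for that part, using the Wikipedia geosearch endpoint from above with '&format=json' (the database file, the table schema, and the gslimit value are just placeholders, not anything I've settled on):

import json
import sqlite3
import urllib2

conn = sqlite3.connect('wiki_articles.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles '
             '(pageid INTEGER PRIMARY KEY, title TEXT, lat REAL, lon REAL)')

geo_url = ('https://en.wikipedia.org/w/api.php?action=query&list=geosearch'
           '&gsradius=6000&gscoord=%s|%s&gslimit=500&format=json')

response = urllib2.urlopen(geo_url % (51.967818, -3.290105))
results = json.load(response)

# geosearch results come back under query -> geosearch as a list of dicts
for page in results.get('query', {}).get('geosearch', []):
    conn.execute('INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)',
                 (page['pageid'], page['title'], page['lat'], page['lon']))
conn.commit()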

Help would be appreciated, thanks!

James

1 Answer


I love the question. Hope this helps:

def get_grids(area, divisions):
    # area is [north, east, south, west]; each division splits every box into four
    north, east, south, west = area
    mid_lat = north + (south - north) / 2
    mid_lon = east + (west - east) / 2
    if divisions:
        # left top
        get_grids([north, east, mid_lat, mid_lon], divisions - 1)
        # right top
        get_grids([north, mid_lon, mid_lat, west], divisions - 1)
        # left bottom
        get_grids([mid_lat, east, south, mid_lon], divisions - 1)
        # right bottom
        get_grids([mid_lat, mid_lon, south, west], divisions - 1)
    else:
        # request area here
        print(area)

# north, east, south, west
main_area = [10.0, 10.0, 20.0, 20.0]

get_grids(main_area, 1)

You'll have to supply main_area, which is your starting area. After that you can do the REST call where the print is now (see the sketch below the example output).

For example, for input: main_area = [10.0, 10.0, 20.0, 20.0]

and divisions = 2 (each division splits every box into four, so you get 4**divisions boxes)

it outputs:

[10.0, 10.0, 12.5, 12.5]
[10.0, 12.5, 12.5, 15.0]
[12.5, 10.0, 15.0, 12.5]
[12.5, 12.5, 15.0, 15.0]
[10.0, 15.0, 12.5, 17.5]
[10.0, 17.5, 12.5, 20.0]
[12.5, 15.0, 15.0, 17.5]
[12.5, 17.5, 15.0, 20.0]
[15.0, 10.0, 17.5, 12.5]
[15.0, 12.5, 17.5, 15.0]
[17.5, 10.0, 20.0, 12.5]
[17.5, 12.5, 20.0, 15.0]
[15.0, 15.0, 17.5, 17.5]
[15.0, 17.5, 17.5, 20.0]
[17.5, 15.0, 20.0, 17.5]
[17.5, 17.5, 20.0, 20.0]
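
Where the print is now, you could drop in the actual request. Just as a sketch (request_area is a name I made up, and it reuses the GeoNames endpoint from your question; the demo account is rate-limited, so your own username would go there):

import urllib2

def request_area(area):
    # area arrives as [north, east, south, west], the same order get_grids uses
    north, east, south, west = area
    url = ('http://api.geonames.org/wikipediaBoundingBox'
           '?north=%s&south=%s&east=%s&west=%s&username=demo'
           % (north, south, east, west))
    return urllib2.urlopen(url).read()

Then in get_grids, replace print(area) with a call to request_area(area) and do your parsing/storing on the result.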
Martin Gottweis
  • Thanks for the reply. Hmm, I'm not sure I understand what you mean; how does this link to the URL query? – James May 19 '16 at 23:03
  • I figured the question was how to get 8 small bounding boxes out of 1 large bounding box. Is your question more related to how to actually get the json? – Martin Gottweis May 20 '16 at 06:10
  • Not quite, I want to get several more bounding boxes of a similar size. The list of coords I've given shows just 3 of these, but I'm looking to do 50+ – James May 20 '16 at 21:22
  • That's exactly what my code generates out of one large bounding box: get_grids(main_area, 3) will get you 64 bounding boxes. – Martin Gottweis May 21 '16 at 05:30