Query a webpage for a variable entity using a consistent URL structure

Question

Would someone please help me understand how I can query this webpage from within my program?

There are two parameters that need to be set:

"Site:" is where you enter the language and site code.

"Page:" is where you must put the exact title of the page as it appears on the connected site.

The URLs always look like this:

https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=Mikhail+Bakunin&submit=Search

https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=Thomas+Edward+Lawrence&submit=Search

and the language is always English, so you see, it's just:

https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=Blah+Blah&submit=Search

The objective of querying that page is to retrieve the ID associated with the page, so for Mikhail Bakunin it's Q27645 and for T. E. Lawrence it's Q170596.

It becomes part of the URL once the page is reached:

https://www.wikidata.org/w/index.php?title=Q170596&site=en&page=Thomas+Edward+Lawrence&submit=Search

But maybe I could also scrape it from the page, using BeautifulSoup or something? (That's a guess.)

The program needs to be generalizable: the name of the entity we're searching for is a variable that changes as the program runs, so that needs to be taken into account.

I guess using Python or PHP or something would not be a crime against humanity if it's easier, though I prefer Java.

update:

import java.net.*;
import java.io.*;

public class URLConnectionReader 
{
    public static void main(String[] args) throws Exception 
    {
        URL site = new URL("https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=Mikhail+Bakunin&submit=Search");
        URLConnection yc = site.openConnection();
        BufferedReader in = new BufferedReader(
                                new InputStreamReader(
                                yc.getInputStream()));
        String inputLine;

        while ((inputLine = in.readLine()) != null) 
            System.out.println(inputLine);
        in.close();
    }
}

This sort of works, but the result is quite messy.

I guess I could grab it out of this thing:

<!-- wikibase-toolbar --><span class="wikibase-toolbar-container"><span class="wikibase-toolbar-item wikibase-toolbar ">[<span class="wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit"><a href="/wiki/Special:SetSiteLink/Q27645">edit</a></span>]</span></span>

but how?

I'm not sure I understand... wouldn't a simple HTTP request to the first URL give you all the information? — jamp, Apr 17 '15 at 10:52
can you show me an example though or a reference I can look at? I have no experience with this — smatthewenglish, Apr 17 '15 at 10:56
https://www.google.com/search?q=java+http+request&gws_rd=cr,ssl&ei=1eYwVYrZLMORsAG92IH4Cw — jamp, Apr 17 '15 at 10:56

RobIII · Accepted Answer · 2015-04-17T15:02:57.717

When you request this URL the response is:

HTTP/1.1 302 forced.302
Server: Apache
X-Powered-By: HHVM/3.3.1
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Vary: Accept-Encoding,X-Forwarded-Proto,Cookie
X-Content-Type-Options: nosniff
Location: http://www.wikidata.org/w/index.php?title=Q27645&site=en&page=Mikhail+Bakunin&submit=Search
Content-Type: text/html; charset=utf-8
X-Varnish: 1641959068, 1690824779, 1606045625
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Transfer-Encoding: chunked
Date: Fri, 17 Apr 2015 11:49:55 GMT
Age: 0
Connection: keep-alive
X-Cache: cp1054 miss (0), cp3003 miss (0), cp3013 frontend miss (0)
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Set-Cookie: GeoIP=NL:XXX:51.4400:5.6194:v4; Path=/; Domain=.wikidata.org

So there's a 302 redirect in the HTTP response headers. That's where you'll want to grab your Q-number. Simply extract it from the Location header with a regex like:

^Location:.*?title=(Q[0-9]+)

...and use match group 1 (which should be Q27645).
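As a rough illustration, here is how that pattern could be applied in Java to a raw header line. The class and method names (`LocationHeaderRegex`, `extractItemId`) are made up for this sketch, and it assumes you have the whole `Location: ...` line as a string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocationHeaderRegex {
    // Same pattern as above: anchored on the header name, lazily skipping to "title=".
    private static final Pattern ITEM_ID =
            Pattern.compile("^Location:.*?title=(Q[0-9]+)");

    /** Returns the Q-number from a raw "Location: ..." header line, or null if there is none. */
    static String extractItemId(String headerLine) {
        Matcher m = ITEM_ID.matcher(headerLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "Location: http://www.wikidata.org/w/index.php"
                + "?title=Q27645&site=en&page=Mikhail+Bakunin&submit=Search";
        System.out.println(extractItemId(line)); // prints Q27645
    }
}
```

If you only have the header value (without the `Location:` name in front), the `^Location:` part of the pattern has to be dropped.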

To grab the HTTP headers, have a look at this page; it basically goes like:

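import java.net.HttpURLConnection;
import java.net.URL;

URL obj = new URL("https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=Mikhail%20Bakunin&submit=Search");
HttpURLConnection conn = (HttpURLConnection) obj.openConnection();

// don't follow the 302 automatically, or the headers you read may belong to the final page
conn.setInstanceFollowRedirects(false);

// get header by 'key'
String location = conn.getHeaderField("Location");

// TODO: regex here (location holds only the header value, without the "Location:" name)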
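Putting the pieces together for a variable entity name, a minimal sketch could look like the following. Treat it as an illustration under assumptions, not a definitive implementation: the class and method names are hypothetical, the User-Agent string is a placeholder (Wikimedia servers may expect a descriptive one), and it assumes the server still answers with a redirect whose `Location` contains `title=Q...`. Because `getHeaderField` returns only the header value, the pattern here matches on `title=` directly.

```java
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikidataItemLookup {
    private static final String BASE =
            "https://www.wikidata.org/wiki/Special:ItemByTitle";
    // Matches the title parameter in the redirect target, e.g. "...?title=Q27645&site=en..."
    private static final Pattern ITEM_ID = Pattern.compile("[?&]title=(Q[0-9]+)");

    /** Builds the query URL for an arbitrary page title (the language is fixed to "en"). */
    static String buildUrl(String pageTitle) {
        try {
            // URLEncoder turns spaces into '+', matching the URLs shown in the question.
            return BASE + "?site=en&page=" + URLEncoder.encode(pageTitle, "UTF-8")
                    + "&submit=Search";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Pulls the Q-number out of a Location header value; null if there is none. */
    static String extractItemId(String location) {
        if (location == null) {
            return null;
        }
        Matcher m = ITEM_ID.matcher(location);
        return m.find() ? m.group(1) : null;
    }

    /** Requests the page without following the redirect and reads the Q-number from it. */
    static String lookup(String pageTitle) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(buildUrl(pageTitle)).openConnection();
        conn.setInstanceFollowRedirects(false); // we want the 302 response itself
        // Placeholder value (assumption): replace with something that identifies your program.
        conn.setRequestProperty("User-Agent", "ItemLookupExample/0.1");
        try {
            return extractItemId(conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as one page title.
        for (String title : args) {
            System.out.println(title + " -> " + lookup(title));
        }
    }
}
```

Run it with the page titles as arguments, for example `java WikidataItemLookup "Mikhail Bakunin" "Thomas Edward Lawrence"`. A `null` result means no redirect with a Q-number came back, which is what you would expect when the title does not exist on the connected site.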
