
I want to parse an HTML document from a URL in Java.

When I enter the URL in my browser (Chrome), it doesn't display the HTML page; it downloads it instead.

So the URL is the link behind a "download" button on a webpage. No problem so far. The URL is "https://www.shazam.com/myshazam/download-history"; if I paste it into my browser, it downloads fine. But when I try to download it with Java, I get a 401 (Unauthorized) error.

I checked the Chrome network tool while loading the URL and noticed that my profile-data and registration cookies were passed with the HTTP GET.

I tried a lot of different methods, but none worked. So my question is: how do I reproduce this with Java? How can I get (download) the HTML file and parse it?

Update:

This is what we found so far (thanks to Andrew Regan):

BasicCookieStore store = new BasicCookieStore();
store.addCookie( new BasicClientCookie("profile-data", "value") );  // profile-data
store.addCookie( new BasicClientCookie("registration", "value") );  // registration
Executor executor = Executor.newInstance();
String output = executor.use(store)
            .execute(Request.Get("https://www.shazam.com/myshazam/download-history"))
            .returnContent().asString();

The last line of code seems to cause a NullPointerException. The rest of the code seems to work fine to load non-protected webpages.

Toon Van Eyck

3 Answers


I found the answer myself, using HttpURLConnection. This method can be used to "authenticate" to a variety of services. I used Chrome's built-in network tools to get the cookie values from the GET request.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Send the cookies copied from Chrome along with the GET request, then print the response.
HttpURLConnection con = (HttpURLConnection) new URL("https://www.shazam.com/myshazam/download-history").openConnection();
con.setRequestMethod("GET");
con.addRequestProperty("Cookie", "registration=Cookie_Value_Here; profile-data=Cookie_Value_Here");
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
}
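
If you want something a bit more reusable, here is a rough sketch of the same idea that builds the Cookie header from a map and checks the status code before reading the body. The class name `CookieFetchSketch`, the helper names `cookieHeader` and `fetch`, and the UTF-8 charset are my own assumptions for illustration, not anything Shazam documents; the cookie values are placeholders you would copy from the browser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class CookieFetchSketch {

    // Joins name=value pairs with "; ", the separator used in a Cookie request header.
    static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    // Sends a GET with the given cookies; returns the body on 200, otherwise throws.
    static String fetch(String url, Map<String, String> cookies) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("Cookie", cookieHeader(cookies));
        int status = con.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            con.disconnect();
            throw new IOException("Unexpected HTTP status " + status + " for " + url);
        }
        // Assumption: the response is UTF-8 encoded.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            return in.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("registration", "Cookie_Value_Here"); // placeholder value
        cookies.put("profile-data", "Cookie_Value_Here"); // placeholder value
        System.out.println("Cookie: " + cookieHeader(cookies));
        // Only hit the network when a URL is passed as the first argument.
        if (args.length > 0) {
            System.out.println(fetch(args[0], cookies));
        }
    }
}
```

Running it with the download URL as the first argument should print the HTML if the cookies are still valid; an expired or missing cookie would surface as an IOException carrying the status code instead of a failure deep inside getInputStream().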
Toon Van Eyck

So if you delete those cookies/use a private session, the browser should reproduce what you are seeing in code.

I'm guessing you need to first go to "http://www.shazam.com/myshazam" and log in.

David Cain

The problem with Shazam is that you only need to enter your email address to log in and then open the mail in your browser. It's not the standard username/password kind of deal. – Toon Van Eyck Mar 31 '16 at 11:57

You could try simply adding the cookie values to a GET request using, for example, the HttpClient Fluent API:

CookieStore store = new BasicCookieStore();
store.addCookie( new BasicClientCookie("profile-data", "value") );  // value copied from the browser
store.addCookie( new BasicClientCookie("registration", "value") );  // value copied from the browser

Executor executor = Executor.newInstance();
String output = executor.use(store)
        .execute(Request.Get("https://www.shazam.com/myshazam/download-history"))
        .returnContent().asString();

To parse the result, you could then do:

Document dom = Jsoup.parse(output);
for (Element element : dom.select("tr td")) {
    String eachCellValue = element.text();
    // Whatever
}

(You didn't give any more detail than that)

Andrew Regan