Selenium 2: Detect content type of link destinations

Question

I am using the Selenium 2 Java API to interact with web pages. My question is: How can I detect the content type of link destinations?

This is the background: Before clicking a link, I want to be sure that the response is an HTML file. If it is not, I need to handle it differently. For example, if there is a download link for a PDF file, the application should read the contents of that URL directly instead of opening it in the browser.

The goal is an application that automatically knows whether the current location is HTML, PDF, XML or something else, so that it can use the appropriate parser to extract useful information from the document.

Update

Added bounty: I will award it to the best solution that lets me get the content type of a given URL.

Anders Lindahl · Accepted Answer · 2011-04-12T18:14:35.723

5

As Jochen suggests, the way to get the Content-Type without also downloading the content is an HTTP HEAD request, and the Selenium WebDriver API does not seem to offer that functionality. You'll have to use another library to fetch the content type of a URL.

A Java library that can do this is Apache HttpComponents, specifically HttpClient.

(The following code is untested)

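import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.impl.client.DefaultHttpClient;

HttpClient httpclient = new DefaultHttpClient();
HttpHead httphead = new HttpHead("http://foo/bar");
HttpResponse response = httpclient.execute(httphead);
Header contenttypeheader = response.getFirstHeader("Content-Type"); // null if the header is absent

System.out.println(contenttypeheader);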

The project publishes JavaDoc for HttpClient; the documentation for the HttpClient interface contains a nice example.
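If adding a dependency is not an option, the same HEAD request can probably be made with the JDK's own `HttpURLConnection`. The following is only a minimal sketch of that alternative, not the HttpClient approach shown above. The class name `ContentTypeProbe`, the helper names `fetchContentType`, `mediaTypeOf` and `isHtml`, and the 5-second timeouts are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper class; all names here are illustrative.
class ContentTypeProbe {

    /** Sends a HEAD request and returns the raw Content-Type header, or null if the server sent none. */
    static String fetchContentType(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");   // ask for headers only, no response body
        conn.setConnectTimeout(5000);    // arbitrary timeouts (assumption)
        conn.setReadTimeout(5000);
        try {
            return conn.getContentType(); // connects implicitly and reads the header
        } finally {
            conn.disconnect();
        }
    }

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the media type. */
    static String mediaTypeOf(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return "";
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String type = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** True when the header describes an HTML or XHTML document. */
    static boolean isHtml(String contentTypeHeader) {
        String type = mediaTypeOf(contentTypeHeader);
        return type.equals("text/html") || type.equals("application/xhtml+xml");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ContentTypeProbe <url>");
            return;
        }
        String header = fetchContentType(args[0]);
        System.out.println(header + " -> html? " + isHtml(header));
    }
}
```

With Selenium, the URL would typically come from the link's `href` attribute before the click. Links for which `isHtml` returns false can then be downloaded and parsed outside the browser. Some servers answer HEAD differently from GET or reject it outright, so treat a missing header as "unknown" rather than "not HTML".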

edited Apr 12 '11 at 18:14

answered Apr 03 '11 at 08:15

Anders Lindahl


I still have issues with that piece of code: entity is always null, even if the response is OK. – Alp Apr 12 '11 at 13:33
It could be that an HttpHead response doesn't contain an `HttpEntity`. I've changed the example to pick up the Content-Type header from the response; still untested though. – Anders Lindahl Apr 12 '11 at 18:15

score 0 · Answer 2 · answered Mar 27 '11 at 16:50


You can figure out the content type while processing the incoming data. I'm not sure why you need to know it first, but if you do, use the HEAD method and look at the Content-Type header.

answered Mar 27 '11 at 16:50

Jochen Bedersdorfer


If I don't figure it out beforehand, Firefox may show a download popup, which I want to avoid. – Alp Mar 27 '11 at 16:51
In that case, HEAD is the way to go. It gives you all the headers you would get from a GET call, without the actual content. – Jochen Bedersdorfer Mar 27 '11 at 16:58
I can't find the appropriate method to get the response header. Remember, I am using Selenium 2. – Alp Mar 27 '11 at 17:08

score 0 · Answer 3 · answered Mar 31 '11 at 18:55


You can retrieve all the URLs from the DOM, and then parse the last few characters of each URL (using a Java regex) to determine the link type.

You can parse the characters following the last dot. For example, in the URL http://yoursite.com/whatever/test.pdf, extract the pdf, and apply your test logic accordingly.
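A rough sketch of that extraction is below. It uses `java.net.URI` and `lastIndexOf` rather than a regex, so that a query string or fragment is not mistaken for part of the file name. The class name `LinkExtension` and the method `extensionOf` are made up for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper; names are illustrative.
class LinkExtension {

    /** Returns the lower-cased extension of the URL's path, or "" when the path has none. */
    static String extensionOf(String url) {
        // getPath() excludes the query string and the fragment.
        // URI.create throws IllegalArgumentException for malformed input.
        String path = URI.create(url).getPath();
        if (path == null) {
            return ""; // opaque URIs such as mailto: have no path
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // Ignore dots in directory names and a trailing dot with nothing after it.
        if (dot <= slash || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("http://yoursite.com/whatever/test.pdf"));           // prints "pdf"
        System.out.println(extensionOf("http://yoursite.com/generateImage.php?name=test")); // prints "php"
    }
}
```

The second call shows the limit of this approach: the extension of a script URL describes the script, not the format of what it returns.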

Am I oversimplifying your problem?

answered Mar 31 '11 at 18:55

rs79


I think this is too simple. Many URLs look like /generateImage.php?name=test, which could return any graphics format. I think I need to fetch the link contents themselves somehow. – Alp Mar 31 '11 at 21:17
