
I am using Webharvest to download a file from a website and save it under its original file name.

The Java code that I am working with is:

import java.io.BufferedReader;

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();

BufferedReader br = null;
StringBuffer result = new StringBuffer();
String attachName;

GetMethod method = new GetMethod(attachmentLink.toString());

int returnCode = client.executeMethod(method);

// getResponseHeader returns a single Header (getResponseHeaders returns an array)
Header header = method.getResponseHeader("Content-Disposition");
attachName = header.getValue();
// Round trip through the platform default charset; this does not restore the lost character
attachName = new String(attachName.getBytes());

The result in webharvest is:

attachment; filename="Resoluci�n sobre Mesas de Contrataci�n.pdf"

I can't get it to preserve the letter ó.

After I got the value of the header Content-Disposition into variable attachName, I also tried to decode it, but with no luck:

String attachNamef = URLEncoder.encode(attachName, "ISO-8859-1");
attachNamef = URLDecoder.decode(attachNamef, "UTF-8");

I was able to determine that the response charset is ISO-8859-1 by calling:

method.getResponseCharSet()

P.S. When I inspect the headers in Firefox with Firebug, the Content-Disposition value is correct:

attachment; filename="Resolución sobre Mesas de Contratación.pdf"

Julian Reschke
linderman
  • Note that the response charset refers to the payload, not the header fields. Also note that you seem to be using a very obsolete version of the HTTP components. Finally, the server response is invalid; non-ASCII characters are not allowed here; see RFC 6266. – Julian Reschke Jan 16 '17 at 18:31
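For reference, RFC 6266 lets a server send non-ASCII file names through the `filename*` parameter, whose value uses the RFC 5987 form `charset'language'percent-encoded-bytes`. Below is a minimal sketch of decoding such a value with the standard library only. The class and method names (`ExtValueDemo`, `decodeExtValue`) are hypothetical, the sample value is made up for illustration, and the server in the question does not send this form, so this only applies if the server can be fixed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class ExtValueDemo {

    // Hypothetical helper: decodes an RFC 5987 ext-value such as
    // UTF-8''Resoluci%C3%B3n.pdf into a Java String.
    public static String decodeExtValue(String extValue) {
        int first = extValue.indexOf('\'');
        int second = extValue.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException("Not an ext-value: " + extValue);
        }
        Charset charset = Charset.forName(extValue.substring(0, first));
        String encoded = extValue.substring(second + 1);

        // Collect raw bytes first, then decode them with the declared charset.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                // Unescaped characters are plain ASCII in a valid ext-value.
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Sample value invented for this sketch.
        System.out.println(decodeExtValue("UTF-8''Resoluci%C3%B3n%20sobre%20Mesas.pdf"));
    }
}
```

A hand-written percent decoder is used here instead of `URLDecoder.decode` because `URLDecoder` also turns `+` into a space, which is not part of the RFC 5987 encoding.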

1 Answer

Apache HttpClient doesn't support non-ASCII characters in HTTP headers. From the documentation:

The headers of a HTTP request or response must be in US-ASCII format. It is not possible to use non US-ASCII characters in the header of a request or response. Generally this is not an issue however, because the HTTP headers are designed to facilitate the transfer of data rather than to actually transfer the data itself. One exception however are cookies. Since cookies are transferred as HTTP Headers they are confined to the US-ASCII character set. See the Cookie Guide for more information.
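To illustrate why re-encoding the header value afterwards cannot help, here is a small standard-library sketch. It assumes the server puts the file name on the wire as ISO-8859-1 bytes, which matches the symptom in the question but is not confirmed. The class and method names (`HeaderCharsetDemo`, `decode`, `reencode`) and the shortened sample header are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HeaderCharsetDemo {

    // Assumed wire bytes: the header value encoded by the server as ISO-8859-1.
    public static byte[] sampleWireBytes() {
        return "attachment; filename=\"Resoluci\u00F3n.pdf\""
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decodes raw header bytes with the given charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // The encode/decode round trip attempted in the question.
    public static String reencode(String value) {
        try {
            return URLDecoder.decode(URLEncoder.encode(value, "ISO-8859-1"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = sampleWireBytes();

        // Decoding as US-ASCII replaces the 0xF3 byte with U+FFFD; the original byte is gone.
        String asAscii = decode(wire, StandardCharsets.US_ASCII);
        System.out.println(asAscii);

        // Re-encoding the already damaged string cannot bring the letter back.
        System.out.println(reencode(asAscii));

        // Decoding the same bytes as ISO-8859-1 keeps the letter intact.
        System.out.println(decode(wire, StandardCharsets.ISO_8859_1));
    }
}
```

The point is that the fix has to happen where the header bytes are turned into a string, not afterwards. In HttpClient 3.x that charset is, as far as I know, controlled by the `http.protocol.element-charset` parameter (`HttpMethodParams.setHttpElementCharset`), so setting it to ISO-8859-1 before calling `executeMethod` may be worth trying; treat that as an assumption to verify against your version.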

bsiamionau