Read and download a page's source as Unicode in Java

Question

Right now, I have some code that reads a page and saves everything to an html file. However, there are some problems... some punctuation and special characters show up as question marks.

Of course, if I do this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI. I looked around, and all I see about this is complaining that it's impossible in Java or half explanations that I don't understand...

In any case, can anyone help me correct the question marks? Here is the part of my code that downloads the page. (The lister creates an array of urls to download, to be used with sites with pages. You can ignore that, it works fine.)

public void URLDownloader(String site, int startPage, int endPage) throws Exception {
  String[] pages = URLLister(site, startPage, endPage);
  String webPage = pages[0];
  int fileNumber = startPage;
  if (startPage == 0)
    fileNumber++;

  //change pages
  for(int i = 0; i < pages.length; i++) {
    webPage = pages[i];
    URL url= new URL(webPage);
    BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream()));
    PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html");
    String inputLine;

    //while stuff to read on current page
    while ((inputLine = in.readLine()) != null) {
      out.println(inputLine); //write line of text
    }
    out.close();    //end writing text
    if (startPage == 0)
      startPage++;
    console.append("Finished page " + startPage + "\n");
    startPage++;
  }
}

What language is this? Java? – John Saunders, Jul 16 '13 at 05:25
Whoops, sorry. Yes, it's Java. Edited the title. – user2585824, Jul 16 '13 at 05:26
Better - I added the [tag:java] tag. – John Saunders, Jul 16 '13 at 09:24

score 2 · Accepted Answer · edited May 23 '17 at 11:45

if I do this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI

Windows is giving you misleading terminology here. There is no such encoding as ‘Unicode’; Unicode is the character set which is encoded in different ways into bytes. The encoding that Windows calls ‘Unicode’ is actually UTF-16LE. This is a two-byte-per-code-unit encoding that is not ASCII compatible and is generally inconvenient; Web pages tend not to work well with it.

(For what it's worth the ‘ANSI’ code page isn't anything to do with ANSI either. Plus ça change...)
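To make the "one character set, several encodings" point concrete, here is a small sketch that encodes the same character three ways. The class name EncodingDemo and the helper hex are made up for illustration. The last case shows one way a question mark can arise: encoding a character into a charset that has no byte for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's code.
final class EncodingDemo {

    // Encodes the text with the given charset and returns the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // RIGHT SINGLE QUOTATION MARK, a typical "smart" apostrophe

        System.out.println(hex(quote, StandardCharsets.UTF_8));      // E2 80 99
        System.out.println(hex(quote, StandardCharsets.UTF_16LE));   // 19 20 (what Windows calls "Unicode")
        // ISO-8859-1 has no byte for U+2019, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(quote, StandardCharsets.ISO_8859_1)); // 3F
    }
}
```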

PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html");

This creates a file using the Java default encoding, which is likely the ANSI code page in your case. To specify a different encoding, use the optional second argument to PrintWriter:

PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html", "utf-8");

UTF-8 is usually a good choice: being a UTF it can store all Unicode characters, and it's ASCII-compatible too.

However! You are also reading in the string using the default encoding:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

which probably isn't the encoding of the page. Again, you can specify the encoding using an optional parameter:

BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "utf-8"));

and this will work fine if the web page was actually served as UTF-8.
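Putting the two changes together, a minimal sketch might look like the following. The class and method names (Utf8PageSaver, save) are invented for the example, and it assumes the page really is UTF-8. It also uses try-with-resources (Java 7+) so both the reader and the writer get closed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;

// Hypothetical helper, not part of the asker's code.
final class Utf8PageSaver {

    // Decodes the stream as UTF-8 and writes it to fileName, again as UTF-8.
    // Assumption: the source bytes are UTF-8. If they are not, characters
    // will still come out wrong.
    static void save(InputStream source, String fileName) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(source, "UTF-8"));
             PrintWriter out = new PrintWriter(fileName, "UTF-8")) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line); // note: rewrites line endings to the platform default
            }
        }
    }
}
```

Inside the loop from the question this would be called as Utf8PageSaver.save(url.openStream(), name + (fileNumber+i) + ".html").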

But what if it wasn't? There are actually multiple ways the encoding of an HTML page can be determined:

1. From the Content-Type: text/html;charset=... header parameter, if present.
2. From the <?xml declaration, if it's served as application/xhtml+xml.
3. From the <meta> equivalent tag in the page, if (1) and (2) were not present.
4. From browser-specific guessing heuristics, which may depend on user settings.

You can get (1) by reading url.openConnection().getContentType() and parsing out the parameter. To get (2) or (3) you have to actually parse the file, which is kind of bad news. (4) is out of reach.
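For the header case, parsing out the parameter could be sketched roughly like this. ContentTypeCharset and charsetOf are made-up names, and the parsing is deliberately simple: it splits on semicolons, so a quoted value containing a semicolon would trip it up.

```java
// Hypothetical helper, not part of the asker's code.
final class ContentTypeCharset {

    // Returns the charset parameter of a Content-Type value such as
    // "text/html; charset=UTF-8", or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null; // the server sent no Content-Type header
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

The input would come from URLConnection.getContentType(). A non-null result can be passed as the second argument to InputStreamReader, and a null result means falling back to (2) to (4).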

Probably the most consistent thing you can do is just what web browsers (except IE) do when they save a standalone web page to disc: take the exact original bytes that were served and put them straight into a file without any attempt to decode them. Then you don't have to worry about encodings or line ending changes. It does mean any charset metadata in the HTTP headers gets lost, but there's not really much you can do about that short of parsing the HTML and inserting a <meta> tag yourself (probably far too much faff).

try (InputStream in = url.openStream();
     OutputStream out = new FileOutputStream(name + (fileNumber+i) + ".html")) {
    // Copy the raw bytes unchanged; nothing is decoded, so no charset is involved.
    byte[] buffer = new byte[1024*1024];
    int len;
    while ((len = in.read(buffer)) != -1) {
        out.write(buffer, 0, len);
    }
}

(N.B. The buffer copy loop comes from another question, which also offers alternatives such as IOUtils.)
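If Java 7 or later is available, the standard library's Files.copy can stand in for the hand-written loop. This is only a sketch of that alternative, with invented names (RawPageSaver, saveRaw); like the loop, it writes the served bytes out untouched.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the asker's code.
final class RawPageSaver {

    // Copies every byte from source to target without decoding anything,
    // and returns the number of bytes written. Closes source when done.
    static long saveRaw(InputStream source, Path target) throws IOException {
        try (InputStream in = source) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

In the question's loop that would be RawPageSaver.saveRaw(url.openStream(), Paths.get(name + (fileNumber+i) + ".html")), with java.nio.file.Paths imported.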
