23

I need to encode a String to a byte array using UTF-8. I am using Google Guava, whose Charsets class already defines a Charset instance for UTF-8. I see two ways to do it:

  1. String.getBytes( charsetName )

    try {
        byte[] bytes = my_input.getBytes("UTF-8");
    } catch (UnsupportedEncodingException ex) {
        // Cannot happen for "UTF-8": every Java platform must support it.
    }
    
  2. String.getBytes( Charset object )

    // Charsets.UTF_8 is an instance of Charset
    byte[] bytes = my_input.getBytes(Charsets.UTF_8);
    

My question is: which one should I use? They return the same result. With way 2, I don't have to write a try/catch. I looked at the Java source code and saw that way 1 and way 2 are implemented differently.

Does anyone have any ideas?

LHA
  • Do you get equivalent results from both? If so, I would favor the latter case. If not, you need to decide which you consider to be correct. – merlin2011 Apr 26 '14 at 21:35
  • Yes, they return the same result. But my concern is why they are implemented differently. Why doesn't way 1 call way 2 internally? – LHA Apr 26 '14 at 21:37
  • @Loc What makes you think the former isn't calling the latter internally? (or, that they both wouldn't be calling some other common internal method?) http://www.docjar.com/html/api/java/lang/String.java.html lines 951 - 980 – Brian Roach Apr 26 '14 at 21:42
  • @BrianRoach They both call StringCoding.encode, but way 1 calls it with a String as the first parameter, while way 2 calls it with a Charset instance. If we look at the two versions of this method, they are implemented differently. – LHA Apr 26 '14 at 21:46

4 Answers

25

If you are going to use a string literal (e.g. "UTF-8") ... you shouldn't. Instead use the second version and supply the constant value from StandardCharsets (specifically, StandardCharsets.UTF_8, in this case).

The first version is used when the charset is dynamic. This is going to be the case when you don't know what the charset is at compile time; it's being supplied by an end user, read from a config file or system property, etc.

Internally, both methods are calling a version of StringCoding.encode(). The first version of encode() is simply looking up the Charset by the supplied name first, and throwing an exception if that charset is unknown / not available.
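A minimal sketch of that behaviour, assuming a JVM with `java.nio.charset.StandardCharsets` available (Java 7 or later). The class and method names are hypothetical and only for illustration, and `X-NO-SUCH-CHARSET` is a made-up name chosen because no such charset is expected to be installed. It checks that both overloads yield the same bytes for UTF-8 and that only the name-based overload can fail on an unknown name:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names here are illustrative, not part of the JDK.
class GetBytesComparison {

    // True when the name-based and Charset-based overloads produce identical UTF-8 bytes.
    static boolean sameUtf8Bytes(String text) {
        try {
            byte[] byName = text.getBytes("UTF-8");                   // looks the charset up by name
            byte[] byCharset = text.getBytes(StandardCharsets.UTF_8); // no lookup needed
            return Arrays.equals(byName, byCharset);
        } catch (UnsupportedEncodingException ex) {
            // Not expected: UTF-8 is required on every Java platform.
            throw new AssertionError(ex);
        }
    }

    // True when the name-based overload rejects the given charset name.
    static boolean nameIsRejected(String charsetName) {
        try {
            "x".getBytes(charsetName);
            return false;
        } catch (UnsupportedEncodingException ex) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameUtf8Bytes("h\u00e9llo"));
        System.out.println(nameIsRejected("X-NO-SUCH-CHARSET")); // assumed not installed
    }
}
```

The checked exception only exists on the path that has to resolve a name, which is why the Charset overload needs no try/catch.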

Brian Roach
  • No. Internally, both call StringCoding.encode(), but there are two versions of StringCoding.encode(). Way 1 calls it with a charsetName as the first parameter; way 2 calls it with a Charset instance. The two versions of StringCoding.encode() are implemented differently, and I don't know why. – LHA Apr 26 '14 at 21:51
  • Sorry, I'll edit to clarify - the lookup is happening in `encode()` – Brian Roach Apr 26 '14 at 21:54
12

The first API is for situations when you do not know the charset at compile time; the second one is for situations when you do. Since it appears that your code needs UTF-8 specifically, you should prefer the second API:

    byte[] bytes = my_input.getBytes(Charsets.UTF_8); // UTF-8 is known at compile time

The first API is for situations when the charset comes from outside your program - for example, from the configuration file, from user input, as part of a client request to the server, and so on. That is why there is a checked exception thrown from it - for situations when the charset specified in the configuration or through some other means is not available.
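As a rough illustration of the dynamic case, here is a sketch in which the charset name arrives from outside the program. The class `ConfiguredEncoding`, the `encode` helper, and the use of a command-line argument as the "configuration" source are all assumptions made for this example; wrapping the checked exception in an `IllegalArgumentException` is one possible design choice, not the only one:

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class; not taken from the question or from any library.
class ConfiguredEncoding {

    // charsetName stands in for a value read from a config file, user input, or a request.
    static byte[] encode(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException ex) {
            // The name was not recognised on this JVM: report it as a bad setting.
            throw new IllegalArgumentException(
                    "Unsupported charset in configuration: " + charsetName, ex);
        }
    }

    public static void main(String[] args) {
        // Assumption: the first argument plays the role of the configured charset.
        String configured = args.length > 0 ? args[0] : "ISO-8859-1";
        byte[] bytes = encode("caf\u00e9", configured);
        System.out.println(bytes.length + " bytes using " + configured);
    }
}
```

Here the try/catch earns its place, because the name really can be wrong at run time.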

Sergey Kalinichenko
4

Since they return the same result, you should use method 2: it is generally safer and more efficient to avoid asking the library to parse, and possibly fail on, a user-supplied string. Avoiding the try-catch also makes your own code cleaner.

Charsets.UTF_8 is a Charset object that is checked at compile time, so no charset name has to be looked up at run time; that is most likely why you do not need a try-catch.

merlin2011
3

If you already have the Charset, then use the 2nd version as it's less error prone.

Andres