11

I want to encode a file (it may be an image or a PDF) and send it to a server. Which type of encoding and decoding should I use? (Both the server and the client belong to our company, so we can write logic on both sides.) UTF-8 encoding is supported by default in Java, whereas to use Base64 encoding I have to import an external jar. For simple text, both approaches work fine. I am using TCP socket programming.

Using UTF-8 Encoding

    import java.net.URLDecoder;
    import java.net.URLEncoder;

    String str = "This is my Sample application";
    String urlEncodedData = URLEncoder.encode(str, "UTF-8"); // URL-encode, using UTF-8 for the underlying bytes
    System.out.println("..after URL encoding..." + urlEncodedData);
    String retrievedData = URLDecoder.decode(urlEncodedData, "UTF-8"); // URL-decode, using UTF-8
    System.out.println("..after decoding..." + retrievedData);

Using Base64 (using Apache's commons-codec jar)

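    import org.apache.commons.codec.binary.Base64;

    byte[] encoded = Base64.encodeBase64(str.getBytes("UTF-8")); // Encode the UTF-8 bytes of str as Base64
    byte[] decoded = Base64.decodeBase64(encoded); // Decode the Base64 back to the original bytes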
  • 8
    You're comparing apples and pears. Base64 is just a number base in which to express data. UTF-8 is an encoding scheme that encodes numbers (thought of as codepoints) in a byte stream. – Kerrek SB Jul 22 '11 at 15:09
  • See the question [here](http://stackoverflow.com/q/3866316/3009). It's tagged as C# but the encoding information applies the same way. – highlycaffeinated Jul 22 '11 at 15:10
  • 1
    Why do you want/need to encode the binary files (PDF and images)? Can't you just send it to the server? – Arsen7 Jul 22 '11 at 15:11
  • It is not only about PDFs; I have image files too. If the file is big, I send it in chunks. –  Jul 22 '11 at 15:29

1 Answer

56

UTF-8 is a text encoding - a way of encoding text as binary data.

Base64 is in some ways the opposite - it's a way of encoding arbitrary binary data as ASCII text.

If you need to encode arbitrary binary data as text, Base64 is the way to go - you mustn't try to treat arbitrary binary data as if it's UTF-8 encoded text data.
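To make the difference concrete, here is a minimal sketch comparing the two round trips on a few bytes that are not valid UTF-8. It assumes Java 8 or later, where `java.util.Base64` ships with the JDK (the question used the commons-codec jar instead); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryAsTextDemo {

    // Treats arbitrary bytes as UTF-8 text and converts back.
    // Invalid sequences are replaced with U+FFFD while decoding, so data is lost.
    public static byte[] roundTripViaUtf8(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes arbitrary bytes as Base64 (ASCII) text and decodes them again.
    public static byte[] roundTripViaBase64(byte[] data) {
        String asText = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(asText);
    }

    public static void main(String[] args) {
        // Sample bytes chosen for illustration; 0x89 and 0xFF are not valid UTF-8 on their own.
        byte[] data = {(byte) 0x89, 'P', 'N', 'G', (byte) 0xFF, 0x00};

        System.out.println("UTF-8 round trip intact:  "
                + Arrays.equals(data, roundTripViaUtf8(data)));
        System.out.println("Base64 round trip intact: "
                + Arrays.equals(data, roundTripViaBase64(data)));
    }
}
```

The UTF-8 round trip changes the bytes because the decoder substitutes a replacement character for each invalid sequence, while the Base64 round trip returns exactly what went in.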

However, you may well be able to transfer the file to the server just as binary data in the first place - it depends on what transport you're using.
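If the transport is a plain byte stream, one possible approach is to skip text encoding entirely and write the raw bytes with a length prefix. The sketch below is only an illustration: the 4-byte length framing and the `send`/`receive` names are assumptions, not something the question specifies, and in-memory streams stand in for the socket's streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

class RawBinaryTransfer {

    // Writes a 4-byte length prefix followed by the raw bytes (hypothetical framing).
    public static void send(OutputStream out, byte[] fileBytes) throws IOException {
        DataOutputStream dataOut = new DataOutputStream(out);
        dataOut.writeInt(fileBytes.length);
        dataOut.write(fileBytes);
        dataOut.flush();
    }

    // Reads the length prefix, then exactly that many bytes.
    public static byte[] receive(InputStream in) throws IOException {
        DataInputStream dataIn = new DataInputStream(in);
        int length = dataIn.readInt();
        byte[] fileBytes = new byte[length];
        dataIn.readFully(fileBytes);
        return fileBytes;
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes for illustration only.
        byte[] original = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

        // In-memory streams stand in for the socket's output and input streams.
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        send(wire, original);
        byte[] received = receive(new ByteArrayInputStream(wire.toByteArray()));

        System.out.println("Received intact: " + Arrays.equals(original, received));
    }
}
```

With a real connection, the same two methods could be given `socket.getOutputStream()` on the client and `socket.getInputStream()` on the server. An `int` length limits a single message to about 2 GB, so very large files would still need to be sent in chunks.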

Jon Skeet
  • I am using TCP socket programming. –  Jul 22 '11 at 15:14
  • 1
    @Deepakkk: Well I'm sure you're using *some* protocol that's slightly higher level than that... depending on what the application protocol is, you may or may not need to perform binary to text encoding. – Jon Skeet Jul 22 '11 at 15:16
  • @JonSkeet Why can't we try to treat arbitrary binary data as if it's UTF-8 while Base64 is assuming the bytes are encoded in ASCII? – sarahTheButterFly Feb 28 '13 at 03:59
  • 4
    @sarahTheButterFly: Not every byte sequence is valid UTF-8-encoded text. There are rules around what's allowed - look up the Wikipedia UTF-8 article to find out the details. Even if every byte sequence *were* valid, you'd find that a lot of the characters produced might be hard to transmit over many transports, whereas Base64 uses only non-control characters within ASCII, which are generally easy to transmit. – Jon Skeet Feb 28 '13 at 06:52
  • @JonSkeet "UTF-8 is a text encoding - a way of encoding **text as binary data** " On running UTF-8 encoding in Python `"test".encode("utf-8")`, it returns a byte stream, so I deduce UTF-8 encoding is used for *String to Bytes encoding* as opposed to *Text to Binary*. I see File modes (Text, Binary) to be very different from String, Bytes encoding; correct me if I am wrong. – CᴴᴀZ Jul 26 '18 at 07:47
  • 1
    @CᴴᴀZ What do you see the difference as? String is a type representing text. Bytes are binary data. "String to bytes" and "Text to binary" are the same thing. – Jon Skeet Jul 26 '18 at 12:03
  • @JonSkeet `Text, Binary` as in case of Files and `Bytes` as in case of streams. For example, [a Text file can be read as a Byte stream](https://www.mkyong.com/java/how-to-convert-file-into-an-array-of-bytes/) but a Binary file can't be read as a Char (String) stream. I see them to be *similar* but I doubt that they are the *same thing*, hence need your help. :) [Ref. on Text, Binary and Byte streams.](https://msdn.microsoft.com/en-us/library/cx3c1zs4.aspx) – CᴴᴀZ Jul 26 '18 at 12:43
  • @CᴴᴀZ There's no such thing as a "text file" distinct from a "binary file". All files are just binary streams on disk, *possibly* with supporting metadata depending on the file system. A single file *could* be a valid image in some format as well as a meaningful text file, for example. – Jon Skeet Jul 26 '18 at 17:30
  • Could it make sense to encode everything as base 64 at the end, including text that was already UTF-8 encoded, as a way to make sure it can be transmitted correctly across the protocols and there won't be any control characters interpreted strangely? I mean, if UTF-8 encodes text as binary, at that point you have binary data, which as you say, may not be safe to transmit. So then you take that binary data and base 64 encode it before sending it. Then at the other end, it would first base 64 decode it, then treat it as UTF-8 encoded. Might this be necessary or is it over-engineering? – still_dreaming_1 Nov 22 '21 at 18:50
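The layering described in the last comment (text to UTF-8 bytes, then Base64 for transport, reversed on the other end) can be sketched as below. This only illustrates the mechanics and does not settle whether the extra step is needed for a given protocol; it assumes Java 8+ for `java.util.Base64`, and the method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class LayeredEncodingDemo {

    // Sender side: text -> UTF-8 bytes -> Base64 (ASCII-only) string.
    public static String encodeForTransport(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Receiver side: Base64 string -> UTF-8 bytes -> text.
    public static String decodeFromTransport(String base64) {
        byte[] utf8 = Base64.getDecoder().decode(base64);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample text with non-ASCII characters, written as Unicode escapes.
        String original = "Gr\u00fc\u00dfe, \u4e16\u754c";

        String wire = encodeForTransport(original);
        String restored = decodeFromTransport(wire);

        System.out.println("On the wire: " + wire);
        System.out.println("Restored equals original: " + original.equals(restored));
    }
}
```

Base64 produces four output characters for every three input bytes, so the payload grows by roughly a third. That cost is worth weighing when the transport can already carry raw bytes.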