Why is my text file larger than my binary file?

Question

I'm trying to write a large text file to a binary file, but the binary file is the same size as my text file. I thought that writing to a binary file would compress it? Is writing to a binary file just more efficient? How can I minimize the storage of my text file for use?

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;

// Copies words.txt byte for byte into words.ser, then prints both sizes.
File f = new File("words.txt");
File file = new File("words.ser"); // same name for writing and for measuring

byte[] buffer = new byte[8192]; // or more, or even less, anything > 0
int count;
try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(f));
     DataOutputStream out = new DataOutputStream(
                            new BufferedOutputStream(
                            new FileOutputStream(file)))) {
    while ((count = in.read(buffer)) > 0) {
        out.write(buffer, 0, count);
    }
}
System.out.println(f.length());
System.out.println(file.length());

Why would writing bytes to a file compress it? Compression only happens if there is code to do the compression. You don't have any such code. --- Also, you're reading your text file as a binary file, so you're simply doing a binary copy of the file. — Andreas, Sep 20 '20 at 00:50
All information is binary. "Text" as it is referred to is simply a description of how the internal bits are organized to reflect readable characters. If you try to print an executable file, some of its bytes will be interpreted as text even though they probably aren't. — WJS, Sep 20 '20 at 00:50
The file size being smaller has nothing to do with compression, though. Compression is when an algorithm uses optimized techniques to store data in less space. Read this for more information about compression: https://www.techopedia.com/definition/892/file-compression#:~:text=File%20compression%20is%20a%20data%20compression%20method%20in,a%20size%20substantially%20smaller%20than%20the%20original%20file. — RIVERMAN2010, Sep 20 '20 at 00:52
@RIVERMAN2010 If you write the alphabet to a text file and look at its size, it will be 26. I never heard of the *eof* char being used for this. Typically the end of the file is determined by the driver when it hits the end as described by a File Allocation Table entry or some other equivalent in a different file system. — WJS, Sep 20 '20 at 00:56
Your title says the 'text' file is larger than the 'binary' file; the body of the question says they're the same size. Confusion? — J.Backus, Sep 20 '20 at 01:32

score 3 · Answer 1 · answered Sep 20 '20 at 00:56

To compress a file, you can e.g. gzip it.

In Java, you can do that like this:

Path inFile = Paths.get("words.txt");
Path outFile = Paths.get("words.txt.gz");
try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(outFile))) {
    Files.copy(inFile, out);
}
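
For a rough idea of how much this saves, here is a self-contained sketch built around the same GZIPOutputStream call. The class name GzipSizeDemo, the helper gzip, and the generated sample data are all made up for illustration (they stand in for your words.txt), and String.repeat needs Java 11 or later.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class GzipSizeDemo {

    // Gzips inFile into outFile and returns the compressed size in bytes.
    // The stream is closed before the size is read, so the gzip trailer is included.
    public static long gzip(Path inFile, Path outFile) throws IOException {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(outFile))) {
            Files.copy(inFile, out);
        }
        return Files.size(outFile);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for words.txt: a temp file of repeated words.
        Path inFile = Files.createTempFile("words", ".txt");
        Path outFile = Files.createTempFile("words", ".txt.gz");
        String sample = "apple\nbanana\ncherry\n".repeat(10_000);
        Files.write(inFile, sample.getBytes(StandardCharsets.US_ASCII));

        long compressed = gzip(inFile, outFile);
        System.out.println("original:   " + Files.size(inFile) + " bytes");
        System.out.println("compressed: " + compressed + " bytes");
    }
}
```

The sample here is extremely repetitive, so it shrinks far more than a real word list would. Treat the numbers as an illustration, not a prediction.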

score 3 · Accepted Answer · answered Sep 20 '20 at 01:43

You're confused.

There's no such thing as a 'text' file or a 'binary' file, at least, to a hard disk / a filesystem. It's a bag of bytes. They all are. Just.. bytes.

Now, if the bytes so happen to form a sequence that, say, Microsoft Word will correctly read in if you pick that file from its 'file open' menu, we may say 'this is a Word file'. The filesystem cares absolutely nothing whatsoever for such frivolous human things. It was asked to provide the bytes in a file named 'foo.doc' and it did so. It did so in the exact, precise same fashion it would have done had Word asked the filesystem to give it the bytes from 'foo.txt' or 'foo.jpg'. It's up to Word to crash if the bytes don't make sense to it.

So, what's a 'text file'? Same deal applies: if a text editing tool asks the file system to open a file, and it 'works', I guess we can call it a text file. To the file system, it's.. just a file.

And now you know why writing the file through an OutputStream or a BufferedWriter or whatnot makes no difference. That only changes the precise mechanism by which the characters end up in byte form. Assuming it's simple ASCII characters, it's 1 byte per character, simple as that.
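
If you want to convince yourself, here is a small sketch that pushes the same ASCII text through a raw byte stream and through a BufferedWriter, then compares the resulting bytes. The class and method names (SameBytesDemo, viaOutputStream, viaWriter) are invented for this example, and it writes to an in-memory ByteArrayOutputStream instead of a real file to keep it self-contained.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameBytesDemo {

    // Writes the text through a plain byte stream.
    public static byte[] viaOutputStream(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(text.getBytes(StandardCharsets.US_ASCII));
        return sink.toByteArray();
    }

    // Writes the same text through a character-oriented writer.
    public static byte[] viaWriter(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(sink, StandardCharsets.US_ASCII))) {
            w.write(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        byte[] a = viaOutputStream(alphabet);
        byte[] b = viaWriter(alphabet);
        System.out.println(a.length + " bytes, identical: " + Arrays.equals(a, b));
    }
}
```

For the 26-letter alphabet this prints "26 bytes, identical: true". With a different charset or non-ASCII characters the byte count can change, but that comes from the encoding, not from the stream class you picked.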

If you want it to be smaller, you'd have to use a compression algorithm, like gzip. Note that, obviously, random data cannot be compressed. The only 'compression' you get comes from the redundancy (the non-entropy) inherent in the data that your compression algorithm can manage to find and code into a more efficient form. The other answer shows one easy way to do this.
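
To see the point about random data, here is a rough sketch that gzips two in-memory buffers of the same length: one filled with a single repeated letter, one filled with pseudo-random bytes. EntropyDemo and gzippedSize are hypothetical names for this example, and the fixed seed is there only to make runs repeatable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class EntropyDemo {

    // Returns the size in bytes of data after gzip compression.
    public static int gzippedSize(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(data);
        }
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        int n = 100_000;

        // Highly redundant input: the same byte over and over.
        byte[] repetitive = new byte[n];
        Arrays.fill(repetitive, (byte) 'a');

        // Pseudo-random input: no pattern that gzip can exploit.
        byte[] random = new byte[n];
        new Random(42).nextBytes(random);

        System.out.println("repetitive: " + n + " -> " + gzippedSize(repetitive) + " bytes");
        System.out.println("random:     " + n + " -> " + gzippedSize(random) + " bytes");
    }
}
```

The repetitive buffer should collapse to a tiny fraction of its size, while the random one should stay at roughly its original size (usually a little larger, because of the gzip header and trailer). A real text file lands somewhere in between.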
