6

If I want to store a number, let's say 56789 in a file, how many bytes will be required to store it in binary and text files respectively? I want to know how bytes are allocated to data in binary and text files.

Balthazar
programmer

5 Answers

6

It depends on:

  • text encoding and number system (decimal, hexadecimal, many more...)
  • signed/not signed
  • single integer or multiple (require separators)
  • data type
  • target architecture
  • use of compressed encodings

In ASCII a character takes 1 byte. In UTF-8 a character takes 1 to 4 bytes, but digits always take 1 byte. In UTF-16 a character takes 2 or 4 bytes, so each digit takes 2 bytes.

Non-ASCII formats may require additional bytes at the start of the file for a byte order mark (BOM): 2 bytes in UTF-16, 3 bytes in UTF-8. Whether a BOM is written depends on the editor and/or settings used when the file was created.
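As a quick check of the encoding sizes above, here is a minimal Python sketch. It assumes Python's built-in codecs, where `utf-8-sig` and `utf-16` write a BOM and `utf-16-le` does not.

```python
# Size in bytes of the text "56789" under several encodings.
text = "56789"

sizes = {
    "ascii": len(text.encode("ascii")),          # 1 byte per digit
    "utf-8": len(text.encode("utf-8")),          # digits are still 1 byte each
    "utf-8-sig": len(text.encode("utf-8-sig")),  # UTF-8 preceded by a 3-byte BOM
    "utf-16-le": len(text.encode("utf-16-le")),  # 2 bytes per digit, no BOM
    "utf-16": len(text.encode("utf-16")),        # 2-byte BOM + 2 bytes per digit
}

for name, size in sizes.items():
    print(f"{name:10s} {size} bytes")
```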

But let's assume you store the data in a simple ASCII file, or the discussion becomes needlessly complex.

Let's also assume you use the decimal number system.

In hexadecimal you use digits 0-9 and letters a-f to represent numbers. A decimal (base-10) number like 34234324423 would be 7F88655C7 in hexadecimal (base-16). In the first system we have 11 digits, in the second just 9 digits. The minimum base is 2 (digits 0 and 1) and the common maximum base is 64 (base-64). Technically, printable ASCII would let you go as high as base-95 (it has 95 printable characters, including the space), but that's very uncommon.
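The digit counts for the example number can be reproduced with a short Python sketch (the variable names are only illustrative):

```python
n = 34234324423

decimal_text = str(n)          # base 10
hex_text = format(n, "X")      # base 16, upper-case letters
binary_text = format(n, "b")   # base 2

for label, s in [("decimal", decimal_text), ("hex", hex_text), ("binary", binary_text)]:
    print(f"{label:8s} {s}  ({len(s)} characters)")
```

Stored as ASCII text, each of those characters costs one byte, so a higher base gives a shorter file.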

Each digit (0-9) will take one byte. If you have signed integers, a minus sign will precede the digits of negative numbers (so negative numbers cost 1 additional byte).

In some circumstances you may want to store several numerals. You will need a separator to tell the numerals apart. A comma (,), colon (:), semicolon (;), pipe (|) or newline (LF, CR or, on Windows, CRLF, which takes 2 bytes) have all been observed in the wild as legitimate separators of numerals.

What is a numeral? The concept or idea of the quantity 8 that is IN YOUR HEAD is the number. Any representation of that concept on stone, paper, magnetic tape, or pixels on a screen is just that: a REPRESENTATION. These are symbols which stand for what you understand in your brain. Those are numerals. Please don't ever confuse numbers with numerals; this distinction is the foundation of mathematics and computer science.

In these cases you want to count an additional character for the separator per numeral. Or maybe per numeral minus one. It depends on whether you want to terminate each numeral with a marker or separate the numerals from each other:

Example (three digits and three newlines): 6 bytes

1<LF>
2<LF>
3<LF>

Example (three digits and two commas): 5 bytes

1,2,3

Example (four digits and one comma): 5 bytes

2134,

Example (sign and one digit): 2 bytes

-3
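The byte counts in these examples can be verified with a small Python sketch. The CRLF case is an extra illustration that is not in the list above.

```python
# Each example string is encoded as ASCII, so one character is one byte.
examples = {
    "three digits, three newlines": "1\n2\n3\n",
    "three digits, two commas": "1,2,3",
    "four digits, one comma": "2134,",
    "sign and one digit": "-3",
    "three digits, CRLF line endings": "1\r\n2\r\n3\r\n",
}

byte_counts = {label: len(text.encode("ascii")) for label, text in examples.items()}

for label, count in byte_counts.items():
    print(f"{label}: {count} bytes")
```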

If you store the data in a binary format (not to be confused with the binary number system, which would still be a text format) the occupied memory depends on the integer type (or, better, bit length of the integer).

An octet (0..255) will occupy 1 byte. No separators or leading signs required.

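A 16-bit value (an integer or a half-precision float) will occupy 2 bytes. For C and C++ the target platform must be taken into account, because the sizes of the built-in integer types are not fixed. An `int` takes 4 bytes on most current 32-bit and 64-bit platforms, but a `long` takes 4 bytes on 32-bit platforms and on 64-bit Windows, while the very same code compiled for 64-bit Linux or macOS gets an 8-byte `long`.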
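To make the text versus binary comparison concrete for the number in the question, here is a sketch using Python's `struct` module. The `<` prefix requests fixed standard sizes, so the result does not depend on the machine it runs on; the choice of little-endian byte order is an assumption made for the example.

```python
import struct

n = 56789

as_text = str(n).encode("ascii")   # the five characters '5', '6', '7', '8', '9'
as_uint16 = struct.pack("<H", n)   # unsigned 16-bit integer, little-endian
as_int32 = struct.pack("<i", n)    # signed 32-bit integer
as_int64 = struct.pack("<q", n)    # signed 64-bit integer

for label, data in [("text", as_text), ("uint16", as_uint16),
                    ("int32", as_int32), ("int64", as_int64)]:
    print(f"{label:7s} {len(data)} bytes  {data!r}")
```

Writing any of these byte strings to a file opened in binary mode produces a file with exactly that many bytes of content.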

There are exceptions to those flat rules. As an example, Google's protobuf stores integers as variable-length VarInts, with an additional zig-zag encoding for its signed `sint32`/`sint64` types so that small negative numbers also stay short.

Here is a VarInt implementation in C/C++.
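For readers who want to see the idea without following the link, below is a minimal Python sketch of a protobuf-style varint with zig-zag mapping. It is not the linked C/C++ implementation; the function names `zigzag` and `encode_varint` are hypothetical, and the zig-zag step assumes values that fit in a signed 64-bit integer.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2 become 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


print(encode_varint(56789).hex())        # unsigned encoding of the question's number
print(encode_varint(zigzag(-3)).hex())   # a small negative number stays short
```

With this scheme small values take one byte, while 56789 takes three, so the size depends on the magnitude of the number rather than on a fixed type width.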


EDIT: added Thomas Weller's suggestion

Beyond the actual file CONTENT, the filesystem has to store metadata about the file (for bookkeeping such as the first sector, the filename, access permissions and more). This metadata is not included in the file size that is reported, but it still occupies space on disk.

If you store each numeral in a separate file such as the numeral 10 in the file result-10, these metadata entries will occupy more space than the numerals themselves.

If you store tens, hundreds, thousands or millions/billions of numerals in one file, that overhead becomes increasingly irrelevant.

More about metadata here.


EDIT: to be clearer about file overhead

The overhead is relevant under some circumstances, as discussed above.

But it is not a differentiator between textual and binary formats. As doug65536 says, however you store the data, if the filesystem structure is the same, it does not matter.

A file is a file, regardless of whether it contains binary data or ASCII text.

Still, the above reasoning applies independently from the format you choose.

pid
  • Don't forget a link to https://superuser.com/questions/973213/how-can-a-file-size-be-zero – Thomas Weller Oct 15 '16 at 11:04
  • @ThomasWeller Hahaa! That's a good suggestion! Didn't think about the overhead of maintaining the filesystem itself. Thank you! – pid Oct 15 '16 at 11:23
  • Why is the file system relevant? You store something regardless. – doug65536 Oct 15 '16 at 21:36
  • @doug65536 It is because from the question it is not clear if he uses 1 file for MB of data or many (thousands) of files. It's inefficient and I wouldn't do that, but we don't really know what he does. Furthermore depending on OS large files have overhead scaling differently (on Linux inodes scale linearly with filesize, on NTFS it's not the same, we don't know what he uses). I also say in the answer that if he stores the data *"in one file, that overhead becomes increasingly irrelevant"*. Between text/binary formats size is different, hence the above reasoning. Otherwise -- a file is file. – pid Oct 16 '16 at 08:57
  • @doug65536 I've edited my answer (last part) so that it's clear that I'm NOT saying that there is any difference in respect to the format he chooses. It's a completely independent thing that's not bound to his question, but still relevant knowledge in general. – pid Oct 16 '16 at 09:09
2

The number of digits needed to store a positive integer n in a given number base is floor(log(n)/log(base)) + 1, which equals ceil(log(n)/log(base)) unless n is an exact power of the base.

Storing as decimal would be base 10, storing as hexadecimal text would be base 16. Storing as binary would be base 2.

You would usually need to round up to a multiple of eight or power of two when storing as binary, but it is possible to store a value with an unusual number of bits in a packed format.

Given your example number (ignoring negative numbers for a moment):

56789 in base 2 needs 15.793323887 bits (16)
56789 in base 10 needs 4.754264221 decimal digits (5)
56789 in base 16 needs 3.948330972 hex digits (4)
56789 in base 64 needs 2.632220648 characters (3)
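The rounded-up figures in parentheses can be reproduced without floating-point logarithms. This is a minimal Python sketch; `count_digits` is a hypothetical helper that uses repeated integer division, which also handles exact powers of the base correctly.

```python
def count_digits(n: int, base: int) -> int:
    """Number of base-`base` digits needed to write the positive integer n."""
    if n <= 0 or base < 2:
        raise ValueError("expects n > 0 and base >= 2")
    digits = 0
    while n:
        n //= base   # strip the lowest digit
        digits += 1
    return digits


for base in (2, 10, 16, 64):
    print(f"base {base:2d}: {count_digits(56789, base)} digits")
```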

Representing sign needs an additional character or bit.

To see how binary compares to text, assume a byte is 8 bits, so each ASCII character is one byte (8 bits) in a text encoding. A byte has a range of 0 to 255, a decimal digit has a range from 0 to 9. Each decimal digit stored as a character therefore encodes only about 3.32 bits of the number per byte (log(10)/log(2)), while a binary encoding stores 8 bits of the number per byte. Encoding numbers as text takes about 2.4x more space. If you pad your numbers so they line up in fields, text becomes a very poor storage encoding: with a typical width of 10 digits you'll be storing 80 bits, which would be only about 33 bits of binary-encoded data.
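The arithmetic behind those ratios is short enough to check directly. A sketch in Python, with illustrative variable names:

```python
import math

bits_per_decimal_digit = math.log2(10)                   # information in one decimal digit
text_overhead = 8 / bits_per_decimal_digit               # bytes of text per byte of binary
payload_bits_in_10_digits = 10 * bits_per_decimal_digit  # useful bits in an 80-bit field

print(f"bits per digit:       {bits_per_decimal_digit:.2f}")
print(f"text/binary ratio:    {text_overhead:.2f}")
print(f"payload in 10 digits: {payload_bits_in_10_digits:.1f} of 80 bits")
```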

doug65536
0

I am not too developed in this subject; however, I believe it would not just be a case of the content, but also the META-DATA attached. But if you were just talking about the number, you could store it in ASCII or in a binary form.

In binary, 56789 could be converted to 1101110111010101; there is a 'simple' way to work this out on paper. But, http://www.binaryhexconverter.com/decimal-to-binary-converter is a website you can use to convert it.

1101110111010101 has 16 characters, therefore 16 bits which is two bytes.
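The conversion can also be done in a few lines of Python, which additionally shows the difference between storing those 16 digits as text and storing the value as raw bytes. This is a sketch; big-endian byte order is an arbitrary choice here.

```python
n = 56789

binary_digits = format(n, "b")                  # the number written in base 2
bits_needed = n.bit_length()                    # how many bits the value occupies
bytes_needed = (bits_needed + 7) // 8           # round up to whole bytes
raw = n.to_bytes(bytes_needed, "big")           # the value as raw binary bytes
text_size = len(binary_digits.encode("ascii"))  # size if the digits are saved as text

print(binary_digits, bits_needed, bytes_needed, raw.hex(), text_size)
```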

Toby Speight
JMcL
  • I'm pretty sure he does not want to save it as binary digits in text format. He could store it as text (`string text = "56789";`, taking 5 bytes at least) or as a number (`int number=56789;`, taking 4 bytes, assuming Int32). – Thomas Weller Oct 15 '16 at 10:57
  • @ThomasWeller Wait though, he said 16 bits is two bytes. – doug65536 Oct 15 '16 at 10:58
  • @doug65536: that's by accident, taking OPs number (56789) and assuming `int` as `Int16`, which is not the case for most programming languages. The answer is not accurate, because it does not tell you that on a 9, 10,11,12,13,14 or 15 digit binary number you still need 2 bytes. – Thomas Weller Oct 15 '16 at 11:02
  • @ThomasWeller Yeah, I think the real answer involves `ceil(log(n)/log(base))` where base is 10 or 2, with characters scaled up to 8 bits each, and bits rounded up to the next multiple of 8. Plus one for sign, etc. – doug65536 Oct 15 '16 at 11:04
0

A binary integer is usually around 4 bytes of storage. If you instead store the number as binary digits in a text file, and the binary equivalent is 1101110111010101, there are 16 characters in that text, each taking 1 byte in ASCII. So your number will take up about 16 bytes of storage. If your integers were stored in binary as 64-bit rather than 32-bit values, each integer would instead take up 8 bytes of storage rather than 4.

Josh Simani
-1

Before you post any question, you should do your research.

Size of the file depends on many factors but for the sake of simplicity, in text format numbers will occupy 1 byte for each digit if you are using UTF-8 encoding. On the other hand, a binary value of a 32-bit integer type (such as `long` on many C compilers) will take 4 bytes, however many digits the number has.