Choosing a magic byte least likely to appear in real data

Question

I hope this isn't too opinionated for SO; it may not have a good answer.

In a portion of a library I'm writing, I have a byte array that gets populated with values supplied by the user. These values might be of type Float, Double, Int (of different sizes), etc. with binary representations you might expect from C, say. This is all we can say about the values.

I have an opportunity for an optimization: I can initialize my byte array with the byte MAGIC, and then whenever no byte of the user-supplied value is equal to MAGIC I can take a fast path; otherwise I need to take the slow path.
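
A minimal sketch of that check, for illustration only. The value 0xA5, the name `can_take_fast_path`, and the little-endian encoding are all assumptions, not part of the library described above:

```python
import struct

MAGIC = 0xA5  # hypothetical choice, purely for illustration


def can_take_fast_path(encoded: bytes, magic: int = MAGIC) -> bool:
    """Return True when no byte of the encoded value equals the magic byte."""
    return magic not in encoded


# Encode two values roughly the way C would lay them out (little-endian assumed).
as_double = struct.pack("<d", 1.5)   # 8-byte IEEE 754 double
as_int32 = struct.pack("<i", 0xA5)   # 4-byte int whose low byte equals MAGIC

print(can_take_fast_path(as_double))  # no byte of 1.5 collides with MAGIC
print(can_take_fast_path(as_int32))   # low byte collides, so slow path
```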

So my question is: what is a principled way to go about choosing my magic byte, such that it will be reasonably likely not to appear in the (variously-encoded and distributed) data I receive?

Part of my question, I suppose, is whether there's something like a Benford's law that can tell me something about the distribution of bytes in many sorts of data.

Look at the variously-encoded and distributed data you have and pick the least frequent byte? If you don't have any data yet, make your magic byte variable and rewrite your program to re-choose the magic byte every so often based on the least frequent byte in your received data? Also, magic numbers are usually more than 8 bits in length, to increase the probability of uniqueness. – bzlm Nov 21 '14 at 23:14

score 2 · Accepted Answer · answered Nov 21 '14 at 23:16

Capture real-world data from a diverse set of inputs that would be used by applications of your library.

Write a quick-and-dirty program to analyze the dataset. It sounds like what you want to know is which bytes are most frequently totally excluded. So the output of the program would say, for each byte value, how many inputs do not contain it.

This is not the same as the least frequent byte: a byte value can be rare in aggregate yet still occur at least once in most inputs. In data analysis you need to be careful to mind exactly what you're measuring!

Use the analysis to define your architecture. If there is no byte value that is reliably absent from the inputs, you can abandon the optimization entirely.
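
The analysis described above could be sketched roughly like this. The function names and the tiny `samples` list are made up for illustration; real inputs would come from captured data:

```python
from collections import Counter
from typing import Iterable, List


def exclusion_counts(inputs: Iterable[bytes]) -> List[int]:
    """For each byte value 0..255, count how many inputs do NOT contain it."""
    inputs = list(inputs)
    containing = Counter()
    for data in inputs:
        containing.update(set(data))  # each input counts a byte at most once
    return [len(inputs) - containing[b] for b in range(256)]


def total_frequencies(inputs: Iterable[bytes]) -> List[int]:
    """For each byte value 0..255, count its total occurrences across inputs."""
    totals = Counter()
    for data in inputs:
        totals.update(data)
    return [totals[b] for b in range(256)]


# Made-up inputs: 0x07 is rare overall but present in every input,
# while 0x09 is more frequent overall but confined to a single input.
samples = [b"\x07" + b"\x09" * 10, b"\x07\x00", b"\x07\x00"]

excluded = exclusion_counts(samples)
totals = total_frequencies(samples)
for value in (0x07, 0x09):
    print(f"byte {value:#04x}: total={totals[value]}, "
          f"inputs without it={excluded[value]}")
```

In this toy data 0x07 has the lower total count, yet it would force the slow path on every input, whereas 0x09 would do so on only one. That is the distinction between the two measurements.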

answered Nov 21 '14 at 23:16

Potatoswatter

If the number of occurrences can be 0, then it's certainly the same as the least frequent byte. Don't over-complicate things. :) – bzlm Nov 21 '14 at 23:23
@bzlm And if zero isn't the result of the analysis program, you have to write a whole new program. Is that less complicated? – Potatoswatter Nov 21 '14 at 23:26
"It sounds like what you want to know is which bytes are most frequently totally excluded. So the output of the program would say, for each byte value, how many inputs do not contain it." Correct, and good point! Also I suppose I'm asking too much to be able to choose such a byte without thinking about what the real-world data is likely to look like... – jberryman Nov 21 '14 at 23:30

score 0 · Answer 2 · answered Feb 22 '22 at 10:57

I was inclined to use byte 255, but I discovered that it is also prevalent in MS Word files. So I now use byte 254 as an EOF code to terminate a file.

answered Feb 22 '22 at 10:57

grtamayo

