
Say I have a 90 megabyte file. It's not encrypted, but it is binary.

I want to store this file into a table as an array of byte values so I can process the file byte by byte.

I can spare up to 2 GB of RAM, so keeping track of which bytes have been processed, which bytes have yet to be processed, and the processed results would all be fine. I don't particularly care how long the processing takes.

How should I approach this?

RBerteig
Missingno50
  • Your post is, despite your efforts, very vague. Is your program an executable file or a Lua script? What do you mean by 'so it becomes the file's binary?' Also: 'I would also like to put each byte into it's own array'? What do you mean? That makes no sense. – pschulz Jul 21 '16 at 09:16
  • Start at http://stackoverflow.com/questions/10386672/reading-whole-files-in-lua. – lhf Jul 21 '16 at 12:33
  • ... Okay... I meant something like... say a program, as in a .exe file or a .txt file. What I actually meant was any sort of file. And by "in every array" I meant that, while it's loading, it would split the program's bytes into an array table where each little spot would be 1 byte, making it easier to process. As for the comment directly above (lhf), uh, thanks, but it's not quite what I was looking for. But it brought me a little closer. Thanks. – Missingno50 Jul 21 '16 at 14:15
  • I made a stab at reducing the question to what is being asked, striking out the irrelevant stuff. That said, the original question mentioned cryptography. Do be aware of the pitfalls of inventing (or even implementing) your own cryptography. That way lies madness for many. – RBerteig Jul 21 '16 at 21:27
  • That's not... whatever. Also, I had already figured out how; I just needed a way to split the file up so it could work. (I did mathematics on 3 bytes using 01001001, 11110000 and 1010101010 as my bytes.) As for your question system, it says 2 questions is too many DESPITE my last question being from April 6th. A little help in explaining how 2 questions spaced across months is too many? Anyway, THANK YOU ALL for helping, you've been a great help! (Especially you, RBerteig) – Missingno50 Jul 22 '16 at 02:14
  • @Missingno50 ask as many questions as you like, but one question per Question so that Answers match up nicely. We avoid the "thanks" and "please help" sort of boilerplate as implicit with participation in the site. It is possible that there is some metering applied to really new users to encourage you to work at asking good questions and soak up the flavor of the site. Don't give up, SO is worth the effort, and the questions you ask well will help others for years to come. – RBerteig Jul 22 '16 at 21:03
  • No, I mean when I try to post another question (it's related to a BSOD CRITICAL_STRUCTURE_CORRUPTION issue) – Missingno50 Jul 23 '16 at 03:10

2 Answers

1

Note: I've expanded and rewritten this answer in response to Egor's comment.

You first need to open the file in binary mode. The distinction matters on Windows, where the default text mode translates CR+LF line endings into C newlines. You do this by passing "rb" as the mode argument to io.open.
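As a minimal sketch of the binary-mode point above, the following writes three awkward bytes (CR, LF, NUL) to a scratch file and reads them back unchanged. The use of os.tmpname() for a writable scratch path is an assumption made for illustration; any file name would do.

```lua
-- Write CR, LF and NUL to a scratch file, then read them back in binary mode.
local path = os.tmpname()                  -- assumption: yields a writable path
local out = assert(io.open(path, "wb"))    -- "wb": write, binary
out:write("\r\n\0")
out:close()

local file = assert(io.open(path, "rb"))   -- "rb": read, binary (no newline translation)
local data = file:read(3)
file:close()
os.remove(path)

print(#data, data:byte(1, -1))             --> 3    13    10    0
```

Because both the write and the read use binary mode, the three bytes should come back exactly as written on any platform.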

Although you can read a file one byte at a time, in practice you will want to work through the file in buffers. Those buffers can be fairly large, but unless you know you are handling only small files in a one-off script, you should avoid reading the entire file into a buffer with file:read"*a", since that holds the whole file in memory as a single string and will cause problems with very large files.

Once you have a file open in binary mode, you read a chunk of it using buffer = file:read(n), where n is an integer count of bytes in the chunk. A moderately sized power of two will likely be the most efficient. The return value will be either nil (at end of file) or a string of up to n bytes. If it is fewer than n bytes long, that was the last buffer in the file. (If reading from a socket, pipe, or terminal, however, a read of fewer than n bytes may only indicate that no more data has arrived yet, depending on lots of other factors too complex to explain in this sentence.)
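One way to package the chunked read is as an iterator, so the calling code becomes a plain for loop. This is only a sketch: chunks is a hypothetical helper name, not part of the Lua standard library, and the scratch file and sizes in the demo are made up for illustration.

```lua
-- Hypothetical helper: returns an iterator that yields successive
-- chunks of at most `size` bytes from the file at `path`.
local function chunks(path, size)
    local file = assert(io.open(path, "rb"))
    return function()
        local buffer = file:read(size)
        if not buffer then
            file:close()          -- end of file reached: release the handle
        end
        return buffer
    end
end

-- Demo on a scratch file of 2500 bytes (assumes os.tmpname() is writable).
local path = os.tmpname()
local out = assert(io.open(path, "wb"))
out:write(string.rep("x", 2500))
out:close()

local sizes = {}
for buffer in chunks(path, 1024) do
    sizes[#sizes + 1] = #buffer
end
os.remove(path)
print(table.concat(sizes, " "))   --> 1024 1024 452
```

Note that this sketch closes the file only when the loop runs to the end; if you break out early, the handle stays open until it is garbage collected.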

The string in buffer can be processed in any number of ways. As long as #buffer is not too big, {buffer:byte(1,-1)} will produce an array of integer byte values, one for each byte in the buffer. What counts as "too big" depends partly on how your copy of Lua was configured when it was built, and may depend on other factors such as available memory as well. #buffer > 1E6 is certainly too big. In the example that follows, I used buffer:byte(i) to access each byte one at a time. That works for any size of buffer, at least as long as i remains an integer.
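If you do want the bytes in a table, one option is to convert the buffer in slices so that no single call to byte returns too many values. The sketch below assumes a hypothetical helper named append_bytes; the default step of 4096 is an arbitrary choice meant to stay well under the limit discussed above, not a documented constant.

```lua
-- Hypothetical helper: append the byte values of `buffer` to the array
-- `bytes`, converting at most `step` bytes per call to string.byte.
local function append_bytes(bytes, buffer, step)
    step = step or 4096           -- assumption: comfortably below the return-value limit
    for first = 1, #buffer, step do
        local last = math.min(first + step - 1, #buffer)
        local slice = { buffer:byte(first, last) }
        for i = 1, #slice do
            bytes[#bytes + 1] = slice[i]
        end
    end
    return bytes
end

-- Demo: a 10000-byte buffer whose last byte is 255.
local buffer = string.rep("A", 9999) .. "\255"
local bytes = append_bytes({}, buffer)
print(#bytes, bytes[1], bytes[#bytes])   --> 10000    65    255
```

Keep in mind that a table of numbers uses considerably more memory than the string it came from, so for a large file it may still be preferable to index the string directly.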

Finally, don't forget to close the file.

Here's a complete example, lightly tested. It reads a file a buffer at a time, and accumulates the total size and the sum of all bytes. It then prints the size, sum, and average byte value.

-- sum all bytes in a file
local name = ...
assert(name, "Usage: "..arg[0].." filename")

local file = assert(io.open(name, "rb"))
local sum, len = 0,0
repeat
    local buffer = file:read(1024)
    if buffer then
        len = len + #buffer
        for i = 1, #buffer do
            sum = sum + buffer:byte(i)
        end
    end
until not buffer
file:close()
print("length:",len)
print("sum:",sum)
print("mean:", sum / len)

Run with Lua 5.1.4 on my Windows box using the example as its input, it reports:

length: 402
sum:    30374
mean:   75.557213930348
RBerteig
  • Number of values returned by a function (such as `buffer:byte(1,-1)`) must be < 10^6, otherwise stack overflow exception will be raised – Egor Skriptunoff Jul 22 '16 at 18:29
  • Good point. I did say "untested"... ;-) The best answer has to be to work on the file in sensible sized blocks, whether purely in a string or exploded into an array. – RBerteig Jul 22 '16 at 20:57
0

To split the contents of a string s into an array of bytes use {s:byte(1,-1)}.
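For a short string, a quick round trip shows what this produces and how to get back to a string with string.char. This is a sketch on a made-up five-byte string; the unpack alias is there because the function is a global in Lua 5.1 and lives in table.unpack from 5.2 onward.

```lua
local unpack = table.unpack or unpack   -- table.unpack in Lua 5.2+, global unpack in 5.1

local s = "Lua\0\255"                    -- five bytes, including NUL and 255
local bytes = { s:byte(1, -1) }
print(#bytes, bytes[1], bytes[4], bytes[5])   --> 5    76    0    255

-- Rebuild the string from the byte values.
local back = string.char(unpack(bytes))
print(back == s)                         --> true
```

This is fine for small strings; for long ones the number of values returned by byte becomes the limiting factor.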

lhf