I'm going to assume that by half a million words you mean your file is about 5 GB.
If that's the case, you really don't want to read the whole thing into memory at once. Sure, it will technically fit in the RAM of many computers (although certainly not all), but loading it takes a while. From an SSD, reading 5 GB takes roughly 10 seconds, which may be fine depending on your application, but it certainly isn't speedy for a standard desktop app. From an HDD it's more like 60 seconds, and that assumes the file isn't fragmented; if it is, it will be even slower.
Both of those numbers are ideal minimums; in practice, loading a 5 GB file entirely into RAM is going to be slow at best. (Although in some rare circumstances, generally in high-performance computing, it is exactly what you want.)
A better idea, as @Carcigenicate suggested, is to stream the file through your program lazily, so you never pay that long pause. To do this, I recommend either in-input-port-bytes or in-bytes-lines. Both produce sequences you can use to process your data: the first gives you one byte at a time, and the other gives you one line of bytes at a time, in both cases until you reach EOF. You can use them directly in a for form:
(call-with-input-file "file.txt"
  (lambda (f)
    (for/fold ([counter 0])
              ([i (in-input-port-bytes f)])
      (+ counter 1))))
The above example is a slow way to count the number of bytes in a file, but it shows how you can use in-input-port-bytes.
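For comparison, here is a sketch of the line-at-a-time variant using in-bytes-lines. The helper name count-lines and the demo file name are just for illustration, not from the original question:

```racket
#lang racket

;; Sketch: count the lines of a file lazily with in-bytes-lines,
;; so only one line is held in memory at a time.
(define (count-lines path)
  (call-with-input-file path
    (lambda (f)
      (for/fold ([lines 0])
                ([l (in-bytes-lines f)])
        (+ lines 1)))))

;; Make a small demo file so the example is self-contained.
(with-output-to-file "demo-lines.txt"
  (lambda () (display "one\ntwo\nthree\n"))
  #:exists 'replace)

(count-lines "demo-lines.txt") ; => 3
```

Because the sequence is lazy, the memory footprint stays at one line regardless of how large the file is.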
There are other functions that give you a stream of characters rather than bytes from a file: in-lines, in-input-port-chars, etc.
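For instance, here is a sketch using in-lines, which hands you each line as a string; the helper name longest-line and the demo file are again just illustrative:

```racket
#lang racket

;; Sketch: find the longest line of a file with in-lines, which
;; yields one string per line until EOF, without reading the
;; whole file into memory.
(define (longest-line path)
  (call-with-input-file path
    (lambda (f)
      (for/fold ([longest ""])
                ([line (in-lines f)])
        (if (> (string-length line) (string-length longest))
            line
            longest)))))

;; Small demo file so the example runs on its own.
(with-output-to-file "demo-longest.txt"
  (lambda () (display "hi\nhello there\nbye\n"))
  #:exists 'replace)

(longest-line "demo-longest.txt") ; => "hello there"
```

The same for/fold pattern works with any of these sequences; only the element type (byte, byte string, or string) changes.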