Here's how I would do it without reading all the data into memory and without too much complexity:
First, compute the maximum line length in the file (using binary mode):
with open(inputfile, 'rb') as source:
    max_line_len = max(len(line) for line in source)  # lengths include the trailing "\n" (assumes the file ends with a newline)
Then write another file to disk with the correct padding, so each line has exactly the same size (in total this needs at least twice the original file's disk space, possibly much more, but since you don't have the memory, disk is the trade-off). Count the lines at the same time:
with open(inputfile, 'rb') as source, open(outputfile, 'wb') as dest:
    for count, line in enumerate(source):
        dest.write(line + b"*" * (max_line_len - len(line)))  # write the padded line
You just created a bigger file, but now every line has exactly the same length. Note that we padded after the linefeed, which will be useful later. Viewed as raw text, the output would look something like this (if max_line_len were 20, for instance):
the first line
****the second line
***another line
******
(I'm not sure of the exact number of stars, but you get the idea. Note that the padding character doesn't matter as long as it isn't \n.)
This means you can seek to the start of any line with a simple multiplication: line n starts at byte offset n * max_line_len, just like a fixed-size record file or a database.
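To make that concrete, here is a minimal sketch of such a random access (the helper name read_nth_line is mine, purely for illustration):

def read_nth_line(padded_path, n, max_line_len):
    # jump straight to line n, without scanning the n lines before it
    with open(padded_path, 'rb') as f:
        f.seek(n * max_line_len)
        return f.read(max_line_len).split(b"\n")[0]  # drop the "\n" and the padding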
Now generate the list of line indexes and shuffle it:
import random

indexes = list(range(count + 1))  # count + 1 lines were written above
random.shuffle(indexes)
Now iterate over this shuffled list of indexes, seek to the corresponding offset, read one fixed-size chunk, and split on the linefeed (this is where padding after the linefeed pays off) to discard the padding:
with open(outputfile, 'rb') as source:
    for idx in indexes:
        source.seek(idx * max_line_len)  # jump straight to the record for line idx
        random_line = source.read(max_line_len).decode().split("\n")[0]
        print(random_line)  # or store to another file
I haven't tested this, but it should work as long as you have enough disk space. Of course, it is very wasteful if you have one very long line while the rest are short.
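For reference, here is the whole thing assembled into one function. This is an untested sketch along the same lines; the function name shuffle_lines, the tempfile scratch file, and writing the result to a file (instead of printing) are my additions, and it assumes the input is non-empty and ends with a newline:

import os
import random
import tempfile

def shuffle_lines(inputfile, shuffled_file):
    # pass 1: find the longest line (length includes the trailing "\n")
    with open(inputfile, 'rb') as source:
        max_line_len = max(len(line) for line in source)

    # pass 2: write a scratch copy where every line is padded to max_line_len
    fd, padded = tempfile.mkstemp()
    os.close(fd)
    with open(inputfile, 'rb') as source, open(padded, 'wb') as dest:
        for count, line in enumerate(source):
            dest.write(line + b"*" * (max_line_len - len(line)))

    # pass 3: read the fixed-size records back in shuffled order
    indexes = list(range(count + 1))
    random.shuffle(indexes)
    with open(padded, 'rb') as source, open(shuffled_file, 'wb') as dest:
        for idx in indexes:
            source.seek(idx * max_line_len)
            dest.write(source.read(max_line_len).split(b"\n")[0] + b"\n")

    os.remove(padded)  # the padded copy was only scratch space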