4

I have a text file which is 10k lines long and I need to build a function to extract 10 random lines each time from this file. I already found how to generate random numbers in Python with numpy and also how to open a file but I don't know how to mix it all together. Please help.

user2719565

6 Answers

7

If you know that your file is exactly 10k lines long then you can use linecache:

import random
import linecache

def random_lines(filename):
    # linecache numbers lines from 1, so sample from 1..10000
    idxs = random.sample(range(1, 10001), 10)
    return [linecache.getline(filename, i) for i in idxs]

This returns a list with 10 random lines which you can print with:

for line in random_lines('file.txt'):
    print(line)
elyase
  • A 10K file is unlikely to be 10000 lines long; that would mean an average line length of 0.024 characters… – abarnert Aug 27 '13 at 02:00
  • @abarnert, I understood 10K=10000 lines long, but it could also be 10K=10Kb or 10KB as you mention. – elyase Aug 27 '13 at 02:02
  • This works; however, unless you want newline characters appended to every entry in the list, you need to strip them out. This post covers that: https://stackoverflow.com/questions/15823166/linecache-adding-an-extra-line-to-the-line-that-i-get `return [linecache.getline(filename, i).rstrip('\n') for i in idxs]` – Dave Jul 21 '21 at 14:23
7

If you know how many lines long the file is, you can use linecache, as the other answers suggest. But just knowing that it's 10K in size tells you nothing about how many lines long it is.

If you know the file is small enough to fit into memory—and a 10K file is easily small enough—just read it into memory:

import random

with open('file.txt') as f:
    lines = list(f)
for i in range(10):
    print(random.choice(lines))

But what if you don't know how long the file is, and can't afford to read it all into memory? Then you'll need to do two passes:

import linecache
import random

with open('file.txt') as f:
    linecount = sum(1 for line in f)
for i in range(10):
    # linecache numbers lines from 1, so pick from 1..linecount
    print(linecache.getline('file.txt', random.randrange(1, linecount + 1)))

Note that both are going to leave newlines at the end of each line. If you want to get rid of these, you can change the first example from list(f) to [line.rstrip() for line in f], or just call rstrip() in the print, or use end='' (Python 3.x) or a trailing comma (Python 2.x) in the print. For the linecache example, the first obviously doesn't work, but you can still do either of the others.
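As a rough illustration of the newline handling described above, the in-memory approach could be wrapped in a function. The name pick_with_replacement and its parameters are made up for this sketch; it accepts any iterable of lines, including an open file:

```python
import random

def pick_with_replacement(lines, k=10):
    """Pick k lines at random (repeats possible), dropping each trailing newline."""
    # rstrip('\n') removes only the newline; a bare rstrip() would also
    # remove trailing spaces and tabs
    cleaned = [line.rstrip('\n') for line in lines]
    return [random.choice(cleaned) for _ in range(k)]

# Hypothetical usage; 'file.txt' is the placeholder name from the examples above:
# with open('file.txt') as f:
#     for line in pick_with_replacement(f):
#         print(line)
```

Because the newlines are stripped up front, a plain print(line) no longer produces blank lines between entries.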


Also note that I used the stdlib random library instead of using numpy here. If you're just generating 10 random numbers to be used in normal Python code, there's no reason to use numpy. (On the other hand, if you do have a good reason to use numpy here, you may want to read the lines into a pandas table and apply the random indices to that.)

abarnert
2

You may use this code, which doesn't care about the file length; however, on rare occasions you may get duplicates:

from random import choice
# the with block closes the file once the lines have been read
with open("yourfile") as f:
    lines = [a.strip() for a in f]
result = [choice(lines) for a in range(10)]

result is a list containing 10 lines chosen randomly from the file named yourfile.
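If duplicates are not acceptable, one possible variation is to draw without replacement using random.sample from the standard library. This is only a sketch; the function name sample_unique_lines is invented here, and it assumes the file has at least k lines:

```python
import random

def sample_unique_lines(lines, k=10):
    """Return k lines from distinct positions, with surrounding whitespace removed."""
    cleaned = [line.strip() for line in lines]
    # random.sample never picks the same position twice; it raises
    # ValueError if k is larger than the number of lines
    return random.sample(cleaned, k)

# Hypothetical usage with the same placeholder file name as above:
# with open("yourfile") as f:
#     result = sample_unique_lines(f)
```

Note that "distinct" refers to positions in the file: if the file itself contains the same text on two lines, both can still be selected.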

  • For ranges as small as 10, `xrange(10)` will probably be _slower_, not faster… and it's hard to believe the difference will matter either way. – abarnert Aug 27 '13 at 02:06
  • From a quick test, in Apple 64-bit CPython 2.7.2, `%timeit 'for i in range(10): pass'` takes 17.1ns, `%timeit 'for i in xrange(10): pass'` takes 18.1ns. Of course you are wasting a whole 92 bytes by creating the list, but it's hard to imagine where that would matter… – abarnert Aug 27 '13 at 02:08
  • @abarnert thanks for letting me know, I actually didn't know that xrange() may be slower in some cases; I updated my answer. –  Aug 27 '13 at 02:10
  • If you think about how it works, `xrange` needs a function call into an iterator for each value, while iterating over a list just needs to do a quick C deref-and-increment-pointer for each value. So, until the time to build the list is large enough to overshadow the speedup, `range` will be faster. (Of course that interpretation is only true for CPython, and only on platforms that can deref-and-inc quickly… but that's what most people are dealing with most of the time.) Anyway, it's a premature optimization; if it's clearly not a bottleneck, write whatever's more readable/obvious. – abarnert Aug 27 '13 at 02:14
  • thanks for your answer, unfortunately I can't upvote yet(need 15+ reputation) – user2719565 Aug 27 '13 at 02:44
1

It is also possible to do the job in one pass without loading the entire file into memory, though the code is more complicated and mostly unnecessary unless the file is HUGE.

The trick is the following:

Suppose we need only one random line. Save the first line in a variable; then, for the i-th line, replace the saved line with probability 1/i. Return the saved line on reaching the end of the file. For 10 random lines, keep a list of 10 elements and run the same process independently for each element on every line of the file.
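This idea is commonly known as reservoir sampling. A minimal sketch follows; both function names are invented for illustration. random_line follows the single-line procedure described above. reservoir_sample is not exactly what the answer describes: instead of ten independent one-line draws (which can pick the same line twice), it uses a common variant that keeps k lines from distinct positions.

```python
import random

def random_line(lines):
    """Return one uniformly random line from an iterable, in a single pass."""
    chosen = None
    for i, line in enumerate(lines, start=1):
        # Replace the saved line with probability 1/i (always true when i == 1)
        if random.randrange(i) == 0:
            chosen = line
    return chosen

def reservoir_sample(lines, k=10):
    """Return up to k lines from distinct positions, in a single pass."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines
            reservoir.append(line)
        else:
            # randint is inclusive on both ends, so j is uniform over 0..i
            j = random.randint(0, i)
            if j < k:
                # The new line is kept with probability k / (i + 1)
                reservoir[j] = line
    return reservoir

# Hypothetical usage; 'file.txt' is a placeholder name:
# with open('file.txt') as f:
#     print(reservoir_sample(f, 10))
```

Only k lines are held in memory at any time, which is the point of the technique for very large files.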

qihqi
0

Try linecache:

import linecache
# put your random line numbers (1-based) into a list in whichever way you are generating them
lines = [3, 45, 678]  # use your existing code here
for i in lines:
    print(linecache.getline('/etc/file', i))
domoarigato
  • Without knowing how long the file is, how can he know the range to generate the random numbers from? – abarnert Aug 27 '13 at 02:00
0

If you don't know the number of lines of your file, you can count them with, for example, this code:

line_count = 0
with open(filename) as file:
    for line in file:
        line_count += 1

Then you can generate random numbers in the range [0, line_count):

import random
lines_to_read = []
for i in range(10):
    line = random.randint(0, line_count - 1)
    lines_to_read.append(line)

And finally, read the file again, select those lines randomly chosen, and do whatever you want with them, for example, print them:

with open(filename) as file:
    for index, line in enumerate(file):
        if index in lines_to_read:
            print(line)
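One side effect of the loop above is that the lines come out in file order rather than in the order they were drawn. If the drawn order matters, the same two-pass idea could be adapted roughly as follows. This is only a sketch: the function name lines_in_drawn_order is made up, it picks distinct line numbers with random.sample instead of randint, and it assumes the file has at least k lines.

```python
import random

def lines_in_drawn_order(path, k=10):
    """Two passes: count the lines, then fetch k distinct lines in the order drawn."""
    with open(path) as f:
        line_count = sum(1 for _ in f)
    # Map each chosen 0-based line index to its position in the random draw
    drawn = random.sample(range(line_count), k)
    order = {idx: pos for pos, idx in enumerate(drawn)}
    picked = [None] * k
    with open(path) as f:
        for index, line in enumerate(f):
            if index in order:
                # Store the line at the slot matching when it was drawn
                picked[order[index]] = line.rstrip('\n')
    return picked
```

The dictionary lookup also avoids scanning a list for every line of the file, which matters little for 10 indices but keeps the second pass cheap for larger samples.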

I hope it helps you! Cheers!

marcelrf
  • This forces them to print out in sorted order (e.g., if you randomly select `[44, 66, 41, 85, 5, 94, 95, 90, 67, 58]`, you'll actually print lines 5, 41, 44, 58, 66, 67, 85, 90, 94, 95). So it drastically reduces the range of possible outputs. – abarnert Aug 27 '13 at 02:10
  • Also, why use `randint(0, b-1)` instead of just `randrange(0, b)` or `randrange(b)`? You're just asking for off-by-one errors that way. – abarnert Aug 27 '13 at 02:12
  • Well, you may be right. But the question did not require a special order for the output lines. Cheers! – marcelrf Aug 27 '13 at 18:00