How to choose a random line from a text file

Question

I am trying to make a lottery program for my school (we have an economic system).

My program generates numbers and saves them to a text file. When I "pull" a number out of my generator, I want to ensure that there is a winner.

Q: How do I have Python select a random line out of my text file and give my output as that number?

score 17 · Accepted Answer · answered Feb 17 '13 at 18:50

How do I have python select a random line out of my text file and give my output as that number?

Assuming the file is relatively small, the following is perhaps the easiest way to do it:

import random

with open('data.txt') as f:
    line = random.choice(f.readlines())

score 12 · Answer 2 · edited Sep 08 '13 at 23:29

If the file is very large, you could seek to a random location in the file (given the file size) and then read the next full line:

import os, random

def get_random_line(file_name):
    total_bytes = os.stat(file_name).st_size
    random_point = random.randint(0, total_bytes)
    with open(file_name) as f:
        f.seek(random_point)
        f.readline()  # skip this line to clear the partial line
        return f.readline()  # empty string if random_point fell in the last line

edited Sep 08 '13 at 23:29

mlissner

answered Feb 18 '13 at 22:52

Ali-Akber Saifee

That method would give shorter lines a smaller chance of being chosen, so it's not a good choice if you really want your random generator to choose each line with the same probability. – mata Sep 08 '13 at 23:57

It would also never return the first line, and won't return a line at all when random_point is in the last line. – Pascal Hofmann Jun 13 '17 at 12:53
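
To make the two comments above concrete, here is a minimal sketch that enumerates every byte offset `random.randint(0, total_bytes)` could produce and tallies which line the seek-and-skip step would return. The sample data and the helper name `line_after_offset` are made up for illustration, and the file is opened in binary mode so that offsets are plain byte positions.

```python
import os
import tempfile
from collections import Counter


def line_after_offset(path, offset):
    """Return what the seek-and-skip method yields for one byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        f.readline()         # discard the (possibly partial) current line
        return f.readline()  # b"" when the offset fell in the last line


# Hypothetical sample data: one long line followed by two short ones.
sample = b"a" * 20 + b"\n" + b"b\n" + b"c\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    size = os.stat(path).st_size
    # Try every offset that randint(0, size) could return.
    counts = Counter(line_after_offset(path, off) for off in range(size + 1))
finally:
    os.remove(path)

for line, n in counts.items():
    print(repr(line), n)
```

With this sample, the line after the long line is returned for 21 of the 26 offsets, the first line is never returned, and 3 offsets yield an empty result. Put differently, a line's chance of being picked is proportional to the byte length of the line before it.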

score 6 · Answer 3 · answered Feb 23 '16 at 13:41

import random

def random_line(filename):
    line_num = 0
    selected_line = ''
    with open(filename) as f:
        for line in f:
            line_num += 1
            # Keep the current line with probability 1/line_num.
            if random.uniform(0, line_num) < 1:
                selected_line = line
    return selected_line.strip()

Most of the approaches given here work, but they tend to load the whole file into memory at once. This approach does not, so it works even if the file is big.

The approach is not very intuitive at first glance. The idea behind it is that after N lines have been read, each of them has a probability of exactly 1/N of being the currently selected line.

From page 123 of 'Python Cookbook'.
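
The 1/N claim can be checked exactly rather than by simulation. The sketch below is not from the Cookbook. It assumes line i replaces the current pick with probability 1/i (which is what `random.uniform(0, line_num) < 1` approximates) and multiplies out the chance that line i is picked and then never replaced. The function name `final_selection_probability` is hypothetical.

```python
from fractions import Fraction


def final_selection_probability(i, n):
    """Probability that line i (1-based) is still the pick after n lines."""
    p = Fraction(1, i)               # line i replaces the current pick
    for j in range(i + 1, n + 1):
        p *= Fraction(j - 1, j)      # line j does not replace it
    return p


n = 10
probabilities = [final_selection_probability(i, n) for i in range(1, n + 1)]
print(probabilities)
```

Every entry comes out as 1/10, because the product (1/i)(i/(i+1))...((n-1)/n) telescopes to 1/n.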

score 3 · Answer 4 · edited Feb 18 '13 at 22:30

Off the top of my head:

import random

def pick_winner():
    with open("file.txt", "r") as f:
        lines = f.readlines()
    random_line_num = random.randrange(0, len(lines))
    return lines[random_line_num]

edited Feb 18 '13 at 22:30

Marcello B.

answered Feb 17 '13 at 18:53

Srdjan Grubor

score 3 · Answer 5 · answered Feb 17 '13 at 19:07

With a slight modification to your input file (store the number of items in the first line), you can choose a number uniformly without having to read the entire file into memory first.

import random

def choose_number(fname):
    with open(fname, "r") as f:
        count = int(f.readline().strip())  # first line holds the number of items
        for line in f:
            # Pick this line with probability 1/(items remaining).
            if not random.randrange(0, count):
                return int(line.strip())
            count -= 1

Say you have 100 numbers. The probability of choosing the first number is 1/100. The probability of choosing the second number is (99/100)(1/99) = 1/100. The probability of choosing the third number is (99/100)(98/99)(1/98) = 1/100. I'll skip the formal proof, but the probability of choosing any one of the 100 numbers is 1/100.

It's not strictly necessary to store the count in the first line, but it saves you the trouble of having to read the entire file just to count the lines. Either way, you don't need to store the entire file in memory to choose any single line with equal probability.

If you already have the number of lines as the first element, there is no need to call `random.randrange` for each line. Just randomly select a line number and move forward to that line. – mata Sep 09 '13 at 00:02
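
The suggestion in the comment above could look roughly like this. It is a sketch under the same assumption as the answer (the first line of the file holds the number of data lines). The function name `choose_line_with_count` and the sample file contents are invented for illustration.

```python
import os
import random
import tempfile
from itertools import islice


def choose_line_with_count(path):
    """Pick one data line uniformly; the file's first line must hold the count."""
    with open(path) as f:
        count = int(f.readline())
        if count <= 0:
            return None
        target = random.randrange(count)  # 0-based index among the data lines
        # Skip `target` data lines, then return the next one
        # (None if the file has fewer lines than the count claims).
        return next(islice(f, target, None), None)


# Demo on a throwaway file: a count line followed by three numbers.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("3\n10\n20\n30\n")
    demo_path = tmp.name

try:
    picks = [choose_line_with_count(demo_path).strip() for _ in range(300)]
finally:
    os.remove(demo_path)

print(sorted(set(picks)))
```

This makes a single `random.randrange` call per draw and stops reading as soon as the chosen line is reached, though on average it still reads about half the file.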

score 2 · Answer 6 · edited Feb 17 '13 at 20:08

Another approach:

import random, fileinput

text = None
for line in fileinput.input('data.txt'):
    # Replace the current pick with probability 1/lineno.
    if random.randrange(fileinput.lineno()) == 0:
        text = line
print(text)

Distribution:

$ seq 1 10 > data.txt

# run for 100000 times
$ ./select.py > out.txt

$ wc -l out.txt 
100000 out.txt

$ sort out.txt | uniq -c
  10066 1
  10004 10
  10023 2
   9979 3
   9926 4
   9936 5
   9878 6
  10023 7
  10154 8
  10011 9

I don't see the skewness, but perhaps the dataset is too small...

edited Feb 17 '13 at 20:08

answered Feb 17 '13 at 18:58

Fredrik Pihl

This skews the choice towards numbers that appear earlier in the file. – chepner Feb 17 '13 at 19:00
It's skewed a little differently than I expected (I didn't look at your code carefully). You are basically selecting a set of numbers from 1 to 10, then outputting the largest one. So although there's a greater chance of 1 being chosen as part of the set (in fact, it will *always* be part of the set, since `randrange(0,1)` will always return 0), it will never be returned unless *no* other number is chosen. Notice your distribution looks like an inverted bell curve, with the extreme numbers chosen significantly more often than the middle numbers. – chepner Feb 17 '13 at 20:12
@chepner - You're on to something here. Learnt something new today. I think I'll play around with the distribution for a while. Thanks. – Fredrik Pihl Feb 17 '13 at 20:19

So, I worked out some probabilities, and the bias towards *selecting* smaller numbers and the bias towards *outputting* larger ones cancel perfectly, so you should get a 1/10 chance of outputting any of the 10 numbers. For example, the probability of *selecting* the number 3 is 1/3, but the probability of *not* selecting a larger number is (3/4)(4/5)...(9/10), and the product of the two works out to exactly 1/10. I think the inverted bell curve I saw is probably due to the small sample size. – chepner Feb 17 '13 at 20:26

+1. There is no skewness. The algorithm produces a uniform distribution. @chepner: it is the well-known [reservoir sampling with k==1 algorithm](http://en.wikipedia.org/wiki/Reservoir_sampling). [`enumerate()` could be used instead of `fileinput.lineno()`](http://askubuntu.com/a/527778/3712) – jfs Sep 24 '14 at 18:35
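
As the last comment notes, the loop is reservoir sampling with k == 1, and `enumerate()` can stand in for `fileinput.lineno()`. A minimal sketch of that variant follows. The function name `reservoir_pick` and the seeded demo are illustrative, not taken from the answer.

```python
import random
from collections import Counter


def reservoir_pick(items, rng=random):
    """Return one item chosen uniformly from an iterable, in a single pass."""
    chosen = None
    for n, item in enumerate(items, start=1):
        # Replace the current pick with probability 1/n.
        if rng.randrange(n) == 0:
            chosen = item
    return chosen


# Demo: draw from the numbers 1..10 many times and tally the results.
rng = random.Random(0)  # fixed seed so the demo is repeatable
counts = Counter(reservoir_pick(range(1, 11), rng) for _ in range(10000))
for value in sorted(counts):
    print(value, counts[value])
```

For a file, pass the open file object (for example `with open('data.txt') as f: reservoir_pick(f)`), which iterates line by line without loading the whole file.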

Marcello B. · Answer 7 · 2017-11-29T17:22:45.513

I found this snippet in a Python tutorial:

import random

def randomLine(filename):
    # Retrieve a random line from a file, reading through the file once
    lineNum = 0
    it = ''
    with open(filename, "r") as fh:
        for aLine in fh:
            lineNum = lineNum + 1
            # How likely is it that this is the last line of the file?
            if random.uniform(0, lineNum) < 1:
                it = aLine
    return it

# Intended usage: pull = randomLine(filename)
