9

Starting with two lists such as:

lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

I want to have the user input how many items they want to extract, as a percentage of the overall list length, and the same indices from each list to be randomly extracted. For example, if I wanted 50%, the output could be:

newLstOne = ['8', '1', '3', '7', '5']
newLstTwo = ['8', '1', '3', '7', '5']

I have achieved this using the following code:

from random import randrange

lstOne = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
lstTwo = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

LengthOfList = len(lstOne)
print LengthOfList

PercentageToUse = input("What Percentage Of Reads Do you want to extract? ")
RangeOfListIndices = []

HowManyIndicesToMake = (float(PercentageToUse)/100)*float(LengthOfList)
print HowManyIndicesToMake

for x in lstOne:
    if len(RangeOfListIndices)==int(HowManyIndicesToMake):
        break
    else:
        random_index = randrange(0,LengthOfList)
        RangeOfListIndices.append(random_index)

print RangeOfListIndices


newlstOne = []
newlstTwo = []

for x in RangeOfListIndices:
    newlstOne.append(lstOne[int(x)])
for x in RangeOfListIndices:
    newlstTwo.append(lstTwo[int(x)])

print newlstOne
print newlstTwo

But I was wondering if there is a more efficient way of doing this; in my actual use case this is subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?

Thank you

Raymond Hettinger
PaulBarr
  • @devnull You are far too aggressive about marking questions as possible duplicates. The other question asks "how do I make a random sample". This question asks two far more interesting questions, "how do I make the same sample from multiple lists" and "are the built-in randomization functions biased". – Raymond Hettinger May 04 '14 at 18:09
  • @RaymondHettinger How could I argue having watched one of your Python videos earlier during the day? (Close vote retracted.) – devnull May 05 '14 at 02:05

3 Answers

14

Q. I want to have the user input how many items they want to extract, as a percentage of the overall list length, and the same indices from each list to be randomly extracted.

A. The most straightforward approach directly matches your specification:

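 import random

 percentage = float(raw_input('What percentage? '))
 k = int(round(len(list1) * percentage / 100))   # whole number of items, rounded to nearest
 indices = random.sample(xrange(len(list1)), k)  # k unique positions, shared by both lists
 new_list1 = [list1[i] for i in indices]
 new_list2 = [list2[i] for i in indices]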
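
The snippet above is Python 2 (raw_input, xrange). For readers on Python 3, the same idea can be sketched as a small function. This is a minimal sketch, not part of the original answer: the name sample_same_indices is hypothetical, and rounding halves upward is an assumption made here because Python 3's round() rounds halves to the nearest even number.

```python
import random


def sample_same_indices(list1, list2, percentage):
    """Draw the same random positions from two equal-length lists."""
    if len(list1) != len(list2):
        raise ValueError("lists must have the same length")
    # Number of items to draw, rounded to the nearest whole item
    # (assumption: a half rounds up, e.g. 162.5 -> 163).
    k = int(len(list1) * percentage / 100 + 0.5)
    # k unique indices, generated once and reused for both lists.
    indices = random.sample(range(len(list1)), k)
    return [list1[i] for i in indices], [list2[i] for i in indices]
```

Because random.sample() draws from range(len(list1)) rather than from the data itself, only k indices are materialized, which keeps the cost low for a population of 145,000 items.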

Q. in my actual use case this is subsampling from 145,000 items. Furthermore, is randrange sufficiently free of bias at this scale?

A. In Python 2 and Python 3, the random.randrange() function completely eliminates bias (it uses the internal _randbelow() method that makes multiple random choices until a bias-free result is found).

In Python 2, the random.sample() function is slightly biased but only in the round-off in the last of 53 bits. In Python 3, the random.sample() function uses the internal _randbelow() method and is bias-free.
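
The "multiple random choices until a bias-free result is found" idea is rejection sampling. The following is a simplified sketch, assuming Python 3; randbelow_sketch is a hypothetical name, and the code illustrates the technique rather than reproducing the standard library's source.

```python
import random


def randbelow_sketch(n):
    """Return a uniformly distributed integer in [0, n) by rejection."""
    if n <= 0:
        raise ValueError("n must be positive")
    k = n.bit_length()           # number of bits needed to represent n
    r = random.getrandbits(k)    # uniform over [0, 2**k)
    while r >= n:                # out of range: discard and draw again
        r = random.getrandbits(k)
    return r
```

A shortcut such as getrandbits(k) % n would favor small values whenever n does not divide 2**k; discarding out-of-range draws avoids that at the cost of an occasional retry.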

Raymond Hettinger
  • Thanks for your thorough answer. One problem I have in this code is that you can't input values such as 12.5 percent and get the code to round to the nearest value. How would you implement this in your example? – PaulBarr May 04 '14 at 19:15
  • Just for clarification I dont mean rounding the percentage value: I mean if you had 1300 items and you wanted 12.5% of these the code would return 163 items (12.5% is 162.5 items) not 169 items (if it rounds the percentage up to 13%) – PaulBarr May 04 '14 at 19:24
  • @PaulBarr No worries. I just changed the *int* conversion to a *float* conversion. – Raymond Hettinger May 04 '14 at 23:48
  • There was still a problem as the index was a float not an integer so I just added in k = round(k) and k = int(k) to round it up. Thanks for the help! – PaulBarr May 05 '14 at 07:43
1

Just zip your two lists together, use random.sample to do your sampling, then zip again to transpose back into two lists.

import random

_zips = random.sample(zip(lstOne,lstTwo), 5)

new_list_1, new_list_2 = zip(*_zips)

demo:

list_1 = range(1,11)
list_2 = list('abcdefghij')

_zips = random.sample(zip(list_1, list_2), 5)

new_list_1, new_list_2 = zip(*_zips)

new_list_1
Out[33]: (3, 1, 9, 8, 10)

new_list_2
Out[34]: ('c', 'a', 'i', 'h', 'j')
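
On Python 3, zip() returns an iterator rather than a list, and random.sample() requires a sequence, so the call above needs an explicit list(). A minimal sketch, assuming Python 3; the function name sample_pairs is illustrative and not from the answer.

```python
import random


def sample_pairs(list_1, list_2, k):
    """Sample k aligned pairs from two lists and split them back apart."""
    pairs = list(zip(list_1, list_2))   # materialize: sample() needs a sequence
    chosen = random.sample(pairs, k)
    if not chosen:                      # zip(*[]) cannot be unpacked into two names
        return [], []
    new_1, new_2 = zip(*chosen)         # transpose back into two tuples
    return list(new_1), list(new_2)
```

Note that this builds one tuple per element of the whole population before sampling, so it uses more memory than sampling indices directly.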
roippi
  • This is a pretty way to do it, but I can't upvote it because it does too much work (looping over the entire population and saving a tuple for each pair). It is better to build a small list of unique indices and extract the desired selections. – Raymond Hettinger May 04 '14 at 18:40
  • No disagreements here :-) – roippi May 04 '14 at 18:49
1

The way you are doing it looks mostly okay to me.

If you want to avoid sampling the same object several times, you could proceed as follows:

import random

k = 5                                # desired number of items (example value)
choose_from = range(len(lstOne))     # list of every index in lstOne (Python 2)
random.shuffle(choose_from)          # shuffle the indices in place
newlstOne, newlstTwo = [], []
for i in choose_from[:k]:            # the first k shuffled indices are unique random positions
    newlstOne.append(lstOne[i])      # take the same position from both original lists
    newlstTwo.append(lstTwo[i])
Reblochon Masque
  • This does way too much work for large population sizes. The *random.sample()* function uses much less memory and makes fewer calls to the random number generator. – Raymond Hettinger May 04 '14 at 17:57
  • Thank you kind Sir, you are of course correct. I did not know about random.sample; I learn something every time you post. – Reblochon Masque May 05 '14 at 00:04