I have a big array of items and another array of weights of the same size. I would like to sample without replacement from the first array based on the weights from the second array. Is there a way to do this using gonum?

Marco Bonelli
alpaca
  • Can you show an example of your data? What is "big"? Are we talking hundreds, or billions? What are the weights? Are they percentages in `float64`, or integer multipliers? Can the entire set of weights be summed accurately with 64 bits? – JimB Jun 14 '18 at 22:14
  • I have not implemented them yet. To answer your questions: the data is on the order of tens to hundreds of thousands of items. The weights are a calculated "score"; I can normalize them to sum to 1 and change the type to float64. I'm not sure about the last point: if I normalize an array of float64, is this possible? – alpaca Jun 14 '18 at 22:17
  • The two arrays can alternatively be a list of tuples; whichever is easier to deal with is OK. – alpaca Jun 14 '18 at 22:22
  • Ah, I didn't notice the gonum tag, and thought you were asking to implement this. Converting the weights to the format expected by the stat packages should suffice. – JimB Jun 14 '18 at 23:00

1 Answer

`Weighted` and its `Take()` method look like exactly what you want.

From the doc:

func NewWeighted(w []float64, src *rand.Rand) Weighted

NewWeighted returns a Weighted for the weights w. If src is nil, rand.Rand is used as the random source. Note that sampling from weights with a high variance or overall low absolute value sum may result in problems with numerical stability.

func (s Weighted) Take() (idx int, ok bool)

Take returns an index from the Weighted with probability proportional to the weight of the item. The weight of the item is then set to zero. Take returns false if there are no items remaining.

Therefore Take is indeed what you need for sampling without replacement.

You can use NewWeighted to create a Weighted with the given weights, then use Take to extract one index with probability based on the previously set weights, and then select the item at the extracted index from your array of samples.


Working example:

package main

import (
    "fmt"
    "time"

    "golang.org/x/exp/rand"

    "gonum.org/v1/gonum/stat/sampleuv"
)

func main() {
    samples := []string{"hello", "world", "what's", "going", "on?"}
    weights := []float64{1.0, 0.55, 1.23, 1, 0.002}

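    // Build the sampler from the weights, seeding the random source with the current time.
    w := sampleuv.NewWeighted(
        weights,
        rand.New(rand.NewSource(uint64(time.Now().UnixNano()))),
    )

    // Take returns the index of one item and zeroes its weight,
    // so that index cannot be drawn again.
    i, _ := w.Take()

    fmt.Println(samples[i])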
}
Marco Bonelli
  • `NewWeighted` already calls `ReweightAll` to build the internal heap. – JimB Jun 14 '18 at 23:00
  • Just curious: any specific reason to use `golang.org/x/exp/rand` instead of `https://golang.org/pkg/math/rand/`? – kostix Jun 15 '18 at 07:09
  • @kostix that's a funny question. It's not my choice: the developers of Gonum decided to use x/exp/rand instead of math/rand, I believe for better performance. Take a look at [this merged pull request](https://github.com/gonum/gonum/pull/301). – Marco Bonelli Jun 15 '18 at 10:20
  • The choice to change the random package was made because randomness sources can become important, and having the capacity to move to higher-quality sources when they become available is important. There are other important benefits as well. See https://github.com/golang/go/issues/21835 – kortschak Jul 04 '18 at 12:22
  • @MarcoBonelli BTW, `NewWeighted` can take a `nil` `rand.Source` and use the package global rand functions rather than having to pass in a source (this would simplify your example). Also, that parameter needs a trailing comma. I had an edit that did both of these things, but it was rejected. – kortschak Jul 05 '18 at 07:52