3

I know this might be a pretty stupid question to ask, but what the hell..

I am currently trying to implement a softmax action selector, which uses the Boltzmann distribution.

Formula:

    P(a) = exp(Q(a)/τ) / Σ_b exp(Q(b)/τ)

where τ is the temperature parameter.

What I am a bit unsure about is how to know which action to select. The function provides me with a probability for each action, but how do I use that to choose which action to perform?

Vato
  • Are you asking how to go about generating a random action choice, based on the distribution of probabilities assigned to each action, given by the softmax function? – Christopher Oicles May 23 '16 at 22:36
  • I am a bit unsure how to use this formula.. Do you use the action which has the highest probability, or how do you go about this? – Vato May 23 '16 at 22:40
  • Selecting the action with the highest weight would correspond to a purely "greedy" selection policy -- but for this, you wouldn't need to use softmax activation at all, because the action with the greatest weight before softmax will also have the greatest softmax probability. Softmax maps its inputs to a set of probabilities which sum to 1, and its temperature parameter specifies an interpolation between the purely greedy selection policy and a selection policy where all actions are equally probable. After this, I would expect a random selection, using the probability distribution. – Christopher Oicles May 23 '16 at 23:01
  • Could you give an example? – Vato May 23 '16 at 23:03
  • Ok -- I'll add an answer, because I need a place to put code. – Christopher Oicles May 23 '16 at 23:04

1 Answer

4

For some machine learning applications, there is a point where a set of raw outputs (like from a neural network) needs to be mapped to a set of probabilities, normalized to sum to 1.

In reinforcement learning, a set of available actions' weights might need to be mapped to a set of associated probabilities, which will then be used to randomly select the next action taken.

The Softmax function is commonly used to map output weights to a set of corresponding probabilities. A "temperature" parameter allows the selection policy to be tuned, interpolating between pure exploitation (a "greedy" policy, where the highest-weighted action is always chosen) and pure exploration (where each action has an equal probability of being chosen).

This is a simple example of using the Softmax function. Each "action" corresponds to one indexed entry in the vector<double> objects passed around in this code.

#include <iostream>
#include <iomanip>
#include <vector>
#include <random>
#include <cmath>


using std::vector;

// The temperature parameter here may correspond to 1/temperature in
// formulations seen elsewhere.
// Here, lower temperatures push the highest-weighted output's
// probability toward 1.0, and higher temperatures even out all
// the probabilities, toward 1/<entry count>.
// temperature's valid range is between 0 and +Infinity (excluding
// these two extremes).
vector<double> Softmax(const vector<double>& weights, double temperature) {
    vector<double> probs;
    double sum = 0;
    for(auto weight : weights) {
        double pr = std::exp(weight/temperature);
        sum += pr;
        probs.push_back(pr);
    }
    for(auto& pr : probs) {
        pr /= sum;
    }
    return probs;
}

// Rng class encapsulates random number generation
// of double values uniformly distributed between 0 and 1,
// in case you need to replace std's <random> with something else.
struct Rng {
    std::mt19937 engine;
    std::uniform_real_distribution<double> distribution;
    Rng() : distribution(0,1) {
        std::random_device rd;
        engine.seed(rd());
    }
    double operator ()() {
        return distribution(engine);
    }
};

// Selects one index out of a vector of probabilities, "probs"
// The sum of all elements in "probs" must be 1.
vector<double>::size_type StochasticSelection(const vector<double>& probs) {

    // The unit interval is divided into sub-intervals, one for each
    // entry in "probs".  Each sub-interval's size is proportional
    // to its corresponding probability.

    // You can imagine a roulette wheel divided into differently-sized
    // slots for each entry.  An entry's slot size is proportional to
    // its probability and all the entries' slots combine to fill
    // the entire roulette wheel.

    // The roulette "ball"'s final location on the wheel is determined
    // by generating a (pseudo)random value between 0 and 1.
    // Then a linear search finds the entry whose sub-interval contains
    // this value.  Finally, the selected entry's index is returned.

    static Rng rng;
    const double point = rng();
    double cur_cutoff = 0;

    for(vector<double>::size_type i=0; i<probs.size()-1; ++i) {
        cur_cutoff += probs[i];
        if(point < cur_cutoff) return i;
    }
    return probs.size()-1;
}

void DumpSelections(const vector<double>& probs, int sample_count) {
    for(int i=0; i<sample_count; ++i) {
        auto selection = StochasticSelection(probs);
        std::cout << " " << selection;
    }
    std::cout << '\n';
}

void DumpDist(const vector<double>& probs) {
    auto flags = std::cout.flags();
    std::cout.precision(2);
    for(vector<double>::size_type i=0; i<probs.size(); ++i) {
        if(i) std::cout << "  ";
        std::cout << std::setw(2) << i << ':' << std::setw(8) << probs[i];
    }
    std::cout.flags(flags);
    std::cout << '\n';
}

int main() {
    vector<double> weights = {1.0, 2, 6, -2.5, 0};

    std::cout << "Original weights:\n";
    for(vector<double>::size_type i=0; i<weights.size(); ++i) {
        std::cout << "    " << i << ':' << weights[i];
    }
    std::cout << "\n\nSoftmax mappings for different temperatures:\n";
    auto softmax_thalf  = Softmax(weights, 0.5);
    auto softmax_t1     = Softmax(weights, 1);
    auto softmax_t2     = Softmax(weights, 2);
    auto softmax_t10    = Softmax(weights, 10);

    std::cout << "[Temp 1/2] ";
    DumpDist(softmax_thalf);
    std::cout << "[Temp 1]   ";
    DumpDist(softmax_t1);
    std::cout << "[Temp 2]   ";
    DumpDist(softmax_t2);
    std::cout << "[Temp 10]  ";
    DumpDist(softmax_t10);

    std::cout << "\nSelections from softmax_t1:\n";
    DumpSelections(softmax_t1, 20);
    std::cout << "\nSelections from softmax_t2:\n";
    DumpSelections(softmax_t2, 20);
    std::cout << "\nSelections from softmax_t10:\n";
    DumpSelections(softmax_t10, 20);
}

Here is an example of the output:

Original weights:
    0:1    1:2    2:6    3:-2.5    4:0

Softmax mappings for different temperatures:
[Temp 1/2]  0: 4.5e-05   1: 0.00034   2:       1   3: 4.1e-08   4: 6.1e-06
[Temp 1]    0:  0.0066   1:   0.018   2:    0.97   3:  0.0002   4:  0.0024
[Temp 2]    0:   0.064   1:    0.11   2:    0.78   3:   0.011   4:   0.039
[Temp 10]   0:    0.19   1:    0.21   2:    0.31   3:    0.13   4:    0.17

Selections from softmax_t1:
 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1

Selections from softmax_t2:
 2 2 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2 2 2 1

Selections from softmax_t10:
 0 0 4 1 2 2 2 0 0 1 3 4 2 2 4 3 2 1 0 1
Christopher Oicles