
I am seeking help to make the code below more efficient. It works, but I am not satisfied with it. There is a bug to be fixed (currently irrelevant). I am using the <random> header and stable_partition for the first time.

The Problem definition/specification:
I have a population (vector) of numerical data (float values). I want to create two RANDOM samples (2 vectors) based on a user-specified percentage, i.e. popu_data = 30%Sample1 + 70%Sample2, where the 30% is given by the user. I didn't implement it as a % yet, but that is trivial.

The Problem in Programming: I am able to create the 30% sample from the population. Creating the second vector (sample2, the remaining 70%) is my problem. Because the 30% of values are selected at random, I have to keep track of their indexes in order to remove them, and somehow I cannot find a more efficient logic than the one I implemented.

My Logic is (NOT happy): in the population data, the values at the randomly chosen indexes are replaced with a unique sentinel value (here 0.5555). Later I learnt about the stable_partition function, which compares each value of the population against 0.5555; the values for which the predicate is false form the new Sample2, which complements Sample1.

Further to this: how can I make this generic, i.e. split a population into N sub-samples, each a user-defined % of the population?

Thank you for any help. I tried vector erase, remove, copy, etc., but none of it materialized into anything better than the current code. I am looking for better, more efficient logic and STL usage.

#include <random>
#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

// Returns true for original population values (all in [1, 2]) and
// false for the 0.5555 sentinel, so the partition puts sample2 first.
bool Is05555 (float i){
    return i > 0.5560;
}

int main()
{
    random_device rd;
    mt19937 gen(rd());
    uniform_real_distribution<> dis(1, 2);
    vector<float>randVals;

    cout<<"All the Random Values between 1 and 2"<<endl;
    for (int n = 0; n < 20; ++n) {
        float rnv = dis(gen);
        cout<<rnv<<endl;
        randVals.push_back(rnv);
    }
    cout << '\n';

    random_device rd2;
    mt19937 gen2(rd2());
    uniform_int_distribution<int> dist(0,19);

    vector<float>sample;
    vector<float>sample2;
    for (int n = 0; n < 6; ++n) {
        float rnv = dist(gen2);
        sample.push_back(randVals.at(rnv));
        randVals.at(rnv) = 0.5555;
    }

    cout<<"Random Values between 1 and 2 with 0.5555 as a Unique Value"<<endl;
    for (int n = 0; n < 20; ++n) {
        cout<<randVals.at(n)<<" ";
    }
    cout << '\n';

    std::vector<float>::iterator bound;
    bound = std::stable_partition (randVals.begin(), randVals.end(), Is05555);

    for (std::vector<float>::iterator it=randVals.begin(); it!=bound; ++it)
        sample2.push_back(*it);

    cout<<sample.size()<<","<<sample2.size()<<endl;

    cout<<"Random Values between 1 and 2 Subset of 6 only: "<<endl;

    for (std::size_t n = 0; n < sample.size(); ++n) {
        cout<<sample.at(n)<<" ";
    }
    cout << '\n';

    cout<<"Random Values between 1 and 2 - Remaining: "<<endl;
    for (std::size_t n = 0; n < sample2.size(); ++n) {
        cout<<sample2.at(n)<<" ";
    }
    cout << '\n';

    return 0;
}
Caesar
Prasad
  • The algorithm set_difference will probably rescue me - I just saw that function popping up in the right-side column. However, it seems I have to sort before using it, which is not convincing. – Prasad Jul 20 '13 at 20:46
  • For your 30% sample, do you need each sample chosen with 30% probability (could result in sample size *slightly* different from 30%) or exactly 30% of the items chosen? Do you need your results in the original order, or is order of the sample irrelevant? – Jerry Coffin Jul 20 '13 at 21:43
  • `vector<float> sample; for (int n = 0; n < 6; ++n) { float rnv = dist(gen2); sample.push_back(randVals.at(rnv)); } sort(randVals.begin(), randVals.end()); sort(sample.begin(), sample.end()); vector<float> sample2; set_difference(randVals.begin(), randVals.end(), sample.begin(), sample.end(), inserter(sample2, sample2.end()));` **the code using set_difference - it works** – Prasad Jul 20 '13 at 22:25
  • @JerryCoffin - for my current needs, probability is irrelevant; that complexity is avoided for now. 30% just means 30% of the data/values from the population. I thought about the order - the order of the sample is irrelevant. I am facing another problem: in my code, in the loop with (n < 6), the random indexes selected can repeat. How do I prevent repetition of indexes, i.e. have unique indexes only (the bug I mentioned above)? – Prasad Jul 20 '13 at 22:34

1 Answer


Given a requirement for an N% sample, with order irrelevant, it's probably easiest to just do something like:

std::random_shuffle(randVals.begin(), randVals.end());
int num = randVals.size() * percent / 100.0;  // percent: the user-specified N

auto pos = randVals.begin() + randVals.size() - num;

// get our sample
std::vector<float> sample1(pos, randVals.end());

// remove sample from original collection
randVals.erase(pos, randVals.end());

For some types of items in the array, you could improve this by moving items from the original array to the sample array, but for simple types like float or double, that won't accomplish anything.

Jerry Coffin
  • Thank you. Looking for crisp and efficient code. Either we pick randomly (the one I implemented - long & dirty) or random shuffle and get a contiguous piece from the pie (crisp and efficient) - thank you. – Prasad Jul 24 '13 at 10:18
  • As per this posting: http://stackoverflow.com/questions/13459953/random-shuffle-not-really-random?rq=1, I think I should call srand above the 5 lines you mentioned? Thank you! std::srand(std::time(0)); – Prasad Aug 01 '13 at 07:52