I want to split a very large dataset (over one million observations) into a test set and a train set. As you can see, I have already managed to do something similar in the code below using dividerand.

What the code does: we have a very large set X; on every iteration we select N = 1700 observations and split them 70/30 into train and test. What I would further like to do, though, is use specific counts instead of percentages with dividerand. For instance, split the data into mini-batches of size 2000, and then use 500 for test and 1500 for training. In the next iteration we would select the next rows (2001:4000) and again split them into 500 test and 1500 train, and so on.

Again, dividerand lets me do this with ratios, but I would like to use actual counts.

X = randn(613*1700,9); % placeholder data, large enough for 613 batches of 1700 rows
mu_6 = zeros(510,613); % 390/802 - 450/695 - 510/613 - Test/Iterations
s2_6 = zeros(510,613);
nl6 = zeros(613,1);
RSME6 = zeros(613,1);
prev_batch = 0;

inf = @infGaussLik;               % inference method (note: this name shadows MATLAB's built-in inf)
meanfunc = [];                    % empty: don't use a mean function
covfunc = @covSEiso;              % Squared Exponential covariance
likfunc = @likGauss;              % Gaussian likelihood


for k=1:613
    new_batch = k*1700;
    X_batch = X(1+prev_batch:new_batch,:);
    [train,~,test] = dividerand(transpose(X_batch),0.7,0,0.3);
    train = transpose(train);
    test = transpose(test);
    x_t = train(:,1:8); % Train inputs: 1190 rows per batch (70% of 1700)
    y_t = train(:,9);   % Train targets
    x_z = test(:,1:8);  % Test inputs: 510 rows per batch (30% of 1700)
    y_z = test(:,9);    % Test targets

    % Calculations for Gaussian process regression
    if k==1
        hyp = struct('mean', [], 'cov', [0 0], 'lik', -1); 
    else
        hyp = hyp2; 
    end
    hyp2 = minimize(hyp, @gp, -100, inf, meanfunc, covfunc, likfunc, x_t, y_t);
    [m4 s4] = gp(hyp2, inf, meanfunc, covfunc, likfunc, x_t, y_t, x_z);
    [nlZ4,dnlZ4] = gp(hyp2, inf, meanfunc, covfunc, likfunc, x_t, y_t);
    RSME6(k,1) = sqrt(sum((m4-y_z).^2)/numel(y_z)); % divide by the actual test size, not a hard-coded 450
    nl6(k,1) = nlZ4;
    mu_6(:,k) = m4;
    s2_6(:,k) = s4;
    % End of calculations

    prev_batch = new_batch;
    disp(k);
end
Jespar

1 Answer

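How about sorting a vector of random numbers to get a random permutation of the batch indices, then taking the first 1500 as one group and the remaining 500 as the other: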

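[~, idx] = sort(randn(2000,1)); % sort order of random numbers = random permutation of 1:2000
group1_idx = idx(1:1500);       % 1500 indices, e.g. for training
group2_idx = idx(1501:end);     % remaining 500 indices, e.g. for testing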
Zep
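
To show how fixed counts could fit into the batch loop from the question, here is a minimal sketch. It is an illustration, not the asker's or the answerer's code, and it rests on a few assumptions: randperm is used in place of the sort trick (it returns a random permutation directly), the data X is a small placeholder, the names batch_size, n_test, n_train and n_batches are made up for this example, and any rows left over after the last full batch are simply ignored.

```matlab
% Sketch: fixed-size train/test split inside each mini-batch.
rng(0);                          % fix the seed so the split is reproducible
X = randn(10000, 9);             % placeholder data: 8 input columns + 1 target column
batch_size = 2000;               % rows per mini-batch (hypothetical name)
n_test  = 500;                   % test rows per batch
n_train = batch_size - n_test;   % 1500 training rows per batch
n_batches = floor(size(X, 1) / batch_size);   % leftover rows are ignored

for k = 1:n_batches
    first = (k - 1) * batch_size + 1;              % first row of batch k
    X_batch = X(first:first + batch_size - 1, :);  % rows 1:2000, 2001:4000, ...

    idx = randperm(batch_size);                % random order of 1:batch_size
    train = X_batch(idx(1:n_train), :);        % first 1500 shuffled rows
    test  = X_batch(idx(n_train + 1:end), :);  % remaining 500 rows

    x_t = train(:, 1:8);  y_t = train(:, 9);   % training inputs and targets
    x_z = test(:, 1:8);   y_z = test(:, 9);    % test inputs and targets
    % ... GP fitting and prediction for batch k would go here ...
end
```

With this split, the preallocated result matrices from the question (mu_6, s2_6) would need n_test rows and n_batches columns, and the RMSE divisor would be numel(y_z), which is 500 here.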