
I have a large dataset and I am trying to estimate a function f(x) for all instances in that dataset. Which of the following approaches is better?

Approach 1: Sample N instances from the dataset and use bootstrapping on these N instances to estimate f(x).

Approach 2: Sample N instances from the large dataset M times. Then calculate f(x) for each of these M samples and aggregate (for example, average) the results.
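For concreteness, a minimal Python sketch of both approaches; the synthetic Gaussian dataset, the choice of the mean as f, and the values of N and M are placeholders for illustration, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder setup: a synthetic "large dataset" and a simple
# statistic f (the mean) standing in for the real f(x).
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)
f = np.mean

N = 1_000  # sample size
M = 50     # number of bootstrap resamples / independent samples

# Approach 1: one sample of N instances, bootstrapped M times
# (resampling with replacement from the same N points).
sample = rng.choice(data, size=N, replace=False)
boot = [f(rng.choice(sample, size=N, replace=True)) for _ in range(M)]
estimate_1 = np.mean(boot)

# Approach 2: M independent samples of N instances drawn from the
# full dataset; compute f on each and aggregate by averaging.
subs = [f(rng.choice(data, size=N, replace=False)) for _ in range(M)]
estimate_2 = np.mean(subs)

print(estimate_1, estimate_2)
```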

Soroosh

1 Answer


There is no single definite answer; however, approaches that simply use more information about the dataset are usually better (less prone to overfitting). So if your decision is "should I use just N samples, resampled M times internally, or M*N different samples?", then in the absence of problem-specific knowledge the answer would be the second one.

lejlot
  • So should I divide my dataset into M distinct sets and then take N samples from each, or should I take M*N samples from all the data? – Soroosh Jul 20 '15 at 20:36
  • There is no one definite answer, as the next question would be "how big should M be?". In general it is a continuous bias-variance problem. Let's assume you can get K points. Then putting M=1 (one big chunk of data) leads to high variance. On the other hand, putting M=K (a great number of small chunks) leads to high bias. Everything in between will try to balance variance and bias; the exact solution depends on the particular problem and model used. You will have to fit this "M" to the problem, unfortunately. I would start with some small value of M, let's say 2 or 5, and proceed from this point. – lejlot Jul 20 '15 at 21:27
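To illustrate the trade-off described in the comment above, here is a minimal sketch that fixes a total budget of K points and varies M; the synthetic data, the nonlinear statistic f = np.std (chosen because, unlike the mean, its estimate on small chunks is biased, so chunk size actually matters), and the values of K and M are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)

# A nonlinear statistic: its estimate on small chunks is biased
# downward, so the effect of shrinking chunks is visible.
f = np.std

K = 10_000  # total sampling budget ("K points" in the comment)

# Split the fixed budget K into M chunks of size K // M each:
# M = 1 is one big chunk (a single estimate, higher variance),
# M close to K means tiny chunks (each estimate more biased).
for M in (1, 2, 5, 10, 100, 1000):
    n = K // M
    estimates = [f(rng.choice(data, size=n, replace=False))
                 for _ in range(M)]
    print(f"M={M:5d}  n={n:6d}  aggregated f = {np.mean(estimates):.4f}")
```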