I have been looking for a way to calculate the minimum number of samples, Ne(min), required to train a classification model when the data are not normally distributed. A research paper suggests the following:

if the data are not normally distributed, an exponential relationship between d and N will be assumed and the number of samples that are required may be as plentiful as:
Ne(min) = Dsteps^d
where Dsteps is the discrete number of steps per feature and d is the dimension of the dataset (the number of features).
....
It is useful to think of a histogram approach to understand this relationship. If we want to construct a histogram from data with at least one sample in each bin and with Dsteps discrete steps per feature, we will require at least Dsteps^d samples.
The number of samples required to model the data accurately is in this case an exponential function of d.
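
To make the histogram argument concrete, here is a minimal Python sketch of how I understand it. It assumes Dsteps is already known and is the same for every feature, which is exactly the part I cannot work out. The names `min_samples` and `occupied_fraction` are my own and do not come from the paper, and equal-width binning over each feature's observed range is my assumption, not necessarily what the author intends.

```python
import numpy as np


def min_samples(d_steps: int, d: int) -> int:
    """Ne(min) = Dsteps^d: one sample for every bin of a d-dimensional histogram."""
    return d_steps ** d


def occupied_fraction(X, d_steps: int) -> float:
    """Fraction of the Dsteps^d histogram bins that contain at least one sample.

    Assumption: each feature is cut into d_steps equal-width bins over its
    observed range (hypothetical choice, not taken from the paper).
    """
    X = np.asarray(X, dtype=float)
    _, d = X.shape
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant features
    # Bin index per feature, clipped so the maximum falls in the last bin.
    idx = np.minimum(((X - lo) / span * d_steps).astype(int), d_steps - 1)
    occupied = len({tuple(row) for row in idx})
    return occupied / min_samples(d_steps, d)


if __name__ == "__main__":
    # Growth of Ne(min) with the number of features for a fixed Dsteps.
    for d in range(1, 6):
        print(d, min_samples(10, d))

    # Non-normal (exponential) toy data: 500 samples, 3 features.
    rng = np.random.default_rng(0)
    X = rng.exponential(size=(500, 3))
    print(occupied_fraction(X, d_steps=10))
```

With fewer samples than Dsteps^d, the occupied fraction can never reach 1, which is how I read the "at least one sample in each bin" condition.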

I would be very grateful if someone could help me calculate this measure: the discrete number of steps per feature.
An explanation with R or Matlab code would be very helpful. Thank you :D

Edit:
Paper reference: Christiaan Maarten Van Der Walt: Data Measure that Characterises Classification Problems, 2008.

Taha Kamil
  • @IceCreamToucan the features are continuous variables, and apparently it must be one measure for the whole dataset, not for each feature. – Taha Kamil Dec 30 '19 at 16:58
  • There is information missing. `d` is the number of columns in your feature matrix and you want to determine `Ne(min)` but there is no information given about `Dsteps`. You may want to have a closer look in the research paper (and also provide the reference here...) – max Dec 31 '19 at 09:34
  • @max, yes, d is the number of features, and here is the link to the paper: https://repository.up.ac.za/bitstream/handle/2263/27624/dissertation.pdf (page 60). Thank you for the help :) – Taha Kamil Dec 31 '19 at 13:37
