
I want to determine whether a data point belongs to a population. Discriminant analysis tells me how likely it is that a point belongs to each of several (2+) populations. But if someone asks whether their sample is part of a single population, I would like to test it against that population's parameters and tell them how likely it is that the sample came from it. Any ideas?

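# Simulated population: two normal variables and one left-skewed beta variable
x <- rnorm(1000, 4, 2)
y <- rnorm(1000, 17, 6)
z <- rbeta(1000, 9, 2)

orig.pop <- data.frame(x, y, z)

# New observation to test against the population (same column order: x, y, z)
new.point <- c(5, 18.5, 0.6)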
Schatzi121
  • Well, whether a sample is from a given population can only be judged relative to any other possibility. So either explicitly or implicitly (preferably explicitly), you have to assume some alternative. When the alternative is implicit, this problem falls under the heading of "outlier detection". You might want to consider it some more and then follow up on stats.stackexchange.com, as this is a conceptual question and not about programming. Incidentally my own preference is to make explicit assumptions about alternatives, and then it becomes an exercise in Bayesian inference. – Robert Dodier Nov 04 '22 at 17:01
  • Couldn't it simply be the probability of getting those values, given the mean and distribution of the population? Like a simple Gaussian curve, where you can compute the probability of a specific value given the distribution. – Schatzi121 Nov 04 '22 at 19:30
  • I like the idea of identifying a multivariate outlier. Which package/function would you use for that? – Schatzi121 Nov 04 '22 at 19:33
  • One method that appears useful is the Mahalanobis distance. I think that will do what I need. The code for it, to be used in conjunction with the code in my question, is: mahalanobis(orig.pop, colMeans(orig.pop), cov(orig.pop)) – Schatzi121 Nov 04 '22 at 19:42
  • Mahalanobis distance is probably a reasonable thing to do. Bear in mind, though, that it falls into the category of implicit assumptions about alternative distributions. In particular, if your outlier detection rule is distance greater than threshold, your alternative is implicitly a uniform distribution over the input space. – Robert Dodier Nov 05 '22 at 01:22
  • Okay, any ideas for non-uniform distributions? I can start with the Mahalanobis distance, but some of the parameters will definitely have skewed distributions. The problem is that I create prediction equations, but then people use them with very different inputs, well outside the range of the data I fitted them on. So I want to run this check, and if the input isn't part of the fit population, I want to flag it and not let them run the prediction tool. – Schatzi121 Nov 06 '22 at 02:27
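
The Mahalanobis idea from the comments could be applied to the single new.point roughly as follows. This is only a sketch under stated assumptions, not a definitive method. The chi-square tail probability is only meaningful if the population is roughly multivariate normal, which the beta-distributed z is not exactly. The empirical share makes no normality assumption but is limited by the population size. The names centre, covmat, d2.pop, d2.new, p.chisq, p.empirical, in.range and flag, the 0.01 cutoff and the set.seed call are all illustrative choices that do not appear in the thread.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Population and new point as defined in the question
x <- rnorm(1000, 4, 2)
y <- rnorm(1000, 17, 6)
z <- rbeta(1000, 9, 2)
orig.pop <- data.frame(x, y, z)
new.point <- c(5, 18.5, 0.6)

# Location and scatter estimated from the fit population
centre <- colMeans(orig.pop)
covmat <- cov(orig.pop)

# stats::mahalanobis() returns SQUARED distances
d2.pop <- mahalanobis(orig.pop, centre, covmat)   # one value per population row
d2.new <- mahalanobis(new.point, centre, covmat)  # single value for the new point

# Tail probability under approximate multivariate normality:
# squared distance ~ chi-square with df = number of variables
p.chisq <- pchisq(d2.new, df = ncol(orig.pop), lower.tail = FALSE)

# Distribution-free alternative: share of population rows that lie
# at least as far from the centre as the new point does
p.empirical <- mean(d2.pop >= d2.new)

# Simple per-variable check that the point lies inside the observed ranges
in.range <- all(new.point >= sapply(orig.pop, min) &
                new.point <= sapply(orig.pop, max))

# Hypothetical flagging rule; the 0.01 cutoff is arbitrary
flag <- (p.empirical < 0.01) || !in.range

print(c(d2 = d2.new, p.chisq = p.chisq, p.empirical = p.empirical))
print(flag)
```

A small p.chisq or p.empirical says the point is unusually far from the centre of the fit population. As noted in the comments, a distance threshold still carries an implicit assumption about the alternative distribution, so the cutoff should be chosen with the prediction tool's tolerance in mind.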
