
Perhaps this is a philosophical question rather than a programming question, but here goes...

In R, is there some package or method that will let you deal with "less than"s as a concept?

Backstory: I have some data which, for privacy reasons, is given as <5 for small numbers (representing integers 1, 2, 3 or 4, in fact). I'd like to do some simple arithmetic on this data (adding, subtracting, averaging, etc.) but obviously I need to find some way to deal with these <5s conceptually. I could replace them all with NAs, sure, but of course that's throwing away potentially useful information, and I would like to avoid that if possible.

Some examples of what I mean:

a <- c(2,3,8)
b <- c(<5,<5,8)   # not valid R; "<5" stands for an unknown integer from 1 to 4
mean(a)
> 4.3333
mean(b)
> somewhere between 3.3333 and 5.3333
Unstack

  • Would an implementation of [interval arithmetic](https://en.wikipedia.org/wiki/Interval_arithmetic) help you? `library("sos"); findFn("{interval arithmetic}")` (probably not, now that I think about it) – Ben Bolker Feb 25 '16 at 22:17

5 Answers


If you are interested in the values at the bounds, I would take each dataset and split it into two datasets; one with all <5s set to 1 and one with all <5s set to 4.

a <- c(2,3,8)
b1 <- c(1,1,8)
b2 <- c(4,4,8)

mean(a)
# 4.333333
mean(b1)
# 3.333333
mean(b2)
# 5.333333
C_Z_

Following @hedgedandlevered's proposal, but he's wrong with regard to the normal and/or uniform distributions. You asked for integer values, so you have to use discrete distributions, such as the Poisson, the binomial (including the negative binomial), the geometric, etc.
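A minimal base-R sketch of that idea, assuming a Poisson model with an illustrative rate of `lambda = 2` (both the distribution and the rate are assumptions, not anything given in the question): draw Poisson variates and reject anything outside 1..4, so each `<5` entry is replaced by a plausible integer.

```r
set.seed(1)

# Rejection sampling: keep only Poisson(lambda) draws that land in 1..4,
# the integers a "<5" entry could actually be.
# lambda = 2 is an assumed rate, purely for illustration.
rtrunc_pois <- function(n, lambda = 2) {
  out <- integer(0)
  while (length(out) < n) {
    draws <- rpois(n, lambda)
    out <- c(out, draws[draws >= 1 & draws <= 4])
  }
  out[seq_len(n)]
}

b <- c(rtrunc_pois(2), 8)  # the two "<5" entries from the question, plus the 8
mean(b)
```

Repeating the imputation many times and averaging the results gives a sense of how sensitive the summary statistic is to the assumed distribution.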

Severin Pappadeux

In statistics, "less than" data is known as "left censored" (https://en.wikipedia.org/wiki/Censoring_(statistics)); searching on "censored data" might help.

My favoured approach to analysing such data is maximum likelihood (https://en.wikipedia.org/wiki/Maximum_likelihood). There are a number of R packages for maximum likelihood estimation; I like the survival package (https://cran.r-project.org/web/packages/survival/index.html), but there are others, e.g. fitdistrplus (https://cran.r-project.org/web/packages/fitdistrplus/index.html), which "provides functions for fitting univariate distributions to different types of data (continuous censored or non-censored data and discrete data) and allowing different estimation methods (maximum likelihood, moment matching, quantile matching and maximum goodness-of-fit estimation)".

You will have to specify (assume?) the form of the distribution of the data; you say the values are integers, so a Poisson (or related) distribution may be appropriate.
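To make the likelihood idea concrete, here is a base-R sketch for a Poisson model in which each `<5` entry contributes P(X <= 4) to the likelihood; the observed values and the number of censored entries below are invented purely for illustration.

```r
# Hypothetical data: two fully observed counts and two left-censored "<5" entries.
obs    <- c(8, 12)
n_cens <- 2

# Negative log-likelihood: observed values contribute dpois(); each censored
# value contributes the probability of falling at or below 4, i.e. ppois(4, ...).
negloglik <- function(lambda) {
  -(sum(dpois(obs, lambda, log = TRUE)) +
      n_cens * ppois(4, lambda, log.p = TRUE))
}

fit <- optimize(negloglik, interval = c(0.01, 50))
fit$minimum  # maximum-likelihood estimate of lambda
```

The same censored-likelihood construction is what packages like survival and fitdistrplus do for you, for a wider range of distributions.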

user20637

Treat them as draws from a probability distribution of your choosing, and replace them with actual randomly generated numbers. Setting them all equal to 2.5, a normal-like distribution capped at 0 and 5, or a uniform distribution on [0,5] are all options.
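A minimal sketch of the uniform option in base R, using `sample()` so the replacements come out as integers in 1..4 (this assumes the `<5` entries were read in as `NA`; the helper name `replace_lt5` is made up for illustration):

```r
set.seed(42)

# Replace each censored entry (assumed to be NA after import) with a
# uniformly drawn integer from 1..4.
replace_lt5 <- function(x) {
  censored <- is.na(x)
  x[censored] <- sample(1:4, sum(censored), replace = TRUE)
  x
}

b <- replace_lt5(c(NA, NA, 8))
mean(b)
```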

hedgedandlevered
    You don't sample integers with continuous distributions like normal and uniform – Severin Pappadeux Feb 25 '16 at 18:58
  • it should go without saying that you would use `round()`. And I said normal-like, because a true normal distribution would be boundless. – hedgedandlevered Feb 25 '16 at 19:20
  • 1
    `round()` won't produce integers no matter how you try. `q <- 2.2; t <- round(q, 0); print(class(t))` will print `numeric` – Severin Pappadeux Feb 25 '16 at 19:24
  • ok, then `as.integer()` it. And all of that is if you actually need the representation of integers between 0 and 5 to be integers. If you're doing averaging (or addition without a need for an integer answer), that isn't the case. Note that this is a conceptual question, and without that specification, I have no reason to consider that to be a requirement. – hedgedandlevered Feb 25 '16 at 19:27
  • well, I proposed to use some discrete distribution (perhaps, with rejection). That will solve all problems at once – Severin Pappadeux Feb 25 '16 at 19:54
  • yes. If rounding or integers are problems. in which case `runif() %>% round() %>% as.integer()` accomplishes the same. either way. – hedgedandlevered Feb 25 '16 at 19:59

I deal with similar data regularly. I strongly dislike any of the suggestions to replace the <5 values with a particular number. Consider the following two cases:

  • c(<5,<5,<5,<5,<5,<5,<5,<5,6,12,18)
  • c(<5,6,12,18)

The problem comes when you try to do arithmetic with these: any fixed replacement value distorts the first case far more than the second, because there the censored values make up most of the sample.

I think a solution to your issue is to think of the values as factors (in the R sense). You can bin the values above 5 too if that helps, for example:

  • c(<5,<5,<5,<5,<5,<5,<5,<5,5-9,10-14,15-19)
  • c(<5,5-9,10-14,15-19)

Now, you still wouldn't do arithmetic on these, but your summary statistics (histograms/proportion tables/etc...) would make more sense.
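A base-R sketch of that binning, with break points chosen to match the example intervals above (the raw vector is hypothetical):

```r
# Hypothetical raw data: eight "<5" entries plus three observed counts.
vals <- c(rep("<5", 8), "6", "12", "18")

num <- suppressWarnings(as.numeric(vals))  # "<5" becomes NA

# Bin the observed counts into the same width-5 intervals as "<5".
binned <- ifelse(is.na(num), "<5",
                 as.character(cut(num, breaks = c(4, 9, 14, 19),
                                  labels = c("5-9", "10-14", "15-19"))))
binned <- factor(binned, levels = c("<5", "5-9", "10-14", "15-19"))
table(binned)
```

A proportion table then follows naturally with `prop.table(table(binned))`.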

Jonathan Carroll