
Perhaps this is a philosophical question rather than a programming question, but here goes...

In R, is there some package or method that will let you deal with "less than"s as a concept?

Backstory: I have some data which, for privacy reasons, is given as <5 for small numbers (representing integers 1, 2, 3 or 4, in fact). I'd like to do some simple arithmetic on this data (adding, subtracting, averaging, etc.) but obviously I need to find some way to deal with these <5s conceptually. I could replace them all with NAs, sure, but of course that's throwing away potentially useful information, and I would like to avoid that if possible.

Some examples of what I mean:

a <- c(2,3,8)
b <- c(<5,<5,8)   # not valid R; "<5" stands for an unknown integer from 1 to 4
mean(a)
> 4.3333
mean(b)
> somewhere between 3.3333 and 5.3333
Unstack

  • Would an implementation of [interval arithmetic](https://en.wikipedia.org/wiki/Interval_arithmetic) help you? `library("sos"); findFn("{interval arithmetic}")` (probably not, now that I think about it) – Ben Bolker Feb 25 '16 at 22:17

5 Answers


If you are interested in the values at the bounds, I would take each dataset and split it into two datasets; one with all <5s set to 1 and one with all <5s set to 4.

a <- c(2,3,8)
b1 <- c(1,1,8)
b2 <- c(4,4,8)

mean(a)
# 4.333333
mean(b1)
# 3.333333
mean(b2)
# 5.333333
C_Z_

Following @hedgedandlevered's proposal, but he's wrong with regard to the normal and/or uniform distributions. You asked for integer values, so you have to use discrete distributions, such as the Poisson, the binomial (including the negative binomial), the geometric, etc.
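A minimal base-R sketch of that idea, assuming a Poisson model with an illustrative rate of `lambda = 2` (both the distribution and the rate are assumptions, not anything given in the question): draw Poisson variates and reject anything outside 1..4, so each `<5` entry is replaced by a plausible integer.

```r
set.seed(1)

# Rejection sampling: keep only Poisson(lambda) draws that land in 1..4,
# the integers a "<5" entry could actually be.
# lambda = 2 is an assumed rate, purely for illustration.
rtrunc_pois <- function(n, lambda = 2) {
  out <- integer(0)
  while (length(out) < n) {
    draws <- rpois(n, lambda)
    out <- c(out, draws[draws >= 1 & draws <= 4])
  }
  out[seq_len(n)]
}

b <- c(rtrunc_pois(2), 8)  # the two "<5" entries from the question, plus the 8
mean(b)
```

Repeating the imputation many times and averaging the results gives a sense of how sensitive the summary statistic is to the assumed distribution.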

Severin Pappadeux

In statistics, "less than" data is known as "left censored" (https://en.wikipedia.org/wiki/Censoring_(statistics)); searching on "censored data" might help.

My favoured approach to analysing such data is maximum likelihood (https://en.wikipedia.org/wiki/Maximum_likelihood). There are a number of R packages for maximum likelihood estimation; I like the survival package (https://cran.r-project.org/web/packages/survival/index.html), but there are others, e.g. fitdistrplus (https://cran.r-project.org/web/packages/fitdistrplus/index.html), which "provides functions for fitting univariate distributions to different types of data (continuous censored or non-censored data and discrete data) and allowing different estimation methods (maximum likelihood, moment matching, quantile matching and maximum goodness-of-fit estimation)".

You will have to specify (assume?) the form of the distribution of the data; you say the values are integers, so a Poisson (or related) distribution may be appropriate.
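To make the likelihood idea concrete, here is a base-R sketch for a Poisson model in which each `<5` entry contributes P(X <= 4) to the likelihood; the observed values and the number of censored entries below are invented purely for illustration.

```r
# Hypothetical data: two fully observed counts and two left-censored "<5" entries.
obs    <- c(8, 12)
n_cens <- 2

# Negative log-likelihood: observed values contribute dpois(); each censored
# value contributes the probability of falling at or below 4, i.e. ppois(4, ...).
negloglik <- function(lambda) {
  -(sum(dpois(obs, lambda, log = TRUE)) +
      n_cens * ppois(4, lambda, log.p = TRUE))
}

fit <- optimize(negloglik, interval = c(0.01, 50))
fit$minimum  # maximum-likelihood estimate of lambda
```

The same censored-likelihood construction is what packages like survival and fitdistrplus do for you, for a wider range of distributions.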

user20637

Treat them as draws from a probability distribution of your choosing, and replace them with actual randomly generated numbers. Setting them all equal to 2.5, a normal-like distribution capped at 0 and 5, or a uniform distribution on [0,5] are all options.
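A minimal sketch of the uniform option in base R, using `sample()` so the replacements come out as integers in 1..4 (this assumes the `<5` entries were read in as `NA`; the helper name `replace_lt5` is made up for illustration):

```r
set.seed(42)

# Replace each censored entry (assumed to be NA after import) with a
# uniformly drawn integer from 1..4.
replace_lt5 <- function(x) {
  censored <- is.na(x)
  x[censored] <- sample(1:4, sum(censored), replace = TRUE)
  x
}

b <- replace_lt5(c(NA, NA, 8))
mean(b)
```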

hedgedandlevered
    You don't sample integers with continuous distributions like normal and uniform – Severin Pappadeux Feb 25 '16 at 18:58
  • it should go without saying that you would use `round()`. And I said normal-like, because a true normal distribution would be boundless. – hedgedandlevered Feb 25 '16 at 19:20
  • 1
    `round()` won't produce integers no matter how you try. `q <- 2.2; t <- round(q, 0); print(class(t))` will print `numeric` – Severin Pappadeux Feb 25 '16 at 19:24
  • ok, then `as.integer()` it. And all of that is if you actually need the representation of integers between 0 and 5 to be integers. If you're doing averaging (or addition without a need for an integer answer), that isn't the case. Note that this is a conceptual question, and without that specification, I have no reason to consider that to be a requirement. – hedgedandlevered Feb 25 '16 at 19:27
  • well, I proposed to use some discrete distribution (perhaps, with rejection). That will solve all problems at once – Severin Pappadeux Feb 25 '16 at 19:54
  • yes. If rounding or integers are problems. in which case `runif() %>% round() %>% as.integer()` accomplishes the same. either way. – hedgedandlevered Feb 25 '16 at 19:59

I deal with similar data regularly. I strongly dislike any of the suggestions to replace the <5 values with a particular number. Consider the following two cases:

  • c(<5,<5,<5,<5,<5,<5,<5,<5,6,12,18)
  • c(<5,6,12,18)

The problem comes when you try to do arithmetic with these: any fixed replacement value distorts the first case far more than the second, because there the censored values make up most of the sample.

I think a solution to your issue is to think of the values as factors (in the R sense). You can bin the values above 5 too if that helps, for example:

  • c(<5,<5,<5,<5,<5,<5,<5,<5,5-9,10-14,15-19)
  • c(<5,5-9,10-14,15-19)

Now, you still wouldn't do arithmetic on these, but your summary statistics (histograms/proportion tables/etc...) would make more sense.
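A base-R sketch of that binning, with break points chosen to match the example intervals above (the raw vector is hypothetical):

```r
# Hypothetical raw data: eight "<5" entries plus three observed counts.
vals <- c(rep("<5", 8), "6", "12", "18")

num <- suppressWarnings(as.numeric(vals))  # "<5" becomes NA

# Bin the observed counts into the same width-5 intervals as "<5".
binned <- ifelse(is.na(num), "<5",
                 as.character(cut(num, breaks = c(4, 9, 14, 19),
                                  labels = c("5-9", "10-14", "15-19"))))
binned <- factor(binned, levels = c("<5", "5-9", "10-14", "15-19"))
table(binned)
```

A proportion table then follows naturally with `prop.table(table(binned))`.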

Jonathan Carroll