Median of a frequency distribution

Question

I want to calculate the median of a frequency distribution for a large number of samples. Each sample has a number of classes (3 in the example) and their respective frequencies. Each class is associated with a different value, and those values differ from sample to sample.

data <- data.frame(sample=c(1,2,3,4,5), 
                   freq_class1=c(1,1,59,10,2), 
                   freq_class2=c(1,0,35,44,22), 
                   freq_class3=c(0,4,1,9,2), 
                   value_class1=c(12,11,14,11,13), 
                   value_class2=c(27,33,34,31,29), 
                   value_class3=c(75,78,88,81,65))

For example, the median of sample 1 would be 19.5. I assume that this can be done using quantile() on the frequency distribution of each sample, but all my attempts have failed.

Do you have any suggestions?

Can you please explain how you calculate the median to be 19.5? Since the values of class 1 have a max of 14, surely the median must be lower than 14. Please explain what your data means. — Andrie, Jan 22 '13 at 18:03
@Andrie his first sample has 1 value of 12 and 1 value of 27 (samples are rows, not columns--it's a strange way to set up the data set...). — Jonathan Christensen, Jan 22 '13 at 18:04
There are 37000 samples, so they are organized in rows to make things easier to grasp :) — user12975, Jan 22 '13 at 18:08
@user12975 Mind if I ask what kind of data this is? I can't help but be curious about data that only takes three values in each sample no matter how large the sample, but the three values are different every time... — Jonathan Christensen, Jan 22 '13 at 18:11
In reality there are 8 classes per sample. Each sample is a "census unit". I know how many properties between given sizes there are in each sample, and what is their average size per sample and class. Two different samples usually have different average sizes. The mess is a consequence of the data provider doing its best to aggregate nominal data to avoid researchers knowing too much about what people own or deforest. — user12975, Jan 22 '13 at 18:18

score 4 · Accepted Answer · answered Jan 22 '13 at 18:03

This is probably not the most elegant way, but it works: I recreate the full data vector from the information in the data.frame, then take the median of that vector. Wrapping this in a function lets me use apply to run it on each row of the data.frame.
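To see what the expansion does for a single row, here is sample 1 from the question worked by hand. This is only a small illustration; the variable names values, freqs and full are made up for the example and do not appear in the answer's function.

```r
# Sample 1 from the question: class values and their frequencies
values <- c(12, 27, 75)
freqs  <- c(1, 1, 0)

# rep() repeats each value as many times as its frequency says
full <- rep(values, times = freqs)
full          # 12 27
median(full)  # 19.5
```

The third class has frequency 0, so 75 never enters the expanded vector, and the median is the mean of 12 and 27.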

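find.median <- function(x) {
  # x is one row of data: sample, freq_class1..3, value_class1..3
  # repeat each class value by its frequency to rebuild the full sample
  full.x <- rep(x[5:7], times = x[2:4])
  median(full.x)
}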

> apply(data,1,find.median)
[1] 19.5 78.0 14.0 31.0 29.0
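If the frequencies are large, rebuilding every sample can use a lot of memory. A possible alternative, sketched below, reads the median off the cumulative frequencies instead of expanding the data. It is not part of the answer above: freq_median, freq_cols and value_cols are hypothetical names, the columns are picked by name prefix rather than by position, and the sketch assumes every sample has a total frequency of at least 1.

```r
data <- data.frame(sample = c(1, 2, 3, 4, 5),
                   freq_class1 = c(1, 1, 59, 10, 2),
                   freq_class2 = c(1, 0, 35, 44, 22),
                   freq_class3 = c(0, 4, 1, 9, 2),
                   value_class1 = c(12, 11, 14, 11, 13),
                   value_class2 = c(27, 33, 34, 31, 29),
                   value_class3 = c(75, 78, 88, 81, 65))

# Select columns by name prefix rather than by position
freq_cols  <- grep("^freq_class",  names(data))
value_cols <- grep("^value_class", names(data))

# Median from cumulative frequencies, without building the expanded vector.
# Assumption: the total frequency n is at least 1.
freq_median <- function(values, freqs) {
  ord <- order(values)
  v   <- values[ord]
  cum <- cumsum(freqs[ord])
  n   <- cum[length(cum)]
  # value found at 1-based position k of the sorted, expanded data
  value_at <- function(k) v[which(cum >= k)[1]]
  # the two middle positions coincide when n is odd
  (value_at(floor((n + 1) / 2)) + value_at(ceiling((n + 1) / 2))) / 2
}

medians <- sapply(seq_len(nrow(data)), function(i) {
  freq_median(unlist(data[i, value_cols], use.names = FALSE),
              unlist(data[i, freq_cols],  use.names = FALSE))
})
medians  # 19.5 78.0 14.0 31.0 29.0
```

On the example data this gives the same medians as the apply call above. Selecting columns by prefix should also carry over to the real data with 8 classes per sample, provided the column names follow the same pattern.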

Jonathan Christensen

Thanks a lot! I am still trying to understand how it goes though, it has nothing to do with what I was trying... – user12975 Jan 22 '13 at 18:22
