I want to calculate the median of a frequency distribution for a large number of samples. Each sample has a number of classes (3 in the example) with their respective frequencies, and each class is associated with a different value:

data <- data.frame(sample=c(1,2,3,4,5), 
                   freq_class1=c(1,1,59,10,2), 
                   freq_class2=c(1,0,35,44,22), 
                   freq_class3=c(0,4,1,9,2), 
                   value_class1=c(12,11,14,11,13), 
                   value_class2=c(27,33,34,31,29), 
                   value_class3=c(75,78,88,81,65))

For example, the median of sample 1 would be 19.5. I assume this can be done using quantile() on the frequency distribution of each sample, but all my attempts have failed.

Do you have any suggestions?

Gavin Simpson
user12975
  • Can you please explain how you calculate the median to be 19.5? Since the values of class 1 have a max of 14, surely the median must be lower than 14. Please explain what your data means. – Andrie Jan 22 '13 at 18:03
  • @Andrie his first sample has 1 value of 12 and 1 value of 27 (samples are rows, not columns--it's a strange way to set up the data set...). – Jonathan Christensen Jan 22 '13 at 18:04
  • @JonathanChristensen Aha – Andrie Jan 22 '13 at 18:05
  • There are 37000 samples, so they are organized in rows to make things easier to grasp :) – user12975 Jan 22 '13 at 18:08
  • @user12975 Mind if I ask what kind of data this is? I can't help but be curious about data that only takes three values in each sample no matter how large the sample, but the three values are different every time... – Jonathan Christensen Jan 22 '13 at 18:11
  • In reality there are 8 classes per sample. Each sample is a "census unit". I know how many properties fall within each size range in each sample, and what their average size is per sample and class. Two different samples usually have different average sizes. The mess is a consequence of the data provider doing its best to aggregate nominal data to avoid researchers knowing too much about what people own or deforest. – user12975 Jan 22 '13 at 18:18

1 Answer

This is probably not the most elegant way, but it works. Basically, I'm recreating the full data vector from the information contained in the data.frame, then taking the median of that. Wrapping this in a function lets me use apply to run it quickly on each row of the data.frame.
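To make the reconstruction step concrete, here is a small sketch of what it does for sample 1 alone. The variable names values, freqs and full are only for illustration; the numbers are taken from the first row of the question's data.

```r
# Class values and frequencies for sample 1 (first row of the data.frame)
values <- c(12, 27, 75)
freqs  <- c(1, 1, 0)

# Repeat each class value as many times as its frequency says;
# a frequency of 0 simply drops that class
full <- rep(values, times = freqs)

full          # the reconstructed observations: 12 and 27
median(full)  # mean of the two middle observations: 19.5
```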

find.median <- function(x) {
  # x is one row of data: positions 2:4 hold the frequencies, 5:7 the class values
  full.x <- rep(x[5:7],times=x[2:4])
  return(median(full.x))
}

> apply(data,1,find.median)
[1] 19.5 78.0 14.0 31.0 29.0
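
If expanding every sample into a full vector becomes too heavy (for instance with very large frequencies), one possible alternative is to work from the cumulative frequencies instead. The following is only a sketch: freq.median is a hypothetical helper, not part of the approach above, and it assumes the frequency columns start with "freq_", the value columns start with "value_", and both sets appear in the same class order.

```r
# Hypothetical helper: median of one sample from its class values and
# frequencies, without building the expanded vector.
freq.median <- function(values, freqs) {
  ord    <- order(values)              # sort classes by value
  values <- as.numeric(values)[ord]
  freqs  <- as.numeric(freqs)[ord]
  n <- sum(freqs)
  if (n == 0) return(NA_real_)         # no observations in this sample
  cum <- cumsum(freqs)
  # Values at the two middle positions (the same position when n is odd)
  lo <- values[which(cum >= floor((n + 1) / 2))[1]]
  hi <- values[which(cum >= ceiling((n + 1) / 2))[1]]
  (lo + hi) / 2
}

# Example data from the question
data <- data.frame(sample=c(1,2,3,4,5),
                   freq_class1=c(1,1,59,10,2),
                   freq_class2=c(1,0,35,44,22),
                   freq_class3=c(0,4,1,9,2),
                   value_class1=c(12,11,14,11,13),
                   value_class2=c(27,33,34,31,29),
                   value_class3=c(75,78,88,81,65))

# Assumption: columns are identified by their name prefixes
frqs <- as.matrix(data[grep("^freq_", names(data))])
vals <- as.matrix(data[grep("^value_", names(data))])

medians <- sapply(seq_len(nrow(data)),
                  function(i) freq.median(vals[i, ], frqs[i, ]))
medians
```

On the example data this gives the same five medians as the apply call above. Selecting columns by name rather than by position also means the same code should carry over to the real data with 8 classes, provided the naming assumption holds.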
Jonathan Christensen
  • Thanks a lot! I am still trying to understand how it goes though, it has nothing to do with what I was trying... – user12975 Jan 22 '13 at 18:22