How to down sample data keeping the same frequency distribution of one column

Question

I am trying to down sample a data set and keep the same frequency distribution as one of the column. The approach is 1) determine the baseline frequency distribution, 2) use the baseline frequency distribution to sample the row range, 3) use the sampled row range to select rows from the baseline data frame, 4) compare the baseline and down sampled frequency distributions. Here are two examples. In both examples, the events with the highest probability are oversampled, and the remaining events are under-sampled.

Function to resample the data keeping one column's frequency the same

sampFreq<-function(df,col,ns) {
  x<-as.factor(df[,col])
  freq_x<-table(x)
  prob_x<-freq_x/sum(freq_x)
  df_prob = prob_x[as.factor(df[,col])]
  nr=nrow(df)
  samp_rows = sample(1:nr,ns,replace=FALSE,prob=df_prob)
  return(df[samp_rows,])
}

Example 1

Steps 1) Specify the target frequency distribution 2) Convert to probability 3) Generate data with the target frequency distribution 4) Down sample the data using function above

cfreq_1=c(1,2,3,4,5,4,3,2,1)
freq_1 = matrix(cfreq_1, nrow = 1, ncol = length(cfreq_1), byrow = TRUE,
               dimnames = list(c("row1" ),
                               c(as.character(4+(1:length(cfreq_1))))))
pr_1=freq_1/sum(freq_1)
set.seed(31)
ns=5000
df_1a<-data.frame(nbr = sample(4+(1:length(pr_1)),ns,
                               replace=TRUE,prob=pr_1),
                  ord=1:ns)
df_1b<-sampFreq(df_1a, "nbr", 1000)

5) Get the frequency of the simulated and down sampled data 6) Sort the frequencies based on numeric values of the dimension names

tb_1a<-table(df_1a$nbr)
tb_1b<-table(df_1b$nbr)
s_tb_1a<-tb_1a[order(as.numeric(attr(tb_1a,"dimnames")[[1]]))]
s_tb_1b<-tb_1b[order(as.numeric(attr(tb_1b,"dimnames")[[1]]))]

7) Plot the specified probabilities, and the probabilities from the data and down sampled

plot(as.numeric(attr(pr_1,"dimnames")[[2]]),pr_1,log="y",ylim=c(.01,.3),
     cex=1.5,pch=15,col="black",type="o", lty=2, 
     xlab='event',ylab='Probability',main="Example 1, Oversample high prob, undersample low")
points(as.numeric(attr(tb_1a,"dimnames")[[1]]),s_tb_1a/sum(s_tb_1a),
       cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_1b,"dimnames")[[1]]),s_tb_1b/sum(s_tb_1b),
       cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
       col=c("black","blue","red"),lty=c(2,2,1))
grid()

Notice that the events with the highest probability are oversampled, while the other events are under sampled (red curve).

enter image description here

Example 2

txt = "0.028506949  0.059389476  0.285069486  0.282693907  0.242309063  2.974224967
 0.064140634  0.002375579  0.019004632  0.280318328  0.033258107  0.073642950
  0.007126737  0.007126737 39.045017223  2.261551253  0.052262739  0.045136002
  0.014253474  0.035633686  5.223898325  1.073761729  4.150136596  0.009502316
  5.038603160  1.021498990  4.017104169  0.002375579  0.073642950  1.197291840
  0.501247179  0.052262739  0.776814348  0.071267371  8.416676565  0.026131370
  0.019004632  0.002375579  0.168666112  0.023755790  5.718018767  0.501247179
  0.014253474  0.776814348  0.071267371  8.416676565  0.026131370  0.002375579
  0.002375579  0.168666112  0.023755790  5.718018767  0.194797482  0.028506949
  0.137783585  0.016629053  0.002375579  0.494120442  0.007126737  "

# Here is the target frequency distribution
cfreq_2=scan(text=txt,multi.line =TRUE)
freq_2 = matrix(cfreq_2, nrow = 1, ncol = length(cfreq_2), byrow = TRUE,
              dimnames = list(c("row1" ),
                              c(as.character(4+(1:length(cfreq_2))))))
# Convert to probability
pr_2=freq_2/sum(freq_2)

# Generate some data
ns=42095
df_2a<-data.frame(nbr = sample(4+(1:length(pr_2)),ns,
                               replace=TRUE,prob=pr_2),
                  ord=1:ns)
df_2b<-sampFreq(df_2a, "nbr", 10000)

tb_2a<-table(df_2a$nbr)
tb_2b<-table(df_2b$nbr)
s_tb_2a<-tb_2a[order(as.numeric(attr(tb_2a,"dimnames")[[1]]))]
s_tb_2b<-tb_2b[order(as.numeric(attr(tb_2b,"dimnames")[[1]]))]
plot(as.numeric(attr(pr_2,"dimnames")[[2]]),pr_2,log="y",ylim=c(.00001,.7),
     cex=1.5,pch=15,col="black",type="o", lty=2, 
     xlab='event',ylab='Probability',main="Example 2, Oversampled Point With High Prob, Undersampled Others")
points(as.numeric(attr(tb_2a,"dimnames")[[1]]),s_tb_2a/sum(s_tb_2a),
       cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_2b,"dimnames")[[1]]),s_tb_2b/sum(s_tb_2b),
       cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
       col=c("black","blue","red"),lty=c(2,2,1))
grid()

Here, there is only one event that is oversampled, while the remaining events are under sampled.

enter image description here

The question is why isn't the red lines closer to the other lines, and also there seems to be a systematic error.

The infrequent elements that have small frequencies are important to match/obtained, it is less important to match / obtain the frequencies of elements that occur frequently (large frequencies).

THis code is very confusing and i'm not sure i've described your intent very well so know what you really want it to do. You do realize that `sum(pr)=100`, right? So it's not really a probability because otherwise it should sum to 1, so it's really more of a weight. But `sum(pr)` is also one number, so you're multiplying the `tb_1*_chk` values by a constant and ignoring the expected probabilities. I really can't figure out what you're trying to do. — MrFlick, Oct 21 '14 at 03:15
@MrFlick Thanks for the feedback. The goal is to extract a subset that maintains the frequency distribution of one column. The steps are: 1) determine the frequency distribution of a column in a data frame, 2) down sample the row range with that frequency distribution, 3) obtain a down sampled data frame using the row sample, 4) compare the frequency distribution of the original and down sampled frames. pr represents an expected frequency distribution, pr/const is the same distribution, so pr/sum(pr) is probability. — , Oct 21 '14 at 08:29
To demonstrate the function, two different frequency distributions are artificially demonstrated for tables of different length. During the night, I did think of a different approach: convert example 1 into a function, and then give the function the inputs for example 1 and example 2. I'll add that function at the end of the question, and document the function better. Also, add plots of the frequency distributions. — , Oct 21 '14 at 08:30

score 1 · Answer 1 · answered Oct 21 '14 at 22:58

The following function gives the desired results.

sampFreq<-function(cdf,col,ns) {
  x<-as.factor(cdf[,col])
  freq_x<-table(x)
  prob_x<-freq_x/sum(freq_x)
  df_prob = prob_x[as.factor(cdf[,col])]
  nr=nrow(cdf)
  sLevels = levels(as.factor(cdf[,col]))
  nLevels = length(sLevels)
  rat = ns/nr
  rdata = NULL
  for (is in seq(1,nLevels)) {
    ldata <- cdf[cdf[,col]==sLevels[is],]
    ndata <- nrow(ldata)
    nsdata = max(ndata*rat,1)
    srows <- sample(seq(1,ndata),nsdata,replace=rat>1)
    sdata <- ldata[srows,]
    rdata <- rbind(rdata,sdata)
  }
  return(rdata)
}

Example 1 enter image description here Example 2

How to down sample data keeping the same frequency distribution of one column

Example 1

Example 2

1 Answers1