I am trying to down sample a data set and keep the same frequency distribution as one of the column. The approach is 1) determine the baseline frequency distribution, 2) use the baseline frequency distribution to sample the row range, 3) use the sampled row range to select rows from the baseline data frame, 4) compare the baseline and down sampled frequency distributions. Here are two examples. In both examples, the events with the highest probability are oversampled, and the remaining events are under-sampled.
Function to resample the data keeping one column's frequency the same
sampFreq<-function(df,col,ns) {
x<-as.factor(df[,col])
freq_x<-table(x)
prob_x<-freq_x/sum(freq_x)
df_prob = prob_x[as.factor(df[,col])]
nr=nrow(df)
samp_rows = sample(1:nr,ns,replace=FALSE,prob=df_prob)
return(df[samp_rows,])
}
Example 1
Steps 1) Specify the target frequency distribution 2) Convert to probability 3) Generate data with the target frequency distribution 4) Down sample the data using function above
cfreq_1=c(1,2,3,4,5,4,3,2,1)
freq_1 = matrix(cfreq_1, nrow = 1, ncol = length(cfreq_1), byrow = TRUE,
dimnames = list(c("row1" ),
c(as.character(4+(1:length(cfreq_1))))))
pr_1=freq_1/sum(freq_1)
set.seed(31)
ns=5000
df_1a<-data.frame(nbr = sample(4+(1:length(pr_1)),ns,
replace=TRUE,prob=pr_1),
ord=1:ns)
df_1b<-sampFreq(df_1a, "nbr", 1000)
5) Get the frequency of the simulated and down sampled data 6) Sort the frequencies based on numeric values of the dimension names
tb_1a<-table(df_1a$nbr)
tb_1b<-table(df_1b$nbr)
s_tb_1a<-tb_1a[order(as.numeric(attr(tb_1a,"dimnames")[[1]]))]
s_tb_1b<-tb_1b[order(as.numeric(attr(tb_1b,"dimnames")[[1]]))]
7) Plot the specified probabilities, and the probabilities from the data and down sampled
plot(as.numeric(attr(pr_1,"dimnames")[[2]]),pr_1,log="y",ylim=c(.01,.3),
cex=1.5,pch=15,col="black",type="o", lty=2,
xlab='event',ylab='Probability',main="Example 1, Oversample high prob, undersample low")
points(as.numeric(attr(tb_1a,"dimnames")[[1]]),s_tb_1a/sum(s_tb_1a),
cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_1b,"dimnames")[[1]]),s_tb_1b/sum(s_tb_1b),
cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
col=c("black","blue","red"),lty=c(2,2,1))
grid()
Notice that the events with the highest probability are oversampled, while the other events are under sampled (red curve).
Example 2
txt = "0.028506949 0.059389476 0.285069486 0.282693907 0.242309063 2.974224967
0.064140634 0.002375579 0.019004632 0.280318328 0.033258107 0.073642950
0.007126737 0.007126737 39.045017223 2.261551253 0.052262739 0.045136002
0.014253474 0.035633686 5.223898325 1.073761729 4.150136596 0.009502316
5.038603160 1.021498990 4.017104169 0.002375579 0.073642950 1.197291840
0.501247179 0.052262739 0.776814348 0.071267371 8.416676565 0.026131370
0.019004632 0.002375579 0.168666112 0.023755790 5.718018767 0.501247179
0.014253474 0.776814348 0.071267371 8.416676565 0.026131370 0.002375579
0.002375579 0.168666112 0.023755790 5.718018767 0.194797482 0.028506949
0.137783585 0.016629053 0.002375579 0.494120442 0.007126737 "
# Here is the target frequency distribution
cfreq_2=scan(text=txt,multi.line =TRUE)
freq_2 = matrix(cfreq_2, nrow = 1, ncol = length(cfreq_2), byrow = TRUE,
dimnames = list(c("row1" ),
c(as.character(4+(1:length(cfreq_2))))))
# Convert to probability
pr_2=freq_2/sum(freq_2)
# Generate some data
ns=42095
df_2a<-data.frame(nbr = sample(4+(1:length(pr_2)),ns,
replace=TRUE,prob=pr_2),
ord=1:ns)
df_2b<-sampFreq(df_2a, "nbr", 10000)
tb_2a<-table(df_2a$nbr)
tb_2b<-table(df_2b$nbr)
s_tb_2a<-tb_2a[order(as.numeric(attr(tb_2a,"dimnames")[[1]]))]
s_tb_2b<-tb_2b[order(as.numeric(attr(tb_2b,"dimnames")[[1]]))]
plot(as.numeric(attr(pr_2,"dimnames")[[2]]),pr_2,log="y",ylim=c(.00001,.7),
cex=1.5,pch=15,col="black",type="o", lty=2,
xlab='event',ylab='Probability',main="Example 2, Oversampled Point With High Prob, Undersampled Others")
points(as.numeric(attr(tb_2a,"dimnames")[[1]]),s_tb_2a/sum(s_tb_2a),
cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_2b,"dimnames")[[1]]),s_tb_2b/sum(s_tb_2b),
cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
col=c("black","blue","red"),lty=c(2,2,1))
grid()
Here, there is only one event that is oversampled, while the remaining events are under sampled.
The question is why isn't the red lines closer to the other lines, and also there seems to be a systematic error.
The infrequent elements that have small frequencies are important to match/obtained, it is less important to match / obtain the frequencies of elements that occur frequently (large frequencies).