I have computed a vector of the frequency of different events, represented as fractions and sorted in descending order. I need to interface to a tool that requires positive integer percentages that must sum up to exactly 100. I would like to generate the percentages in a fashion that best represents the input distribution. That is, I would like relationship (ratios) among the percentages to best match the one in the input fractions, despite any non-linearities resulting in cutting a long tail.
I have a function that generates these percentages, but I don't think it is optimal or elegant. In particular, I would like to do more of the work in numeric space before resorting to "stupid integer tricks".
Here is an example frequency vector:
fractionals <- 1 / (2 ^ c(2, 5:6, 8, rep(9,358)))
And here is my function:
# Convert vector of fractions to integer percents summing to 100
percentize <- function(fractionals) {
# fractionals is sorted descending and adds up to 1
# drop elements that wouldn't round up to 1% vs. running total
pctOfCum <- fractionals / cumsum(fractionals)
fractionals <- fractionals[pctOfCum > 0.005]
# calculate initial percentages
percentages <- round((fractionals / sum(fractionals)) * 100)
# if sum of percentages exceeds 100, remove proportionally
i <- 1
while (sum(percentages) > 100) {
excess <- sum(percentages) - 100
if (i > length(percentages)) {
i <- 1
}
partialExcess <- max(1, round((excess * percentages[i]) / 100))
percentages[i] <- percentages[i] - min(partialExcess,
percentages[i] - 1)
i <- i + 1
}
# if sum of percentages shorts 100, add proportionally
i <- 1
while (sum(percentages) < 100) {
shortage <- 100 - sum(percentages)
if (i > length(percentages)) {
i <- 1
}
partialShortage <- max(1, round((shortage * percentages[i]) / 100))
percentages[i] <- percentages[i] + partialShortage
i <- i + 1
}
return(percentages)
}
Any ideas?