5

Assume that I have a vector, V, of positive integers. If the sum of the integers is larger than a positive integer N, I want to rescale the integers in V so that the sum is <= N. The elements in V must remain above zero. The length of V is guaranteed to be <= N.

Is there an algorithm to perform this rescaling in linear time?

This is not homework, BTW :). I need to rescale a map from symbols to symbol frequencies to use range encoding.

Some quick thinking and googling has not given a solution to the problem.

EDIT:

Ok, the question was somewhat unclear. "Rescale" means "normalize". That is, transform the integers in V, for example by multiplying them by a constant, to smaller positive integers so the criterion of sum(V) <= N is fulfilled. The better the ratios between the integers are preserved, the better the compression will be.

The problem is open-ended in that sense: the method does not need to find the optimal (in, say, a least-squares sense) way to preserve the ratios, just a "good" one. Setting the entire vector to 1, as suggested, is not acceptable (unless forced). "Good enough" would, for example, be finding the smallest divisor (defined below) that fulfills the sum criterion.

The following naive algorithm does not work.

  1. Find the current sum(V), Sv
  2. divisor := int(ceil(Sv/N))
  3. Divide each integer in V by divisor, rounding down, but not to less than 1.

This fails on v = [1,1,1,10] with N = 5.

divisor = ceil(13 / 5) = 3.
V := [1,1,1, max(1, floor(10/3)) = 3]
Sv is now 6 > 5.

In this case, the correct normalization is [1,1,1,2]
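For reference, a minimal Python sketch of the naive steps above, reproducing the failure. The function name `naive_rescale` is made up for illustration.

```python
def naive_rescale(v, n):
    """Naive approach: one integer divisor, round down, clamp at 1."""
    total = sum(v)
    if total <= n:
        return list(v)
    divisor = (total + n - 1) // n  # integer ceil(total / n)
    return [max(1, x // divisor) for x in v]


result = naive_rescale([1, 1, 1, 10], 5)
print(result, sum(result))  # [1, 1, 1, 3] 6
```

The clamp to 1 is what breaks the sum: the three 1s are not shrunk at all, so the divisor derived from the total is too small.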

One algorithm that would work is to do a binary search for the divisor (defined above), starting with the ceil(Sv/N) guess, until the smallest divisor in [1,N] fulfilling the sum criterion is found. This is, however, not linear in the number of operations, but proportional to len(V)*log(len(V)).
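A rough Python sketch of that binary search, under a couple of assumptions that are not in the question: it searches the range [1, max(V)] rather than [1,N], because dividing by max(V) clamps every element to 1 and so always satisfies the sum criterion when len(V) <= N; and it starts the lower bound at 1, because rounding down can make a divisor smaller than ceil(Sv/N) sufficient. The function names are made up.

```python
def scaled_sum(v, d):
    """Sum of V after dividing by d, rounding down, clamping at 1."""
    return sum(max(1, x // d) for x in v)


def smallest_divisor(v, n):
    """Smallest integer divisor whose scaled sum is <= n.

    Relies on scaled_sum being non-increasing in d, and on
    len(v) <= n so that d = max(v) is always feasible.
    """
    lo, hi = 1, max(v)
    while lo < hi:
        mid = (lo + hi) // 2
        if scaled_sum(v, mid) <= n:
            hi = mid      # mid works, try smaller divisors
        else:
            lo = mid + 1  # mid is too small
    return lo


def rescale_with_smallest_divisor(v, n):
    d = smallest_divisor(v, n)
    return [max(1, x // d) for x in v]


print(smallest_divisor([1, 1, 1, 10], 5))               # 4
print(rescale_with_smallest_divisor([1, 1, 1, 10], 5))  # [1, 1, 1, 2]
```

Each probe costs one pass over V, so the total work is len(V) times the number of probes, which is logarithmic in the size of the search range.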

I am starting to think that it is impossible to do well, in linear time, in the general case. I might resort to some sort of heuristic.

Gurgeh

4 Answers

5

Just divide all the integers by their Greatest Common Divisor. You can find the GCD efficiently with multiple applications of Euclid's Algorithm.

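from math import gcd

d = 0
for x in xs:
    d = gcd(d, x)  # gcd(0, x) == x, so d starts out neutral

xs = [x // d for x in xs]  # integer division keeps the values integers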

The upside is that you always get the smallest possible representation this way, without throwing away any precision and without needing to choose a specific N. The downside is that if your frequencies are large coprime numbers, you will have no choice but to sacrifice precision (and you didn't specify what should be done in this case).

hugomg
  • I have now clarified my question, see EDIT above. In the light of this clarification, this method would suck, because a 1 in the vector would turn the whole vector into just 1s. – Gurgeh May 16 '11 at 18:18
  • I do not see why a 1 would turn everything into 1s, as then everything is divided by 1 (and thus stays as is). But another problem of his solution is that the sum needn't be smaller than N. It is just the smallest possible solution without changing the ratios of the values, but for some configurations you just have to change these ratios and thus lose some precision. – Christian Rau May 16 '11 at 19:49
  • Yes, you are absolutely right Christian. I stupidly switched my counter arguments for this method and another. However, it still does not work. Losing precision is not the problem. It is finding a reliable way to switch these ratios, that does not include trial and error, that is the problem. – Gurgeh May 17 '11 at 07:50
  • @Gurgeh: Unless you can decide mathematically what you want to do, it is hard to find something more specific than just dividing by GCD. Why is it so important that numbers must be stored as integers? Can't you use floats or rationals (fractions)? – hugomg May 17 '11 at 13:24
  • If you do not require vector components to be integers, the solution is very easy, as I showed in my post. – ascanio May 17 '11 at 14:42
1

I think you should just rescale the part above 1. So, subtract 1 from all values, and V.length from N. Then rescale normally, then add 1 back. You can even do slightly better if you keep running totals as you go along, instead of choosing just one factor, which will usually waste some "number space". Something like this:

public static void rescale(int[] data, int N) {
    int sum = 0;
    for (int d : data)
        sum += d;

    if (sum > N) {
        int n = N - data.length;
        sum -= data.length;

        for (int a = 0; a < data.length; a++) {
            int toScale = data[a] - 1;
            int scaled = Math.round(toScale * (float) n / sum);

            data[a] = scaled + 1;
            n -= scaled;
            sum -= toScale;
        }
    }
}
xs0
  • It depends on what you mean by "rescale". – thomson_matt May 16 '11 at 17:07
  • well, I guess it means that they should be multiplied with the same factor and maybe rounded.. the question is unclear, though - setting everything to 1 would also produce a sum <= N.. – xs0 May 16 '11 at 17:11
  • The problem is that the scale factor will not scale as much as you think, because the 1s will stay the same. – Gurgeh May 16 '11 at 19:07
1

How about this:

  1. Find the current sum(V), Sv
  2. divisor := int(ceil(Sv/(N - |V| + 1)))
  3. Divide each integer in V by divisor, rounding up

On v = [1,1,1,10] with N = 5:

divisor = ceil(13 / 2) = 7. V := [1,1,1, ceil(10/7) = 2]
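The three steps above could be sketched in Python roughly as follows. The function name `rescale_round_up` is an assumption, and integer arithmetic is used for the ceilings to avoid floating-point rounding.

```python
def rescale_round_up(v, n):
    """Divide by ceil(Sv / (N - |V| + 1)), rounding each element up."""
    total = sum(v)
    if total <= n:
        return list(v)
    budget = n - len(v) + 1         # >= 1 because len(v) <= n
    divisor = -(-total // budget)   # integer ceil(total / budget)
    return [-(-x // divisor) for x in v]  # integer ceil(x / divisor)


print(rescale_round_up([1, 1, 1, 10], 5))  # [1, 1, 1, 2]
```

Rounding up keeps every element at 1 or more without a separate clamp. Each element grows by less than 1 through rounding, which is why the divisor is computed against N - |V| + 1 rather than N. It is a single pass for the sum and a single pass for the division, so it is linear in len(V).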

starblue
  • Best comment so far! It will not give the optimal divisor (let alone the optimal float divisor, which would be even better), but it should at least fulfill the sum criterion. – Gurgeh May 17 '11 at 08:05
  • This algorithm is, incidentally, a simplification of a more exact algorithm that I was considering. But having tried out yours, it seems to preserve the ratios reasonably, so I might go with that anyway. Since I know that divisor will never be very large, let us say no more than 100, I can create a counting vector, C, that has 100 zeroes. Then run through V, doing C[x]++ for x in V below 100. Finally I can calculate an optimal float divisor by substituting |V| for sum(C2), where C2 is the range of C with index below divisor. – Gurgeh May 17 '11 at 08:07
0

This is a problem of 'range normalization', but it's very easy. Suppose that S is the sum of the elements of the vector and S>=N; then S=dN for some d>=1, and therefore d=S/N. So just multiply every element of the vector by N/S (i.e. divide by d). The result is a vector of rescaled components whose sum is exactly N. This procedure is clearly linear :)
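A quick Python sketch of this scaling, applied to the example vector from the question (the function name is made up). Note that the results are floats, and elements can drop below 1.

```python
def scale_to_sum(v, n):
    """Multiply every element by n / sum(v); the results are floats."""
    total = sum(v)
    return [x * n / total for x in v]


scaled = scale_to_sum([1, 1, 1, 10], 5)
print(scaled)           # the three 1s each become 5/13, about 0.38
print(min(scaled) < 1)  # True
```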

ascanio
  • I see, but you didn't request integers in your original post :) Maybe one could start by rescaling to doubles, and then apply some form of 'quantization' which respects the following two rules: a) every element should be at least 1, and b) the sum of all elements should not exceed N. But it's clear that in some cases (such as your counter-example) you will lose a lot of information. – ascanio May 16 '11 at 18:31
  • Even his question title requests integers, not to forget his original post. – Christian Rau May 16 '11 at 19:45