Let P and Q be two finite probability distributions on integers, with support between 0 and some large integer N. The one-dimensional earth mover's distance between P and Q is the minimum cost you have to pay to transform P into Q, considering that it costs r*|n-m| to "move" a probability r associated to integer n to another integer m.
There is a simple algorithm to compute this. In pseudocode:
previous = 0
sum = 0
for i from 0 to N:
previous = P(i) - Q(i) + previous
sum = sum + abs(previous) // abs = absolute value
return sum
Now, suppose you have two tables that contain each a probability distribution. Column n
contains integers, and column p
contains the corresponding probability. The tables are correct (all probabilities are between 0 and 1, their sum is I want to compute the earth mover's distance between these two tables in BigQuery (Standard SQL).
- Is it possible? I feel like one would need to use analytical functions, but I don't have much experience with them, so I don't know how to get there.
- What if N (the maximum integers) is very large, but my tables are not? Can we adapt the solution to avoid doing a computation for each integer i?