1

Let P and Q be two finite probability distributions on integers, with support between 0 and some large integer N. The one-dimensional earth mover's distance between P and Q is the minimum cost you have to pay to transform P into Q, considering that it costs r*|n-m| to "move" a probability r associated to integer n to another integer m.

There is a simple algorithm to compute this. In pseudocode:

previous = 0
sum = 0
for i from 0 to N:
    previous = P(i) - Q(i) + previous
    sum = sum + abs(previous)         // abs = absolute value
return sum

Now, suppose you have two tables that contain each a probability distribution. Column n contains integers, and column p contains the corresponding probability. The tables are correct (all probabilities are between 0 and 1, their sum is I want to compute the earth mover's distance between these two tables in BigQuery (Standard SQL).

  1. Is it possible? I feel like one would need to use analytical functions, but I don't have much experience with them, so I don't know how to get there.
  2. What if N (the maximum integers) is very large, but my tables are not? Can we adapt the solution to avoid doing a computation for each integer i?
Ted
  • 972
  • 2
  • 11
  • 20

2 Answers2

1

Hopefully I fully understand your problem. This seems to be what you're looking for:

WITH Aggr AS (
  SELECT rp.n AS n, SUM(rp.p - rq.p)
  OVER(ORDER BY rp.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS emd
  FROM P rp
  LEFT JOIN Q rq
  ON rp.n = rq.n
) SELECT SUM(ABS(a.emd)) AS total_emd
FROM Aggr a;

WRT question #2, note that we only scan what's actually in tables, regardless of the N, assuming a one-to-one match for every n in P with n in Q.

  • I think there are a few issues with this: 1. This is a LEFT JOIN, so if P is missing some rows (implicitly equal to zero), this probably won't work? 2. The computation doesn't seem to depend on the value of n, which is suspicious (the earth mover distance depends on "how far" the values of n are from each other). So it seems to work only if all possible values of n (between 0 or N) are represented in the table. – Ted Feb 27 '18 at 13:34
0

I adapted Michael's answer to fix its issues, here's the solution I ended up with. Suppose the integers are stored in column i and the probability in column p. First I join the two tables, then I compute EMD(i) for all i using the window, then I sum all absolute values.

WITH
joined_table AS (
  SELECT
    IFNULL(table1.i, table2.i) AS i,
    IFNULL(table1.p, 0) AS p,
    IFNULL(table2.p, 0) AS q,
  FROM table1
  OUTER JOIN table2
  ON table1.i = table2.i
),
aggr AS (
  SELECT
    (SUM(p-q) OVER win) * (i - (LAG(i,1) OVER win)) AS emd
  FROM joined_table
  WINDOW win AS (
    ORDER BY i
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
SELECT SUM(ABS(emd)) AS total_emd
FROM aggr
Ted
  • 972
  • 2
  • 11
  • 20
  • thanks for the solution. Any ideas on how to proceed if the arrays have different length? – Gregory Mazur Feb 23 '21 at 13:21
  • It's been a while, but I think the solution should work as is? The initial join doesn't depend on the assumption that both arrays have the same length. – Ted Feb 23 '21 at 16:00
  • Well maybe I`m implementing it wrong. I have two sets of values with no 'i' that would match them, so in order to match these I create a row_number as follows for both sets: (SELECT row_number() over(ORDER BY s1.scored_usage) as no, s1.scored_usage from scores s1) s2 – Gregory Mazur Feb 24 '21 at 09:50
  • What's wrong with using the solution above (the `joined_table` expression that does the outer join) instead of using `row_number()`? – Ted Feb 24 '21 at 09:54
  • I dont have 'i' only table1.p table2.p – Gregory Mazur Feb 24 '21 at 16:14
  • Well then how do you know which index each p is associated with? – Ted Feb 24 '21 at 22:47
  • they are not associated, wanted to repeat behavior of scipy.wasserstein_disnance() – Gregory Mazur Feb 25 '21 at 10:34