
I have a DataFrame with two list-type columns of equal lengths. I would like to take the dot product of these two columns without having to "explode" the lists (as that would take too much memory).

The following code runs, but it seems to only compute the dot product of the first column with itself.

import polars as pl

df = pl.DataFrame({
    "b": [[1, 2, 3], [2, 3], [4], [1, 2, 3], []],
    "c": [[1, 1, 1], [2, 2], [4], [1, 5, 6], []]
})
df.with_columns(pl.col('b').arr.eval(pl.element().dot(pl.col('c'))).alias('b.c'))
shape: (5, 3)
┌───────────┬───────────┬───────────┐
│ b         ┆ c         ┆ b.c       │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2, 3] ┆ [1, 1, 1] ┆ [14]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [2, 3]    ┆ [2, 2]    ┆ [13]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4]       ┆ [4]       ┆ [16]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [1, 2, 3] ┆ [1, 5, 6] ┆ [14]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ []        ┆ []        ┆ [null]    │
└───────────┴───────────┴───────────┘

2 Answers


Inside a List context (arr.eval), the arguments to pl.col are ignored, and pl.col becomes an alias to pl.element. Indeed, if we use pl.col("foobar"), we get the same result.

>>> df.with_columns(pl.col('b').arr.eval(pl.element().dot(pl.col('foobar'))).alias('b.c'))
shape: (5, 3)
┌───────────┬───────────┬───────────┐
│ b         ┆ c         ┆ b.c       │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2, 3] ┆ [1, 1, 1] ┆ [14]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [2, 3]    ┆ [2, 2]    ┆ [13]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4]       ┆ [4]       ┆ [16]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [1, 2, 3] ┆ [1, 5, 6] ┆ [14]      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ []        ┆ []        ┆ [null]    │
└───────────┴───────────┴───────────┘

As such, inside a List context for a particular column, we cannot directly access values from other columns.

I'll propose two rather RAM-hungry solutions, along with a way to tame their RAM requirements while still getting great performance (far better than using apply).

Two Solutions

Solution #1: Stack the Lists

One way around the issue above is to use arr.concat to stack the two equal-sized lists into a single List. This way, all the necessary values for the dot calculation are in one List.

To access the values from b and c, we can use slice to address each half of the stacked List separately.

Because we are using arr.eval, the result of our dot computation is a column of type List, with one element (the dot product) in each List. To convert the List to a scalar, we can use arr.first.

(
    df
    .with_column(
        pl.col("b")
        .arr.concat("c")                  # stack b and c into a single list per row
        .arr.eval(
            pl.element()
            .slice(0, pl.count() // 2)    # first half: the values from b
            .dot(
                pl.element()
                .slice(pl.count() // 2, pl.count() // 2)  # second half: the values from c
            ),
            parallel=True
        )
        .arr.first()                      # unwrap the one-element list to a scalar
        .alias('result')
    )
)
shape: (5, 3)
┌───────────┬───────────┬────────┐
│ b         ┆ c         ┆ result │
│ ---       ┆ ---       ┆ ---    │
│ list[i64] ┆ list[i64] ┆ i64    │
╞═══════════╪═══════════╪════════╡
│ [1, 2, 3] ┆ [1, 1, 1] ┆ 6      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [2, 3]    ┆ [2, 2]    ┆ 10     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [4]       ┆ [4]       ┆ 16     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [1, 2, 3] ┆ [1, 5, 6] ┆ 29     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ []        ┆ []        ┆ null   │
└───────────┴───────────┴────────┘

Note the parallel=True keyword. Although this greatly speeds up the algorithm by parallelizing the work, it also inflates the RAM requirements.

Solution #2: Explode

I realize that you specifically did not want this option. But we can tame the RAM requirements and still get great performance.

One way we tame RAM usage is to select only the columns needed before the explode step, so that we are not replicating values across all columns of the DataFrame. We can then hstack the resulting one-column DataFrame onto our original DataFrame.

Note that maintain_order=True is important: it preserves the ordering of the rows for the final hstack.

result = (
    df
    .select(['b', 'c'])                          # keep only the columns we need
    .with_row_count()                            # tag each row before exploding
    .explode(['b', 'c'])
    .groupby('row_nr', maintain_order=True)      # regroup by original row
    .agg(pl.col('b').dot('c').alias('result'))
    .drop('row_nr')
)

df.hstack(result)
shape: (5, 3)
┌───────────┬───────────┬────────┐
│ b         ┆ c         ┆ result │
│ ---       ┆ ---       ┆ ---    │
│ list[i64] ┆ list[i64] ┆ i64    │
╞═══════════╪═══════════╪════════╡
│ [1, 2, 3] ┆ [1, 1, 1] ┆ 6      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [2, 3]    ┆ [2, 2]    ┆ 10     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [4]       ┆ [4]       ┆ 16     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [1, 2, 3] ┆ [1, 5, 6] ┆ 29     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ []        ┆ []        ┆ null   │
└───────────┴───────────┴────────┘

This algorithm is far less RAM-hungry than the first solution.

Taming the RAM requirements: slice

For large DataFrames and long Lists, the above two algorithms can use a lot of RAM.

One great way to tame this (and still get great performance) is to perform our algorithm on slices of the DataFrame and then concatenate the results. slice does not make copies of data, and as such is very fast.

We do pay a performance penalty for having to concatenate our results at the end. However, our results are a DataFrame with a single column (our dot-product result), so the concatenation step is not terribly time-consuming.

Here are the sliced versions of the algorithms. I've used a configurable slice_size variable for the number of rows in each slice. Setting it higher yields a faster result but uses more RAM; setting it lower saves RAM at the price of wall-clock time.

Solution #1: Sliced

For the sliced version, I have replaced the with_column with select, so that the result is a one-column DataFrame containing only the dot product.

slice_size = 10_000
result = pl.concat([
    df
    .slice(next_index, slice_size)
    .select(
        pl.col("b")
        .arr.concat("c")
        .arr.eval(
            pl.element()
            .slice(0, pl.count() // 2)
            .dot(
                pl.element()
                .slice(pl.count() // 2, pl.count() // 2)
            ),
            parallel=True
        )
        .arr.first()
        .alias('result')
    )
    for next_index in range(0, df.height, slice_size)
])

df.hstack(result)

Solution #2: Sliced

slice_size = 10_000
result = pl.concat([
    (
        df
        .slice(next_index, slice_size)
        .select(['b', 'c'])
        .with_row_count()
        .explode(['b', 'c'])
        .groupby('row_nr', maintain_order=True)
        .agg(pl.col('b').dot('c').alias('result'))
        .drop('row_nr')
    )
    for next_index in range(0, df.height, slice_size)
])

df.hstack(result)

I would suggest first trying the above two sliced solutions on a small subset of your DataFrame. This should give you an idea as to the best value for slice_size on your computing platform.
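For instance, a minimal timing harness along these lines can help compare candidate slice sizes. The run_sliced helper is just hypothetical shorthand for one of the sliced algorithms (Solution #2 here), and the subset size is a placeholder:

import time

# Hypothetical helper: wraps the sliced explode algorithm so we can
# call it with different slice sizes.
def run_sliced(df: pl.DataFrame, slice_size: int) -> pl.DataFrame:
    return pl.concat([
        df
        .slice(next_index, slice_size)
        .select(['b', 'c'])
        .with_row_count()
        .explode(['b', 'c'])
        .groupby('row_nr', maintain_order=True)
        .agg(pl.col('b').dot('c').alias('result'))
        .drop('row_nr')
        for next_index in range(0, df.height, slice_size)
    ])

subset = df.head(1_000_000)  # a small subset of your actual data
for slice_size in (10_000, 100_000, 1_000_000):
    start = time.perf_counter()
    run_sliced(subset, slice_size)
    print(f"slice_size={slice_size:>9,}: {time.perf_counter() - start:.2f} sec")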

Performance

So, how do these perform? I'll take your example dataset and inflate it to 100 million records.

nbr_records = 100_000_000
nbr_groups = nbr_records // 5
df = (
    pl.DataFrame({'group': pl.arange(0, nbr_groups, eager=True)})
    .join(
        pl.DataFrame({
            "b": [[1, 2, 3], [2, 3], [4], [1, 2, 3], []],
            "c": [[1, 1, 1], [2, 2], [4], [1, 5, 6], []],
        }),
        how="cross"
    )
)
df
shape: (100000000, 3)
┌──────────┬───────────┬───────────┐
│ group    ┆ b         ┆ c         │
│ ---      ┆ ---       ┆ ---       │
│ i64      ┆ list[i64] ┆ list[i64] │
╞══════════╪═══════════╪═══════════╡
│ 0        ┆ [1, 2, 3] ┆ [1, 1, 1] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ [2, 3]    ┆ [2, 2]    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ [4]       ┆ [4]       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0        ┆ [1, 2, 3] ┆ [1, 5, 6] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...      ┆ ...       ┆ ...       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ [2, 3]    ┆ [2, 2]    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ [4]       ┆ [4]       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ [1, 2, 3] ┆ [1, 5, 6] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ []        ┆ []        │
└──────────┴───────────┴───────────┘

I timed each of these algorithms using various slice sizes in the table below.

For reference, my computing platform is a 32-core Threadripper Pro with 512 GB of RAM.

Before constructing the 100 million row DataFrame, my RAM usage is roughly 4.7 GB. When the 100 million row DataFrame is loaded, my RAM usage rises to about 9.7 GB - before the algorithms are run.

For each run, I recorded the maximum observed RAM, to give an idea as to how RAM-hungry the algorithms are with various slice sizes.

Algorithm   Slice size     Wall-clock time   Maximum observed RAM
#1          1,000,000      95 sec            11.0 GB
#1          10,000,000     91 sec            15.3 GB
#1          100,000,000    80 sec            54.4 GB
#2          1,000,000      85 sec            10.5 GB
#2          10,000,000     85 sec            11.6 GB
#2          100,000,000    84 sec            21.3 GB

One notable result: the explode solution does not suffer significantly when the algorithm is applied in slices.

Obviously, how each of these performs on your actual data and your computing platform will differ from the numbers above. And depending on the size of your lists, you may see one perform significantly better than the other.
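If you want to record peak RAM on your own runs, one rough approach (assuming Linux/macOS; the resource module is not available on Windows) uses the standard library:

import resource

# Peak resident set size of the current process so far.
# Units are kilobytes on Linux and bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"max observed RSS: {peak:,}")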

Compared to using apply

How much faster are the above two solutions than simply using apply and the numpy dot function on our 100 million records?

import numpy as np

(
    df
    .with_column(
        pl.struct(['b', 'c'])
        .apply(lambda cols: np.dot(cols['b'], cols['c']))
        .alias('result')
    )
)

Time: 390 seconds

  • As of Polars Version 0.18.0, I can not reproduce your results. In fact, your 'Solution #1: Stack the Lists' code leads to the same wrong result as your first example with the results column [14, 13, 16, 14, null]. Is there an updated alternative way to compute the dot product the Polars way? – Joel Beck Jun 07 '23 at 10:00
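For newer Polars releases, a sketch of the explode-based approach using the renamed APIs might look like the following. The exact names and versions are assumptions: groupby became group_by, with_row_count became with_row_index, and the arr list namespace became list.

# Sketch for newer Polars releases; the renamed APIs used here
# (group_by, with_row_index) are assumptions about your installed version.
result = (
    df.select(["b", "c"])
    .with_row_index()                        # default index column is named "index"
    .explode(["b", "c"])
    .group_by("index", maintain_order=True)
    .agg(pl.col("b").dot("c").alias("b.c"))
    .drop("index")
)
df.hstack(result)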

You can use apply on the columns and then compute the dot product with NumPy, as below. It is also more readable.

import numpy as np

def dot_product(a, b):
    return np.dot(np.array(a), np.array(b))

df = df.with_column(
    pl.struct(["b", "c"]).apply(lambda x: dot_product(x["b"], x["c"])).alias("b.c")
)