Inside a List context (arr.eval), the arguments to pl.col are ignored, and pl.col becomes an alias for pl.element. Indeed, if we use pl.col("foobar"), we get the same result.
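(Here, df is the example DataFrame from your question; a minimal reconstruction, using the same b and c lists that appear in the output below:)

import polars as pl

df = pl.DataFrame({
    "b": [[1, 2, 3], [2, 3], [4], [1, 2, 3], []],
    "c": [[1, 1, 1], [2, 2], [4], [1, 5, 6], []],
})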
>>> df.with_columns(pl.col('b').arr.eval(pl.element().dot(pl.col('foobar'))).alias('b.c'))
shape: (5, 3)
┌───────────┬───────────┬───────────┐
│ b ┆ c ┆ b.c │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2, 3] ┆ [1, 1, 1] ┆ [14] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [2, 3] ┆ [2, 2] ┆ [13] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4] ┆ [4] ┆ [16] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [1, 2, 3] ┆ [1, 5, 6] ┆ [14] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [] ┆ [] ┆ [null] │
└───────────┴───────────┴───────────┘
As such, inside a List context for a particular column, we cannot directly access values from other columns.
I'll propose two rather RAM-hungry solutions, along with a way to tame the RAM requirements and still get great performance. (Far better than using apply.)
Two Solutions
Solution #1: Stack the Lists
One way around the issue above is to use arr.concat to stack the two equal-sized lists into a single List. This way, all the necessary values for the dot calculation are in one List.
To access the values from b and c, we can use slice to address each half of the stacked List separately.
Because we are using arr.eval, the result of our dot computation is a column of type List, with one element (the dot product) in each List. To convert the List to a scalar, we can use arr.first.
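As an illustration of the intermediate stacked List (the 'stacked' column name here is only for demonstration, not part of the solution):

df.with_column(pl.col("b").arr.concat("c").alias("stacked"))
# row 0: b = [1, 2, 3], c = [1, 1, 1]  ->  stacked = [1, 2, 3, 1, 1, 1]
# slice(0, 3) recovers the b half; slice(3, 3) recovers the c half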
(
    df
    .with_column(
        pl.col("b")
        .arr.concat("c")  # stack b and c into one List per row
        .arr.eval(
            pl.element()
            .slice(0, pl.count() // 2)  # first half: the values from b
            .dot(
                pl.element()
                .slice(pl.count() // 2, pl.count() // 2)  # second half: the values from c
            ),
            parallel=True
        )
        .arr.first()  # unwrap the one-element List to a scalar
        .alias('result')
    )
)
shape: (5, 3)
┌───────────┬───────────┬────────┐
│ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ i64 │
╞═══════════╪═══════════╪════════╡
│ [1, 2, 3] ┆ [1, 1, 1] ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [2, 3] ┆ [2, 2] ┆ 10 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [4] ┆ [4] ┆ 16 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [1, 2, 3] ┆ [1, 5, 6] ┆ 29 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [] ┆ [] ┆ null │
└───────────┴───────────┴────────┘
Note the parallel=True keyword. Although this greatly speeds up the algorithm by parallelizing the work, it also inflates the RAM requirements.
Solution #2: Explode
I realize that you specifically did not want this option. But we can tame the RAM requirements and still get great performance.
One way we are taming RAM usage is to select only the columns needed before the explode step. Thus, we are not replicating values in all columns of the DataFrame. We can then hstack the resulting one-column DataFrame onto our original DataFrame.
Note that maintain_order=True is important: it preserves the row order for the final hstack.
result = (
    df
    .select(['b', 'c'])                         # keep only the columns we need
    .with_row_count()                           # tag each row with its position
    .explode(['b', 'c'])                        # one row per paired list element
    .groupby('row_nr', maintain_order=True)     # regroup by original row, in order
    .agg(pl.col('b').dot('c').alias('result'))  # dot product within each row
    .drop('row_nr')
)
>>> df.hstack(result)
shape: (5, 3)
┌───────────┬───────────┬────────┐
│ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ i64 │
╞═══════════╪═══════════╪════════╡
│ [1, 2, 3] ┆ [1, 1, 1] ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [2, 3] ┆ [2, 2] ┆ 10 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [4] ┆ [4] ┆ 16 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [1, 2, 3] ┆ [1, 5, 6] ┆ 29 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ [] ┆ [] ┆ null │
└───────────┴───────────┴────────┘
This algorithm is far less RAM-hungry than the first solution.
Taming the RAM requirements: slice
For large DataFrames and long Lists, the above two algorithms can use a lot of RAM.
One great way to tame this (and still get great performance) is to perform our algorithm on slices of the DataFrame and then concatenate the results. slice does not make copies of data, and as such is very fast.
We do pay a performance penalty for having to concatenate our results at the end. However, our results are a DataFrame with a single column (our dot-product result), so the concatenation step is not terribly time-consuming.
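Concatenating the slices back together reproduces the original frame, so nothing is lost by working slice-by-slice. A quick sanity check of that round-trip:

slice_size = 10_000
slices = [df.slice(next_index, slice_size) for next_index in range(0, df.height, slice_size)]
assert pl.concat(slices).frame_equal(df)  # slice + concat round-trips the frame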
Here are the sliced versions of the algorithms. I've used a configurable slice_size variable for the number of rows in each slice. Setting this higher yields a faster result but uses more RAM; setting it lower saves RAM, but at the price of wall-clock time.
Solution #1: Sliced
For the sliced version, I have replaced the with_column with select, so that the result is a one-column DataFrame containing only the dot-product.
slice_size = 10_000

result = pl.concat([
    df
    .slice(next_index, slice_size)
    .select(
        pl.col("b")
        .arr.concat("c")
        .arr.eval(
            pl.element()
            .slice(0, pl.count() // 2)
            .dot(
                pl.element()
                .slice(pl.count() // 2, pl.count() // 2)
            ),
            parallel=True
        )
        .arr.first()
        .alias('result')
    )
    for next_index in range(0, df.height, slice_size)
])
df.hstack(result)
Solution #2: Sliced
slice_size = 10_000

result = pl.concat([
    (
        df
        .slice(next_index, slice_size)
        .select(['b', 'c'])
        .with_row_count()
        .explode(['b', 'c'])
        .groupby('row_nr', maintain_order=True)
        .agg(pl.col('b').dot('c').alias('result'))
        .drop('row_nr')
    )
    for next_index in range(0, df.height, slice_size)
])
df.hstack(result)
I would suggest first trying the above two sliced solutions on a small subset of your DataFrame. This should give you an idea as to the best value for slice_size given your computing platform.
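For instance, a rough calibration loop might look like this (a sketch; the subset size and candidate values are arbitrary, and run_sliced is a hypothetical wrapper around the second sliced solution):

import time

import polars as pl

def run_sliced(frame: pl.DataFrame, slice_size: int) -> pl.DataFrame:
    # Hypothetical helper: sliced Solution #2, parameterized by slice_size.
    return pl.concat([
        frame
        .slice(next_index, slice_size)
        .select(['b', 'c'])
        .with_row_count()
        .explode(['b', 'c'])
        .groupby('row_nr', maintain_order=True)
        .agg(pl.col('b').dot('c').alias('result'))
        .drop('row_nr')
        for next_index in range(0, frame.height, slice_size)
    ])

subset = df.head(1_000_000)  # calibrate on a small sample first
for candidate in (10_000, 100_000, 1_000_000):
    start = time.perf_counter()
    _ = run_sliced(subset, candidate)
    print(f"slice_size={candidate}: {time.perf_counter() - start:.1f} sec")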
Performance
So, how do these perform? I'll take your example dataset and inflate it to 100 million records.
nbr_records = 100_000_000
nbr_groups = nbr_records // 5

df = (
    pl.DataFrame({'group': pl.arange(0, nbr_groups, eager=True)})
    .join(
        pl.DataFrame({
            "b": [[1, 2, 3], [2, 3], [4], [1, 2, 3], []],
            "c": [[1, 1, 1], [2, 2], [4], [1, 5, 6], []],
        }),
        how="cross"
    )
)
df
shape: (100000000, 3)
┌──────────┬───────────┬───────────┐
│ group ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] │
╞══════════╪═══════════╪═══════════╡
│ 0 ┆ [1, 2, 3] ┆ [1, 1, 1] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [2, 3] ┆ [2, 2] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [4] ┆ [4] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [1, 2, 3] ┆ [1, 5, 6] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ [2, 3] ┆ [2, 2] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ [4] ┆ [4] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ [1, 2, 3] ┆ [1, 5, 6] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 19999999 ┆ [] ┆ [] │
└──────────┴───────────┴───────────┘
I timed each of these algorithms using various slice sizes in the table below.
For reference, my computing platform is a 32-core Threadripper Pro with 512 GB of RAM.
Before constructing the 100 million row DataFrame, my RAM usage is roughly 4.7 GB. When the 100 million row DataFrame is loaded, my RAM usage rises to about 9.7 GB - before the algorithms are run.
For each run, I recorded the maximum observed RAM, to give an idea as to how RAM-hungry the algorithms are with various slice sizes.
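One way to capture such a peak (a sketch; psutil and this polling approach are assumptions, not necessarily how the numbers below were gathered) is to poll system memory from a background thread while the algorithm runs:

import threading
import time

import psutil  # assumption: not necessarily what was used for the table below

def peak_ram(fn):
    # Run fn() while polling system RAM usage; return (result, peak bytes seen).
    peak = psutil.virtual_memory().used
    stop = threading.Event()

    def poll():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, psutil.virtual_memory().used)
            time.sleep(0.1)

    t = threading.Thread(target=poll)
    t.start()
    try:
        out = fn()
    finally:
        stop.set()
        t.join()
    return out, peak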
| Algorithm | Slice size  | Wall-clock time | Maximum observed RAM |
|-----------|-------------|-----------------|----------------------|
| #1        | 1,000,000   | 95 sec          | 11.0 GB              |
| #1        | 10,000,000  | 91 sec          | 15.3 GB              |
| #1        | 100,000,000 | 80 sec          | 54.4 GB              |
| #2        | 1,000,000   | 85 sec          | 10.5 GB              |
| #2        | 10,000,000  | 85 sec          | 11.6 GB              |
| #2        | 100,000,000 | 84 sec          | 21.3 GB              |
One notable result: the explode solution does not suffer significantly when the algorithm is applied in slices.
Obviously, how each of these performs on your actual data and your computing platform will differ from the numbers above. And depending on the size of your lists, you may see one perform significantly better than the other.
Compared to using apply
How much faster are the above two solutions than simply using apply and the numpy dot function on our 100 million records?
import numpy as np

(
    df
    .with_column(
        pl.struct(['b', 'c'])
        .apply(lambda cols: np.dot(cols['b'], cols['c']))
        .alias('result')
    )
)
Time: 390 seconds