We'll adapt the algorithm from this Stack Overflow question. I say "adapt" because, from the minimal example you provided, it appears that the lists in col2 and col3 are not sets, in that some values are duplicated within a list. As such, we'll need to first remove the duplicates so that we are working with sets (as opposed to lists).
The algorithm does not require that the values be numbers. To demonstrate, let's convert col2 and col3 to strings.
import polars as pl
df = pl.DataFrame(
{
"col1": ["abc", "def", "ghi"],
"col2": [["cat", "dog", "mouse", "bird"], ["cat", "dog"], ["snail", "worm"]],
"col3": [["cat", "bird"], ["dog", "dog"], ["mouse", "bird", "starfish"]],
}
)
df
shape: (3, 3)
┌──────┬────────────────────────────┬───────────────────────────────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] │
╞══════╪════════════════════════════╪═══════════════════════════════╡
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ def ┆ ["cat", "dog"] ┆ ["dog", "dog"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] │
└──────┴────────────────────────────┴───────────────────────────────┘
Adapting the algorithm from the Stack Overflow question mentioned above, we can calculate the intersection for your minimal example as follows:
df.with_column(
    pl.col("col2")
    .arr.unique()                             # de-dup col2
    .arr.concat(pl.col('col3').arr.unique())  # de-dup col3 and append it to col2
    .arr.eval(pl.element().filter(pl.element().is_duplicated()), parallel=True)  # keep values present in both
    .arr.unique()                             # de-dup the result
    .alias('intersection')
)
shape: (3, 4)
┌──────┬────────────────────────────┬───────────────────────────────┬─────────────────┐
│ col1 ┆ col2 ┆ col3 ┆ intersection │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ list[str] │
╞══════╪════════════════════════════╪═══════════════════════════════╪═════════════════╡
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ ["cat", "bird"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ def ┆ ["cat", "dog"] ┆ ["dog", "dog"] ┆ ["dog"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ [] │
└──────┴────────────────────────────┴───────────────────────────────┴─────────────────┘
For a discussion of how the algorithm works, please see the above-mentioned Stack Overflow question.
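To make the idea concrete, here's a minimal sketch in plain Python (not part of the original answer) of the same trick applied to a single pair of lists: de-dup each list, concatenate them, and keep only the values that now appear more than once. The names a, b, and intersection are illustrative only.
# De-dup each list, concatenate, and keep values that appear more than once:
# those are exactly the values common to both lists.
a = ["cat", "dog", "mouse", "bird"]
b = ["cat", "bird"]
combined = list(set(a)) + list(set(b))
intersection = {x for x in combined if combined.count(x) > 1}
print(intersection)  # {'cat', 'bird'}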
Performance
To get some idea of performance, let's take our data and replicate it to roughly 100 million records. We'll use a cross join to accomplish this easily.
df = pl.DataFrame(
{
"col1": ["abc", "def", "ghi"],
"col2": [["cat", "dog", "mouse", "bird"], ["cat", "dog"], ["snail", "worm"]],
"col3": [["cat", "bird"], ["dog", "dog"], ["mouse", "bird", "starfish"]],
}
)
nbr_groups = 34_000_000
df = (
df
.join(
pl.DataFrame({
'group': pl.arange(0, nbr_groups, eager=True)
}),
how="cross"
)
)
df
shape: (102000000, 4)
┌──────┬────────────────────────────┬───────────────────────────────┬──────────┐
│ col1 ┆ col2 ┆ col3 ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ i64 │
╞══════╪════════════════════════════╪═══════════════════════════════╪══════════╡
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999996 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999997 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999998 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999999 │
└──────┴────────────────────────────┴───────────────────────────────┴──────────┘
On my platform (a 32-core Threadripper Pro), I timed the algorithm using Python's time.perf_counter. The algorithm took 163 seconds to complete for the 102M records.
import time
start = time.perf_counter()
df.with_column(
pl.col("col2")
.arr.unique()
.arr.concat(pl.col('col3').arr.unique())
.arr.eval(pl.element().filter(pl.element().is_duplicated()), parallel=True)
.arr.unique()
.alias('intersection')
)
print(time.perf_counter() - start)
shape: (102000000, 5)
┌──────┬────────────────────────────┬───────────────────────────────┬──────────┬─────────────────┐
│ col1 ┆ col2 ┆ col3 ┆ group ┆ intersection │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ i64 ┆ list[str] │
╞══════╪════════════════════════════╪═══════════════════════════════╪══════════╪═════════════════╡
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 0 ┆ ["bird", "cat"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 1 ┆ ["bird", "cat"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 2 ┆ ["bird", "cat"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ abc ┆ ["cat", "dog", ... "bird"] ┆ ["cat", "bird"] ┆ 3 ┆ ["cat", "bird"] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999996 ┆ [] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999997 ┆ [] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999998 ┆ [] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ghi ┆ ["snail", "worm"] ┆ ["mouse", "bird", "starfish"] ┆ 33999999 ┆ [] │
└──────┴────────────────────────────┴───────────────────────────────┴──────────┴─────────────────┘
print(time.perf_counter() - start)
162.8689589149999
As you watch CPU usage, you'll note that the algorithm initially spawns only two threads. These two threads are presumably de-duplicating col2 and col3 (the arr.unique calls).
After both threads complete, you'll see the algorithm spread across all CPU cores as it calculates the intersection in the arr.eval step. (Note: the parallel=True keyword is important; see the comparison sketch below.)
The algorithm then returns to a single thread to de-dup the intersection column (a necessary step).
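If you want to see the effect of that keyword on your own platform, one option (not something benchmarked in the original answer) is to time the same expression with parallel=False and compare:
# Same expression as above, but with parallel=False in the arr.eval step.
# Timing this against the parallel=True version shows how much of the
# runtime comes from the intersection calculation itself.
df.with_column(
    pl.col("col2")
    .arr.unique()
    .arr.concat(pl.col('col3').arr.unique())
    .arr.eval(pl.element().filter(pl.element().is_duplicated()), parallel=False)
    .arr.unique()
    .alias('intersection')
)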
If your actual data does not contain duplicate values in the lists (in the equivalent of col2 and col3), then you can skip the initial de-dup step, as in the sketch below.
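Here's a sketch of what that would look like, assuming col2 and col3 already contain no duplicates within each list. Note that the final arr.unique is still needed, because each intersecting value appears twice after the concat.
# Assumes col2 and col3 are already sets (no duplicates within a list),
# so the initial arr.unique calls are dropped.
df.with_column(
    pl.col("col2")
    .arr.concat(pl.col('col3'))
    .arr.eval(pl.element().filter(pl.element().is_duplicated()), parallel=True)
    .arr.unique()  # each common value appears twice after the concat; collapse to one
    .alias('intersection')
)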
One further note: you can try the algorithm on the original strings rather than the integer values that you created. For your actual data and your computing platform, you'll have to judge whether any speed-up from running the algorithm on numbers is worth the cost of converting the strings to numbers and then converting them back to strings afterwards.
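If you do want to test the numeric route, a minimal sketch of one way to do the round trip might look like the following (the codes dictionary and the df_int name are illustrative, not from your data):
# Hypothetical encoding step: map each distinct string to an integer code
# before building the DataFrame, and keep the reverse mapping to decode
# the intersection afterwards.
codes = {"cat": 0, "dog": 1, "mouse": 2, "bird": 3, "snail": 4, "worm": 5, "starfish": 6}
decode = {v: k for k, v in codes.items()}

col2 = [["cat", "dog", "mouse", "bird"], ["cat", "dog"], ["snail", "worm"]]
col3 = [["cat", "bird"], ["dog", "dog"], ["mouse", "bird", "starfish"]]

df_int = pl.DataFrame(
    {
        "col1": ["abc", "def", "ghi"],
        "col2": [[codes[s] for s in row] for row in col2],
        "col3": [[codes[s] for s in row] for row in col3],
    }
)
# ...run the intersection algorithm on df_int, then map the integer codes
# in the intersection column back to strings using the decode dictionary.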