
I have a simple dataframe as follows:

import polars as pl

df = pl.DataFrame(
    {
        "group": [1, 1, 1, 1, 2, 2, 2, 2],
        "a": [1, 2, 3, 4, 1, 2, 3, 4],
        "b": [5, 1, 7, 9, 2, 4, 9, 7],
        "c": [2, 6, 3, 9, 1, 5, 3, 6],
    }
)

I want a correlation 'matrix' that resides in a Polars DataFrame, structured like the one below. How can I do that?

┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ name ┆ a        ┆ b        ┆ c        │
│ ---   ┆ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i64   ┆ str  ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1     ┆ a    ┆ 1.0      ┆ 0.680336 ┆ 0.734847 │
│ 1     ┆ b    ┆ 0.680336 ┆ 1.0      ┆ 0.246885 │
│ 1     ┆ c    ┆ 0.734847 ┆ 0.246885 ┆ 1.0      │
│ 2     ┆ a    ┆ 1.0      ┆ 0.830455 ┆ 0.756889 │
│ 2     ┆ b    ┆ 0.830455 ┆ 1.0      ┆ 0.410983 │
│ 2     ┆ c    ┆ 0.756889 ┆ 0.410983 ┆ 1.0      │
└───────┴──────┴──────────┴──────────┴──────────┘

Currently, this is what I tried:

df.groupby("group").agg(
    [
        pl.corr(col1, col2).alias(f"{col1}_{col2}")
        for col1 in ["a", "b", "c"]
        for col2 in ["a", "b", "c"]
    ]
)

shape: (2, 10)
┌───────┬─────┬──────────┬──────────┬─────┬──────────┬──────────┬──────────┬─────┐
│ group ┆ a_a ┆ a_b      ┆ a_c      ┆ ... ┆ b_c      ┆ c_a      ┆ c_b      ┆ c_c │
│ ---   ┆ --- ┆ ---      ┆ ---      ┆     ┆ ---      ┆ ---      ┆ ---      ┆ --- │
│ i64   ┆ f64 ┆ f64      ┆ f64      ┆     ┆ f64      ┆ f64      ┆ f64      ┆ f64 │
╞═══════╪═════╪══════════╪══════════╪═════╪══════════╪══════════╪══════════╪═════╡
│ 2     ┆ 1.0 ┆ 0.830455 ┆ 0.756889 ┆ ... ┆ 0.410983 ┆ 0.756889 ┆ 0.410983 ┆ 1.0 │
│ 1     ┆ 1.0 ┆ 0.680336 ┆ 0.734847 ┆ ... ┆ 0.246885 ┆ 0.734847 ┆ 0.246885 ┆ 1.0 │
└───────┴─────┴──────────┴──────────┴─────┴──────────┴──────────┴──────────┴─────┘

So, I'm not sure how to transform it into the shape/structure I want. Or are there other (potentially better) ways to generate the results I want directly?

lebesgue

2 Answers

4

@jqurious, here's an approach that uses the `corr` method on the DataFrame itself.

(
    pl.concat(
        next_df
        .select(pl.exclude('group'))   # correlate only the value columns
        .corr()                        # full correlation matrix for this group
        .select([
            pl.lit(next_group).alias('group'),
            pl.Series(next_df.columns[1:]).alias('name'),   # row labels: a, b, c
            pl.all()
        ])
        for next_group, next_df
        in df.partition_by('group', as_dict=True).items()   # one DataFrame per group
    )
)
shape: (6, 5)
┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ name ┆ a        ┆ b        ┆ c        │
│ ---   ┆ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i32   ┆ str  ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1     ┆ a    ┆ 1.0      ┆ 0.680336 ┆ 0.734847 │
│ 1     ┆ b    ┆ 0.680336 ┆ 1.0      ┆ 0.246885 │
│ 1     ┆ c    ┆ 0.734847 ┆ 0.246885 ┆ 1.0      │
│ 2     ┆ a    ┆ 1.0      ┆ 0.830455 ┆ 0.756889 │
│ 2     ┆ b    ┆ 0.830455 ┆ 1.0      ┆ 0.410983 │
│ 2     ┆ c    ┆ 0.756889 ┆ 0.410983 ┆ 1.0      │
└───────┴──────┴──────────┴──────────┴──────────┘
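
Note for newer Polars releases: `partition_by(..., as_dict=True)` now keys the returned dict by tuples rather than by the bare group value, so the loop needs a small tweak. A sketch of the adaptation (the exact key shape is version-dependent, so treat the key handling as an assumption):

# Sketch for newer Polars releases, where as_dict=True yields tuple keys:
# the group value is key[0] rather than the key itself.
pl.concat(
    part
    .drop('group')   # correlate only the value columns
    .corr()
    .select([
        pl.lit(key[0]).alias('group'),
        pl.Series(part.columns[1:]).alias('name'),
        pl.all()
    ])
    for key, part in df.partition_by('group', as_dict=True).items()
)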

Performance

How does it perform? Let's take 1,000 groups and 100,000 observations per group:

import numpy as np
import time

nbr_groups = 1_000
nbr_obs_per_group = 100_000

rng = np.random.default_rng(1)
df = pl.DataFrame({
    'group': list(range(0, nbr_groups)) * nbr_obs_per_group,
    **{col_nm: rng.normal(0, 1, nbr_obs_per_group * nbr_groups)
       for col_nm in ['a', 'b', 'c']
       }
})
df
shape: (100000000, 4)
┌───────┬───────────┬───────────┬───────────┐
│ group ┆ a         ┆ b         ┆ c         │
│ ---   ┆ ---       ┆ ---       ┆ ---       │
│ i64   ┆ f64       ┆ f64       ┆ f64       │
╞═══════╪═══════════╪═══════════╪═══════════╡
│ 0     ┆ 0.345584  ┆ -0.858613 ┆ 1.382227  │
│ 1     ┆ 0.821618  ┆ 0.965737  ┆ 1.086405  │
│ 2     ┆ 0.330437  ┆ -0.567488 ┆ 0.57299   │
│ 3     ┆ -1.303157 ┆ 1.070117  ┆ 0.147326  │
│ …     ┆ …         ┆ …         ┆ …         │
│ 996   ┆ 0.842205  ┆ 0.515653  ┆ 0.88825   │
│ 997   ┆ -0.133607 ┆ -1.1532   ┆ -1.041619 │
│ 998   ┆ -0.256237 ┆ 0.654807  ┆ -0.852552 │
│ 999   ┆ -0.627053 ┆ -0.133583 ┆ 0.531616  │
└───────┴───────────┴───────────┴───────────┘
start = time.perf_counter()
(
    pl.concat(
        next_df
        .select(pl.exclude('group'))
        .corr()
        .select([
            pl.lit(next_group).alias('group'),
            pl.Series(next_df.columns[1:]).alias('name'),
            pl.all()
        ])
        for next_group, next_df
        in df.partition_by('group', as_dict=True).items()
    )
)
print(time.perf_counter() - start)
shape: (3000, 5)
┌───────┬──────┬───────────┬───────────┬───────────┐
│ group ┆ name ┆ a         ┆ b         ┆ c         │
│ ---   ┆ ---  ┆ ---       ┆ ---       ┆ ---       │
│ i32   ┆ str  ┆ f64       ┆ f64       ┆ f64       │
╞═══════╪══════╪═══════════╪═══════════╪═══════════╡
│ 0     ┆ a    ┆ 1.0       ┆ 0.002105  ┆ 0.006153  │
│ 0     ┆ b    ┆ 0.002105  ┆ 1.0       ┆ -0.001446 │
│ 0     ┆ c    ┆ 0.006153  ┆ -0.001446 ┆ 1.0       │
│ 1     ┆ a    ┆ 1.0       ┆ -0.00137  ┆ -0.003253 │
│ …     ┆ …    ┆ …         ┆ …         ┆ …         │
│ 998   ┆ c    ┆ 0.002755  ┆ 0.001199  ┆ 1.0       │
│ 999   ┆ a    ┆ 1.0       ┆ -0.001362 ┆ -0.000156 │
│ 999   ┆ b    ┆ -0.001362 ┆ 1.0       ┆ -0.00049  │
│ 999   ┆ c    ┆ -0.000156 ┆ -0.00049  ┆ 1.0       │
└───────┴──────┴───────────┴───────────┴───────────┘
6.240678512999693

Roughly 6 seconds on my 32-core system for a DataFrame of 100 million records. There may be faster algorithms, but the code is relatively straightforward.

  • Ah, there's also a `DataFrame.corr` - thanks for the pointer. Yes I agree, it's a very readable approach. +1 – jqurious Mar 22 '23 at 19:29
  • 1
    Does select respect the order we specified (will the column order always be group, name, a, b, c)? I did not pay attention to this before because I did not care much about column order. But in this case, I do care; otherwise the corr matrix will not be a symmetric matrix. – lebesgue Mar 22 '23 at 19:35
  • 1
    Another question: does this only work for pl.DataFrame? I cannot see a corr method on LazyFrame in the docs. – lebesgue Mar 22 '23 at 19:38
  • Select always respects the order of the arguments. Also, the corr() function is not available with LazyFrames. – ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ Mar 22 '23 at 20:06
  • From my testing, the solution you outlined is significantly slower than the one from @jqurious when nbr_obs_per_group = 10_000. – lebesgue Mar 23 '23 at 00:42
  • @lebesgue Try increasing `nbr_obs_per_group` - this approach becomes the faster one. It's also more "natural" to read. Changing `.select(pl.exclude('group'))` to `.drop('group')` also seems to shave off a small % (maybe it's just "noise") – jqurious Mar 23 '23 at 07:16
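
Regarding the LazyFrame question in the comments above: `DataFrame.corr` is eager-only, but `pl.corr` is an ordinary expression, so the per-group correlations themselves can run lazily. A minimal sketch, assuming the same column names as above (the reshape into matrix layout, as in the next answer, still happens eagerly):

cols = ['a', 'b', 'c']

# pl.corr is a regular aggregation expression, so this part can stay lazy,
# unlike the eager-only DataFrame.corr() used above.
long_corr = (
    df.lazy()
    .groupby('group')
    .agg([
        pl.corr(c1, c2).alias(f'{c1}_{c2}')
        for c1 in cols
        for c2 in cols
    ])
    .collect()
)
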
2

Here is one way to do it:

(
    df.groupby("group")
    .agg([
        pl.corr(col1, col2).alias(f"{col1}_{col2}")   # all 9 pairwise correlations
        for col1 in ["a", "b", "c"]
        for col2 in ["a", "b", "c"]
    ])
    .melt(id_vars='group')   # long format: group, variable, value
    .with_columns(
        var1=pl.col('variable').str.split('_').arr.get(0),
        var2=pl.col('variable').str.split('_').arr.get(1),
    )
    .pivot(values='value', index=['group', 'var1'], columns='var2')
    .sort(by='group')
)

shape: (6, 5)
┌───────┬──────┬──────────┬──────────┬──────────┐
│ group ┆ var1 ┆ a        ┆ b        ┆ c        │
│ ---   ┆ ---  ┆ ---      ┆ ---      ┆ ---      │
│ i64   ┆ str  ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪══════╪══════════╪══════════╪══════════╡
│ 1     ┆ a    ┆ 1.0      ┆ 0.680336 ┆ 0.734847 │
│ 1     ┆ b    ┆ 0.680336 ┆ 1.0      ┆ 0.246885 │
│ 1     ┆ c    ┆ 0.734847 ┆ 0.246885 ┆ 1.0      │
│ 2     ┆ a    ┆ 1.0      ┆ 0.830455 ┆ 0.756889 │
│ 2     ┆ b    ┆ 0.830455 ┆ 1.0      ┆ 0.410983 │
│ 2     ┆ c    ┆ 0.756889 ┆ 0.410983 ┆ 1.0      │
└───────┴──────┴──────────┴──────────┴──────────┘
Luca
  • Do we need to reorder the columns afterwards to make sure the order is correct? Or will the order always be guaranteed? – lebesgue Mar 22 '23 at 18:13
  • Hi @lebesgue Yes I had to reorder them – Luca Mar 22 '23 at 18:17
  • How can I change your code to make sure the correlation matrix will be ordered correctly? – lebesgue Mar 22 '23 at 18:22
  • The sort at the end will take care of it. The number of rows after calculating the correlation should be small so the sorting at the end should have no impact on performance – Luca Mar 22 '23 at 18:29
  • Thanks! What I mean is the order of "a", "b" and "c" on both rows and columns. Will they always be in the same order, or could different runs result in a different order? – lebesgue Mar 22 '23 at 18:32
  • @lebesgue ok now I get your question. It’s true that Polars runs calculations in parallel whenever possible so that might result in reordering depending on which calculation finishes first. But I believe you should be ok in this case – Luca Mar 22 '23 at 18:56
  • So that means, to make sure the order is aligned correctly, we'd better reorder the columns/rows? – lebesgue Mar 22 '23 at 19:27
  • For functions that work in parallel, the order might change. Some functions have the option to keep the order, like groupby. If you want to make sure of the order, you can always add a sort at the end. For a final table this small, the sort will be inexpensive – Luca Mar 22 '23 at 19:37
  • Last point: this solution will likely not be the most efficient, because we are calculating each correlation twice, unless Polars recognises and optimises this behind the scenes. Would need to test it – Luca Mar 22 '23 at 19:45
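
Following up on that last comment: a variant that computes each off-diagonal correlation only once and mirrors it afterwards could look like the sketch below. It assumes no correlation is genuinely null, so that fill_null(1.0) only ever fills the diagonal:

import itertools

cols = ["a", "b", "c"]

# One correlation per unordered pair (the upper triangle only).
upper = (
    df.groupby("group")
    .agg([
        pl.corr(c1, c2).alias(f"{c1}_{c2}")
        for c1, c2 in itertools.combinations(cols, 2)
    ])
    .melt(id_vars="group")
    .with_columns(
        var1=pl.col("variable").str.split("_").arr.get(0),
        var2=pl.col("variable").str.split("_").arr.get(1),
    )
    .drop("variable")
)

# Mirror each entry into the lower triangle by swapping var1/var2.
mirror = upper.select([
    "group",
    "value",
    pl.col("var2").alias("var1"),
    pl.col("var1").alias("var2"),
])

(
    pl.concat([upper, mirror])
    .pivot(values="value", index=["group", "var1"], columns="var2")
    .with_columns(pl.col(cols).fill_null(1.0))   # diagonal: corr(x, x) == 1
    .select(["group", "var1"] + cols)            # enforce a fixed column order
    .sort(["group", "var1"])
)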