Polars column from conditioned look up of dictionary values

Question

I want to map a key in one Polars DataFrame to another Polars DF base on the relationships between columns. This is just a sample, the full DF1 and DF2 is much larger (2.5 million and 1.5 million rows respectively.

DF1 = = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"], 
'start': [14516,17380,17381,20177,22254,24357], 
'end': [14534,17399,17399,20195,22274,24377]
})

DF2 = = pl.DataFrame({ 
'key' : [1,2,3,4,5,6],
'chrom' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"], 
'start': [14516,15377,17376,20177,22254, 24357], 
'end': [14534,15403,17399,20195,22274,24377]})

What I want is:

DF1 = = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"], 
'start': [14516,17380,17381,20177,22254,24357], 
'end': [14534,17399,17399,20195,22274,24377],
'key': [1,3,3,4,5,6]
})

I'd like to assign the key from DF2 to DF1 when chrom matches chr and the start and end in DF1 are contained within the begin and end in DF2.

I first attempted to iterate through the rows of DF1, looking up the matching entry in DF2:

sz = len(DF1[:,0])

for i in range(sz):
    DF1[i,"key"] = DF2.filter(
        (pl.col("chrom") == DF1[i,"chr"])\
        & (pl.col("begin") <= DF1[i,"start"])\
        & (pl.col("end") >= DF1[i,"end"])
        ).select('key')[0,0]

Row iteration through a DF is incredibly slow. This takes about 10 hours.

I also tried using a np.array instead of directly into the df. thats a little faster, but still very slow.

I'm looking for a way to accomplish this using the native Polar's data structure. I don't have key to join on so the join and join_asof strategies don't seem to work.

can you make your example reproducible and show expected output please? — ignoring_gravity, Feb 15 '23 at 21:38
Would this help? https://stackoverflow.com/a/73177515/20557510 — ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ, Feb 15 '23 at 21:47
Another example would be: https://github.com/pola-rs/polars/issues/5004#issuecomment-1259784083 — ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ, Feb 15 '23 at 21:51
@ignoring_gravity I tried to add a better explanation, sorry. — Jim Beck, Feb 16 '23 at 18:59
@ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ, Thanks for your suggestions - but I don't have a key to join on - its really just three conditions. In fact, the key is what I'm trying to establish in DF1 so I can do joins. — Jim Beck, Feb 16 '23 at 18:59
thanks @JimBeck - it's still not reproducible though. can you post something we can copy and paste? — ignoring_gravity, Feb 16 '23 at 19:03
@ignoring_gravity I added the actual full DF1 and DF2 via a link to a repo - so they can be downloaded. The code above can be run on those files. I'm not sure what else I can provide to make it copy and paste. — Jim Beck, Feb 16 '23 at 19:28
something like `df1 = pl.DataFrame({'start': [14516, 14517, ...], 'end': [14712, ...]})`, `df2 = ...` . We should be able to run your code without downloading data — ignoring_gravity, Feb 16 '23 at 19:37
@ignoring_gravity Thanks for the guidance. And just to be clear - the code I wrote works - its just so bloody slow. — Jim Beck, Feb 16 '23 at 20:02

score 0 · Accepted Answer · answered Feb 16 '23 at 21:30

join and filter should give you what you need:

(
    df1.join(df2, left_on="chr", right_on="chrom")
    .filter(
        (pl.col("start") >= pl.col("start_right"))
        & (pl.col("end") <= pl.col("end_right"))
    )
    .drop(["start_right", "end_right"])
)

shape: (6, 4)
┌────────────┬───────┬───────┬─────┐
│ chr        ┆ start ┆ end   ┆ key │
│ ---        ┆ ---   ┆ ---   ┆ --- │
│ str        ┆ i64   ┆ i64   ┆ i64 │
╞════════════╪═══════╪═══════╪═════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   │
└────────────┴───────┴───────┴─────┘

Thanks for the solution. I was thinking I couldn't use a join because the `chr` field is a many to many relationship between both DF1 and DF2. — Jim Beck, Feb 18 '23 at 03:57

score 0 · Answer 2 · answered Feb 17 '23 at 00:23

Using a join_asof might provide a performant solution:

(
    DF1
    .sort('start')
    .join_asof(
        DF2.sort('start'),
        by_left="chr",
        by_right="chrom",
        on="start",
        strategy="backward")
    .filter(
        pl.col('end') <= pl.col('end_right')
    )
)

shape: (6, 5)
┌────────────┬───────┬───────┬─────┬───────────┐
│ chr        ┆ start ┆ end   ┆ key ┆ end_right │
│ ---        ┆ ---   ┆ ---   ┆ --- ┆ ---       │
│ str        ┆ i64   ┆ i64   ┆ i64 ┆ i64       │
╞════════════╪═══════╪═══════╪═════╪═══════════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   ┆ 14534     │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   ┆ 20195     │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   ┆ 22274     │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   ┆ 24377     │
└────────────┴───────┴───────┴─────┴───────────┘

Note: this assumes that your start-end intervals in DF2 do not overlap.

Polars column from conditioned look up of dictionary values

2 Answers2