I want to map a key in one Polars DataFrame to another Polars DF base on the relationships between columns. This is just a sample, the full DF1 and DF2 is much larger (2.5 million and 1.5 million rows respectively.
DF1 = = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,17380,17381,20177,22254,24357],
'end': [14534,17399,17399,20195,22274,24377]
})
DF2 = = pl.DataFrame({
'key' : [1,2,3,4,5,6],
'chrom' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,15377,17376,20177,22254, 24357],
'end': [14534,15403,17399,20195,22274,24377]})
What I want is:
DF1 = = pl.DataFrame({
'chr' : ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2","GL000008.2", "GL000008.2"],
'start': [14516,17380,17381,20177,22254,24357],
'end': [14534,17399,17399,20195,22274,24377],
'key': [1,3,3,4,5,6]
})
I'd like to assign the key
from DF2 to DF1 when chrom
matches chr
and the start
and end
in DF1 are contained within the begin
and end
in DF2.
I first attempted to iterate through the rows of DF1, looking up the matching entry in DF2:
sz = len(DF1[:,0])
for i in range(sz):
DF1[i,"key"] = DF2.filter(
(pl.col("chrom") == DF1[i,"chr"])\
& (pl.col("begin") <= DF1[i,"start"])\
& (pl.col("end") >= DF1[i,"end"])
).select('key')[0,0]
Row iteration through a DF is incredibly slow. This takes about 10 hours.
I also tried using a np.array instead of directly into the df. thats a little faster, but still very slow.
I'm looking for a way to accomplish this using the native Polar's data structure. I don't have key to join on so the join
and join_asof
strategies don't seem to work.