One way in Pandas to map a Series from one value to another (or to do a 'lookup value') is with the map function.
So If I have two DataFrames (say one dimension and one fact table):
# imports
import numpy as np
import pandas as pd
# Dimension Table - DataFrame to hold the letter and its ID.
letters_ids = pd.DataFrame({
'Letters': ['A', 'B', 'C'],
'Letters_id': [1, 2, 3]
}).set_index('Letters')
# Fact Table - DataFrame with millions of letters.
many_letters = pd.DataFrame({
'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)
})
And considering that the business model require that we add to the fact table the 'Letters_id' we could do:
many_letters['letters_mapped'] = many_letters.Letters.map(letters_ids.squeeze())
That's clean and straightforward.
Now what about Polars? I mean Polars have a map function but doesn't seems to work like Pandas.
I found two ways with Polars to map a Series from one value to another (or to do a 'lookup value') but I'm feeling I'm missing a proper way of doing it.
So say we have the same dataset with Polars:
# imports
import numpy as np
import polars as pl
# Dimension Table - DataFrame to hold the letter and its ID.
letters_ids = pl.DataFrame({
'Letters': ['A', 'B', 'C'],
'Letters_id': [1, 2, 3]
})
# Fact Table - DataFrame with millions of letters.
many_letters = pl.DataFrame({
'Letters': np.random.choice(['A', 'B', 'C'], 10_000_000)
})
Now in order to add to the fact table the 'Letters_id':
Approach # 1 - We can use the join method:
many_letters = many_letters.join(
other=letters_ids,
on='Letters',
how='left'
).rename({'Letters_id': 'letters_mapped'})
Approach # 2 - Use the map_dict (found the idea at this SO question)
# Convert the two columns DataFrame into a Python's dictionary.
letters_dict = dict(letters_ids.iter_rows())
# Maps the dictionary
many_letters = many_letters.with_columns(
pl.col('Letters').map_dict(letters_dict).alias('letters_mapped')
)
Just as an information when considering performance, both Polars approaches above are faster to do the job then Pandas:
- Pandas mapping execution time (for 10 million rows): average 0.35 seconds
- Polars mapping execution time (for 10 million rows): average 0.21 seconds (Approach # 1)
- Polars mapping execution time (for 10 million rows): average 0.20 seconds (Approach # 2)
Which of these is preferred or is there a 3rd approach which may be better still?