"Vlookup" on Dataframes with multiple conditions in Python (Polars)

Question

I would like to compare two excel-tables and add data from one to the other.

Specifically, I have a table for "sorting information", which is structured as follows:

street	Hnr_from	Hnr_to	hnr_cond	Postcode	City	SortInfo
Hasel	6	8	0	49082	Osnb	OS1-01
Hasel	1	5	1	49083	Osnb	OS1-02
Mosel	4	6	2	49084	Munst	MU1-01

The hnr_cond is there to know if only the odd (1), only the even (2), or all the house numbers (0) belong in the range "from", "to" in this district.

My Data-Frame i want to Compare this to looks like the following:

street	Hnr	Postcode	City
Hasel	7	49082	Osnb
Mosel	4	49084	Munst

My first attempt to find out how to work with Polars in general looked like this:

import polars as pl

xlsx_file = "Path/to/Sortinfo-file.xlsx"
df = pl.read_excel(xlsx_file)

searched_nr = 3
searched_str = "Mosel"
searched_postc = 49088
searched_city = "Osnbr"

filtered_df = df.filter(
    (searched_nr >= df['Hnr_from']) &
    (searched_nr <= df['Hnr_to']) &
    (((searched_nr % 2 == 0) & (df['hnr_cond'] == '2')) |
     ((searched_nr % 2 == 1) & (df['hnr_cond'] == '1')) |
     (df['hnr_cond'] == '0')) &
    (searched_str == df['street']) &
    (searched_postc == df['Postcode']) &
    (searched_city == df['City'])
)

SortInfo = None
if not filtered_df.is_empty():
    bezeichnung = filtered_df.to_pandas()['bezirk'].iloc[0]

print(SortInfo)

I also tried to load the other Dataframe as a file but i only got that far that it read the first line and compared it to the SortInfo_df but it dismissed everything after the first line.

My End Result should be like the Following:

A Dataframe with:

street	Hnr	Postcode	City	SortInfo
Hasel	7	49082	Osnb	OS1-01
Mosel	4	49084	Munst	MU1-01

I tried to load the other file like this:

xlsx_file = "path/to/other-file.xlsx"
df_searched = pl.read_excel(xlsx_file)

searched_number = df_seached['Hnr']
searched_street = df_searched['street']
searched_postcode = df_searched['Postcode']
searched_City = df_searched['City']

Note i quckliy translated everything that was not in english. If i missed something or did something wrong please inform me.

score 0 · Answer 1 · answered Aug 19 '23 at 22:30

It looks like .join_asof() problem.

https://pola-rs.github.io/polars-book/user-guide/transformations/joins/#asof-join

You can use the by argument to specify columns that should be matched for equality.

table = pl.DataFrame(
    {
        "street": ["Hasel", "Hasel", "Mosel"],
        "Hnr_from": [6, 1, 4],
        "Hnr_to": [8, 5, 6],
        "hnr_cond": [0, 1, 2],
        "Postcode": [49082, 49083, 49084],
        "City": ["Osnb", "Osnb", "Munst"],
        "SortInfo": ["OS1-01", "OS1-02", "MU1-01"],
    }
)

df = pl.DataFrame(
    {
        "street": ["Hasel", "Mosel"],
        "Hnr": [7, 4],
        "Postcode": [49082, 49084],
        "City": ["Osnb", "Munst"],
    }
)

df.join_asof(
    table,
    by=["street", "Postcode", "City"],
    left_on="Hnr",
    right_on="Hnr_to",
    strategy="forward",
)

shape: (2, 8)
┌────────┬─────┬──────────┬───────┬──────────┬────────┬──────────┬──────────┐
│ street ┆ Hnr ┆ Postcode ┆ City  ┆ Hnr_from ┆ Hnr_to ┆ hnr_cond ┆ SortInfo │
│ ---    ┆ --- ┆ ---      ┆ ---   ┆ ---      ┆ ---    ┆ ---      ┆ ---      │
│ str    ┆ i64 ┆ i64      ┆ str   ┆ i64      ┆ i64    ┆ i64      ┆ str      │
╞════════╪═════╪══════════╪═══════╪══════════╪════════╪══════════╪══════════╡
│ Hasel  ┆ 7   ┆ 49082    ┆ Osnb  ┆ 6        ┆ 8      ┆ 0        ┆ OS1-01   │
│ Mosel  ┆ 4   ┆ 49084    ┆ Munst ┆ 4        ┆ 6      ┆ 2        ┆ MU1-01   │
└────────┴─────┴──────────┴───────┴──────────┴────────┴──────────┴──────────┘

You can then .filter() the result with the Hnr_* conditions you have specified.

Hello, thanks for the quick response! This seems correct. When I first tried it with my filter options, I mistyped my searches and was wondering if there is a problem if I skip a row, but this works. My other concern is now that I want to keep the "wrong" addresses that aren't listed in my SortInfo. So if I have an address like: Wesel | 10 | 49090 | Munst, it should become a "NaN" in the new list. I also finally managed to load the dataframes as xlsx files correctly, so this problem can be checked as well. — Yannik Brüggemann, Aug 19 '23 at 23:21

"Vlookup" on Dataframes with multiple conditions in Python (Polars)

1 Answers1