2

I would like to compare two excel-tables and add data from one to the other.

Specifically, I have a table for "sorting information", which is structured as follows:

street Hnr_from Hnr_to hnr_cond Postcode City SortInfo
Hasel 6 8 0 49082 Osnb OS1-01
Hasel 1 5 1 49083 Osnb OS1-02
Mosel 4 6 2 49084 Munst MU1-01

The hnr_cond is there to know if only the odd (1), only the even (2), or all the house numbers (0) belong in the range "from", "to" in this district.

My Data-Frame i want to Compare this to looks like the following:

street Hnr Postcode City
Hasel 7 49082 Osnb
Mosel 4 49084 Munst

My first attempt to find out how to work with Polars in general looked like this:

import polars as pl

xlsx_file = "Path/to/Sortinfo-file.xlsx"
df = pl.read_excel(xlsx_file)

searched_nr = 3
searched_str = "Mosel"
searched_postc = 49088
searched_city = "Osnbr"

filtered_df = df.filter(
    (searched_nr >= df['Hnr_from']) &
    (searched_nr <= df['Hnr_to']) &
    (((searched_nr % 2 == 0) & (df['hnr_cond'] == '2')) |
     ((searched_nr % 2 == 1) & (df['hnr_cond'] == '1')) |
     (df['hnr_cond'] == '0')) &
    (searched_str == df['street']) &
    (searched_postc == df['Postcode']) &
    (searched_city == df['City'])
)

SortInfo = None
if not filtered_df.is_empty():
    bezeichnung = filtered_df.to_pandas()['bezirk'].iloc[0]

print(SortInfo)

I also tried to load the other Dataframe as a file but i only got that far that it read the first line and compared it to the SortInfo_df but it dismissed everything after the first line.

My End Result should be like the Following:

A Dataframe with:

street Hnr Postcode City SortInfo
Hasel 7 49082 Osnb OS1-01
Mosel 4 49084 Munst MU1-01

I tried to load the other file like this:

xlsx_file = "path/to/other-file.xlsx"
df_searched = pl.read_excel(xlsx_file)

searched_number = df_seached['Hnr']
searched_street = df_searched['street']
searched_postcode = df_searched['Postcode']
searched_City = df_searched['City']

Note i quckliy translated everything that was not in english. If i missed something or did something wrong please inform me.

1 Answers1

0

It looks like .join_asof() problem.

https://pola-rs.github.io/polars-book/user-guide/transformations/joins/#asof-join

You can use the by argument to specify columns that should be matched for equality.

table = pl.DataFrame(
    {
        "street": ["Hasel", "Hasel", "Mosel"],
        "Hnr_from": [6, 1, 4],
        "Hnr_to": [8, 5, 6],
        "hnr_cond": [0, 1, 2],
        "Postcode": [49082, 49083, 49084],
        "City": ["Osnb", "Osnb", "Munst"],
        "SortInfo": ["OS1-01", "OS1-02", "MU1-01"],
    }
)

df = pl.DataFrame(
    {
        "street": ["Hasel", "Mosel"],
        "Hnr": [7, 4],
        "Postcode": [49082, 49084],
        "City": ["Osnb", "Munst"],
    }
)

df.join_asof(
    table,
    by=["street", "Postcode", "City"],
    left_on="Hnr",
    right_on="Hnr_to",
    strategy="forward",
)
shape: (2, 8)
┌────────┬─────┬──────────┬───────┬──────────┬────────┬──────────┬──────────┐
│ street ┆ Hnr ┆ Postcode ┆ City  ┆ Hnr_from ┆ Hnr_to ┆ hnr_cond ┆ SortInfo │
│ ---    ┆ --- ┆ ---      ┆ ---   ┆ ---      ┆ ---    ┆ ---      ┆ ---      │
│ str    ┆ i64 ┆ i64      ┆ str   ┆ i64      ┆ i64    ┆ i64      ┆ str      │
╞════════╪═════╪══════════╪═══════╪══════════╪════════╪══════════╪══════════╡
│ Hasel  ┆ 7   ┆ 49082    ┆ Osnb  ┆ 6        ┆ 8      ┆ 0        ┆ OS1-01   │
│ Mosel  ┆ 4   ┆ 49084    ┆ Munst ┆ 4        ┆ 6      ┆ 2        ┆ MU1-01   │
└────────┴─────┴──────────┴───────┴──────────┴────────┴──────────┴──────────┘

You can then .filter() the result with the Hnr_* conditions you have specified.

jqurious
  • 9,953
  • 1
  • 4
  • 14
  • Hello, thanks for the quick response! This seems correct. When I first tried it with my filter options, I mistyped my searches and was wondering if there is a problem if I skip a row, but this works. My other concern is now that I want to keep the "wrong" addresses that aren't listed in my SortInfo. So if I have an address like: Wesel | 10 | 49090 | Munst, it should become a "NaN" in the new list. I also finally managed to load the dataframes as xlsx files correctly, so this problem can be checked as well. – Yannik Brüggemann Aug 19 '23 at 23:21