It looks as though you might be trying to update the values of DataFrame, particularly where values are missing (the "N/A" values).
In addition the advice of @jvz, I would recommend using a left join
for your purposes, rather than using a dictionary and a for loop. Using for loops is very slow, and is to be avoided. By contrast, a left join
will be very performant, and is built for exactly these types of situations.
We'll take this in steps.
First, let's first expand your example.
df = pl.DataFrame(
{"IP": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
"ISP": ["N/A", "N/A", "PQR", "N/A"]}
)
df
shape: (4, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
Notice that we have three rows with "N/A" values, but one row that already has a valid value, "PQR".
Next, let's convert your dictionary of updated ISP values to a DataFrame, so that we can join the two DataFrames.
isp_df = pl.DataFrame(
data=[[key, value] for key, value in isp_names.items()],
columns=["IP", "ISP_updated"],
orient="row",
)
isp_df
shape: (2, 2)
┌─────────┬─────────────┐
│ IP ┆ ISP_updated │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════════════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────────────┘
Now, we simply join
the two DataFrames. The how="left"
ensures that we keep all rows from df
, even if there are no corresponding rows in isp_df
.
df.join(isp_df, on="IP", how="left")
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP ┆ ISP ┆ ISP_updated │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ null │
└─────────┴─────┴─────────────┘
Notice the null
values in ISP_updated
. These are cases where you had no updated values for a particular IP
value.
To complete the process, we use fill_null
to copy the values from the ISP
column into the ISP_updated
column for those cases where isp_df
had no updates for a particular IP
value.
(
df
.join(isp_df, on="IP", how="left")
.with_column(
pl.col("ISP_updated").fill_null(pl.col("ISP"))
)
)
shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP ┆ ISP ┆ ISP_updated │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ N/A │
└─────────┴─────┴─────────────┘
Now, your ISP_updated
column has the updated values for each ISP. If you want, you can drop and rename columns so that your final column is labeled ISP
.
(
df
.join(isp_df, on="IP", how="left")
.with_column(
pl.col("ISP_updated").fill_null(pl.col("ISP"))
)
.drop("ISP")
.rename({"ISP_updated": "ISP"})
)
shape: (4, 2)
┌─────────┬─────┐
│ IP ┆ ISP │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘
As the size of your DataFrames gets large, you will definitely want to avoid using for loops. Using join
will be far faster.