how to update the polars dataframe

Question

I want to update a polars library dataframe,

polars syntax/command which I used for the purpose: df[0, 'A'] = 'some value'

but the above code gives an error: ValueError: cannot set with list/tuple as value; use a scalar value I am using polars 0.13.55

The above coode was previously working in polars 0.13.51

Minimal Code to reproduce the problem:

df = pl.DataFrame( { "IP": ['1.1.1.1', '2.2.2.2'], "ISP" :   
                    ["N/A", "N/A"] } )

isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }

i = 0
for row in df.rows():
    for ip, isp in isp_names.items():
        if(row[0] == ip):
            df[i, 'ISP'] = isp #**This line gives the Value error**
    i = i + 1

Have you got a minimal working example to reproduce your problem? — ritchie46, Jul 20 '22 at 10:40
I updated the question with the minimal code to reproduce the problem. — Wajahat Raza, Jul 22 '22 at 05:18

score 1 · Answer 1 · answered Jul 24 '22 at 14:08

It looks as though you might be trying to update the values of DataFrame, particularly where values are missing (the "N/A" values).

In addition the advice of @jvz, I would recommend using a left join for your purposes, rather than using a dictionary and a for loop. Using for loops is very slow, and is to be avoided. By contrast, a left join will be very performant, and is built for exactly these types of situations.

We'll take this in steps.

First, let's first expand your example.

df = pl.DataFrame(
    {"IP": ["1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"],
     "ISP": ["N/A", "N/A", "PQR", "N/A"]}
)
df

shape: (4, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘

Notice that we have three rows with "N/A" values, but one row that already has a valid value, "PQR".

Next, let's convert your dictionary of updated ISP values to a DataFrame, so that we can join the two DataFrames.

isp_df = pl.DataFrame(
    data=[[key, value] for key, value in isp_names.items()],
    columns=["IP", "ISP_updated"],
    orient="row",
)
isp_df

shape: (2, 2)
┌─────────┬─────────────┐
│ IP      ┆ ISP_updated │
│ ---     ┆ ---         │
│ str     ┆ str         │
╞═════════╪═════════════╡
│ 1.1.1.1 ┆ ABC         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ         │
└─────────┴─────────────┘

Now, we simply join the two DataFrames. The how="left" ensures that we keep all rows from df, even if there are no corresponding rows in isp_df.

df.join(isp_df, on="IP", how="left")

shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP      ┆ ISP ┆ ISP_updated │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ str ┆ str         │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ null        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ null        │
└─────────┴─────┴─────────────┘

Notice the null values in ISP_updated. These are cases where you had no updated values for a particular IP value.

To complete the process, we use fill_null to copy the values from the ISP column into the ISP_updated column for those cases where isp_df had no updates for a particular IP value.

(
    df
    .join(isp_df, on="IP", how="left")
    .with_column(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
)

shape: (4, 3)
┌─────────┬─────┬─────────────┐
│ IP      ┆ ISP ┆ ISP_updated │
│ ---     ┆ --- ┆ ---         │
│ str     ┆ str ┆ str         │
╞═════════╪═════╪═════════════╡
│ 1.1.1.1 ┆ N/A ┆ ABC         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.2.2.2 ┆ N/A ┆ XYZ         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR ┆ PQR         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A ┆ N/A         │
└─────────┴─────┴─────────────┘

Now, your ISP_updated column has the updated values for each ISP. If you want, you can drop and rename columns so that your final column is labeled ISP.

(
    df
    .join(isp_df, on="IP", how="left")
    .with_column(
        pl.col("ISP_updated").fill_null(pl.col("ISP"))
    )
    .drop("ISP")
    .rename({"ISP_updated": "ISP"})
)

shape: (4, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3.3.3.3 ┆ PQR │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4.4.4.4 ┆ N/A │
└─────────┴─────┘

As the size of your DataFrames gets large, you will definitely want to avoid using for loops. Using join will be far faster.

score 0 · Answer 2 · answered Jul 24 '22 at 08:24

I am unable to reproduce the error on the latest version (0.13.56), so updating polars may help.

May I also suggest two improvements to the code, where the second improvements avoids the issue you run into altogether?

First, a more Pythonic version:

df = pl.DataFrame( {"IP": ['1.1.1.1', '2.2.2.2'], 
                    "ISP": ["N/A", "N/A"] } )
isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }

for i, row in enumerate(df.rows()):
    df[i, 'ISP'] = isp_names[row[0]]

I.e., use enumerate to keep your i aligned with row, and do not loop isp_names separately but simply get the value by the key.

Second, Polars has an excellent expression system, meaning you do not have to pre-allocate ISP column or write a loop:

df = pl.DataFrame( { "IP": ['1.1.1.1', '2.2.2.2']})
isp_names = { '1.1.1.1' : 'ABC', '2.2.2.2' : 'XYZ' }
df.with_column(pl.col("IP").apply(isp_names.get).alias("ISP"))

which returns df as:

shape: (2, 2)
┌─────────┬─────┐
│ IP      ┆ ISP │
│ ---     ┆ --- │
│ str     ┆ str │
╞═════════╪═════╡
│ 1.1.1.1 ┆ ABC │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.2.2.2 ┆ XYZ │
└─────────┴─────┘

how to update the polars dataframe

2 Answers2