1

Why is a pandas.series.extract(regex) able to print the correct values, but won't assign the value to an existing variable using indexing or np.where.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']], 
    columns=['id', "Street", 'Postcode', 'FullAddress']
)

m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),'))                        # prints street
print(df["FullAddress"].str.extract(r'\b(\d{5})\b'))                   # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),')  # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))

What I'm trying to do is fill the empty Street and Postcode with the values from FullAddress - without disturbing the existing Street and Postcode values.

There is no problem with the indexing, the regex, or even the extract... I've read the docs, searched for anything similar... What does every get, but I don't understand!?!?!

DrWhat
  • 2,360
  • 5
  • 19
  • 33

2 Answers2

1

You missed expand=False as parameter of str.extract:

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')

             0  # <- it's not a Series but a DataFrame with one column
0  1 Banana St

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)

0    1 Banana St
Name: FullAddress, dtype: object  # <- now it's a Series

In the first version, Pandas can't align column labels Street vs 0. In the second version, the Series fit into the Street Series, so:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(\d{5})\b', expand=False)
print(df)

# Output
  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St  67898.0     2 Choco Rd, 69412 Eberbach

Update*: it's possible to use extract without expand=False using named groups (?P<xxx>...) to align column labels:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'\b(?P<Postcode>\d{5})\b')

# OR

pattern = r'(?P<Street>.+?),\s*\b(?P<Postcode>\d{5})\b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I had a feeling I'd be embarrassed by this question. In my real data there is only ever 1 comma, and only one 5 digit value, so I thought it would align as a dataframe or series. The doc.s actually show why it doesn't. Thanks for the clarification. I'll leave the dumb question with your clever answer in case there is someone else who doesn't know the difference between a series and a dataframe with one column. – DrWhat Mar 31 '23 at 11:16
  • 1
    No you should not. It's a good question. Text functions are not so obvious: difference between extract, extractall and find, findall... – Corralien Mar 31 '23 at 11:20
0

You can use .fillna to fill in the NaN values in your dataframe:

df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'\b(\d{5})\b')[0])

This will fill in all of your null values with the result of the extract while keeping all existing values:

  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St    67898     2 Choco Rd, 69412 Eberbach
rassar
  • 5,412
  • 3
  • 25
  • 41