Creating a new column of a data frame using a dictionary with partial string matching

Question

I'm new to coding and any help would be appreciated.

The function should take a pandas dataframe as input, extract the city from each tweet using the hasDict dictionary given below, and insert the result into a new column named 'City' in the same dataframe. Use np.nan when a city is not found.

My code works when the dictionary and df have the same number of entries but as soon as I add an entry to the df I get "IndexError: list index out of range". I need it to work on a df with more entries than the dictionary. (The dataset is actually bigger, I have created a smaller example here).

import pandas as pd
import numpy as np
details = {'Tweets':['Whatever #JHB', 'Yes #CPT']}
df = pd.DataFrame(details)
print(df)
          Tweets
0  Whatever #JHB
1       Yes #CPT


hasDict = {'#JHB':'JHB','#CPT':'CPT'}

df['City'] = df['Tweets'].apply(lambda x : [hasDict[city] for city 
in hasDict if city in x][0]).fillna(np.nan)

Output
            Tweets  City
0   Whatever #JHB   JHB
1       Yes #CPT    CPT

But when the df is bigger:

details = {'Tweets':['Whatever #JHB', 'Yes #CPT', 'Hello #PE']}

I get

IndexError: list index out of range

The line below seems to work, but I'm trying to understand the regex part: is the string passed to str.extract a regex? And do I always need to import a regex module? (I'm not supposed to import anything except pandas and numpy for the assignment.)

df['City'] = df['Tweets'].str.extract('('+'|'.join(hasDict.keys())+')', expand=False).map(hasDict).fillna(np.nan)

Hi. Yes, there are hashtags in the original df column that will not match any keys in the dictionary, I hope I've understood and that answers the question. — Phillippa, May 10 '23 at 07:19

score 0 · Answer 1 · Corralien · 2023-05-10T12:28:52.177

You can tokenize the tweets with findall, explode the result, and then map the hashtags with your dictionary:

df['City'] = df['Tweets'].str.findall(r"(#\w+)").explode().map(hasDict)
print(df)

# Output
          Tweets City
0  Whatever #JHB  JHB
1       Yes #CPT  CPT
2      Hello #PE  NaN

About regex:

(        <- start of capture group (what I want in the output)
  #      <- hashtag
    \w+  <- one or more word characters (letters, digits, underscore)
)        <- end of capture group

You don't need to import the re module yourself; pandas already uses it internally for its string methods.
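To see what the pattern picks up, here is a small sketch on made-up tweets (the sample strings are assumptions, not data from the question). It suggests that a trailing colon is left out of the match and that a tweet can yield more than one hashtag:

```python
import pandas as pd

# Hypothetical sample tweets, including a hashtag followed by a colon
tweets = pd.Series(['Whatever #JHB: today', 'Yes #CPT and #JHB', 'No hashtag here'])

# \w+ stops at the first character that is not a letter, digit or underscore,
# so the colon after #JHB is not part of the match
found = tweets.str.findall(r"(#\w+)")
print(found.tolist())
# [['#JHB'], ['#CPT', '#JHB'], []]
```

Each row gets a list, which is why explode is needed before map, and why a tweet with two hashtags turns into two rows.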

edited May 10 '23 at 12:28

answered May 09 '23 at 22:30

Corralien


Thank you, this seems to add the value to the new column in the row where the key first appears. But if the key appears in a row below that it does not add the value to that row. Does that make sense? – Phillippa May 10 '23 at 07:31
Can you provide an example in `details` variable where it doesn't work, please? – Corralien May 10 '23 at 07:39
Sorry, I think I figured out that the problem is actually different. It's when, for example, #JHB appears as #JHB: in the original column. The colon seems to result in the value not being added to the new column, instead NaN appears in the new column. – Phillippa May 10 '23 at 07:48
@Phillippa. I updated my answer according to your comment. Can you check it please? – Corralien May 10 '23 at 09:17
Thank you. I get this error: ValueError: cannot reindex on an axis with duplicate labels – Phillippa May 10 '23 at 11:02
I have added a different line to my original post that seems to work, I just can't find where I got this from right now so I am not entirely sure I completely understand the regex part. Do you have any comments? I have seen that sometimes people import regex before using it but mine seems to work without importing, is that right? Thank you so much for the help, I really appreciate it. – Phillippa May 10 '23 at 11:38
@Phillippa. The ValueError occurs because `findall` finds multiple matches in one tweet (multiple cities), whereas `extract` takes only the first one. What do you want to do if there are multiple cities? Keep only the first or keep them all? – Corralien May 10 '23 at 12:33
I’m not sure right now, the expected output does not show these entries so I’ll have to submit and see whether it’s correct and try again if not. I imagine I would need to keep multiple cities from the same tweet if they have different names. Thank you. – Phillippa May 10 '23 at 13:31
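For the multiple-cities case discussed in the comments above, one possible approach is to collapse the exploded rows back to one row per tweet before assigning. This is only a sketch under assumptions: the sample tweets and the 'Cities' column name are made up for illustration, and it is not necessarily what the assignment expects.

```python
import pandas as pd

# Hypothetical data: the first tweet mentions two known cities
df = pd.DataFrame({'Tweets': ['Whatever #JHB: and #CPT', 'Yes #CPT', 'Hello #PE']})
hasDict = {'#JHB': 'JHB', '#CPT': 'CPT'}

# One row per hashtag; the row label repeats for tweets with several hashtags
tags = df['Tweets'].str.findall(r"(#\w+)").explode()
cities = tags.map(hasDict)

# Option A: keep the first known city per tweet (first() skips missing values)
df['City'] = cities.groupby(level=0).first()

# Option B: keep every known city per tweet as a list
df['Cities'] = cities.dropna().groupby(level=0).agg(list)

print(df)
```

Grouping on the index level restores unique row labels, so the assignment no longer hits the duplicate-label error. Tweets without a known city end up with a missing value in both columns.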

score 0 · Answer 2 · answered May 09 '23 at 23:43

You don't actually need the dictionary mapping as long as the cities always begin with #, as you can then use a regex:

import re
df["City"] = df["Tweets"].apply(lambda x: re.search("([^#]+)$", x).group(1))

ags29

Thank you. I didn't mention that I don't think I am allowed to import anything other than pandas and numpy for the assignment but thank you. – Phillippa May 10 '23 at 07:21
you can use essentially the same solution with pandas only as follows: `df["Tweets"].str.extract("([^#]+)$")` – ags29 May 10 '23 at 14:37
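A quick sketch of how the pandas-only version behaves, using made-up tweets (the sample data and the `known` variable are assumptions for illustration). Since the pattern ignores the dictionary, it returns whatever follows the last #, so unknown hashtags and trailing punctuation come through as-is. Filtering against the dictionary values is one way to get missing values back:

```python
import pandas as pd

df = pd.DataFrame({'Tweets': ['Whatever #JHB', 'Yes #CPT', 'Hello #PE', 'Bye #JHB:']})
hasDict = {'#JHB': 'JHB', '#CPT': 'CPT'}

# expand=False returns a Series rather than a one-column DataFrame
df['City'] = df['Tweets'].str.extract(r"([^#]+)$", expand=False)
print(df['City'].tolist())
# ['JHB', 'CPT', 'PE', 'JHB:']

# Keep only values that are known cities; everything else becomes missing
known = df['City'].where(df['City'].isin(list(hasDict.values())))
print(known.tolist())
```

Note that 'JHB:' is treated as unknown here because of the colon, so this variant is stricter than the substring check in the question.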

score 0 · Answer 3 · answered May 10 '23 at 11:22

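This is how I implemented your code and it works. Your original line fails because the list comprehension returns an empty list when no key matches the tweet, so indexing it with [0] raises the IndexError. A helper function that returns np.nan when nothing matches avoids that: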

import pandas as pd
import numpy as np

# The third tweet has a hashtag that is not a key in hasDict
details = {'Tweets': ['Whatever #JHB', 'Yes #CPT', 'Hello #PE']}
df = pd.DataFrame(details)
hasDict = {'#JHB': 'JHB', '#CPT': 'CPT'}

def find_city(tweet):
    # Return the city for the first dictionary key found in the tweet
    for key, city in hasDict.items():
        if key in tweet:
            return city
    # No key matched: return NaN instead of indexing an empty list
    return np.nan

df['City'] = df['Tweets'].apply(find_city)
print(df.head(5))

I hope it helps.
