I'm new to coding and any help would be appreciated.
The function should take a pandas DataFrame as input, extract the city from each tweet using the hasDict dictionary given below, and insert the result into a new column named 'City' in the same DataFrame. Use np.nan when a city is not found.
My code works when the dictionary and the df have the same number of entries, but as soon as I add an entry to the df I get "IndexError: list index out of range". I need it to work on a df with more entries than the dictionary. (The real dataset is bigger; I have created a smaller example here.)
import numpy as np
import pandas as pd
details = {'Tweets':['Whatever #JHB', 'Yes #CPT']}
df = pd.DataFrame(details)
print(df)
Tweets
0 Whatever #JHB
1 Yes #CPT
hasDict = {'#JHB':'JHB','#CPT':'CPT'}
df['City'] = df['Tweets'].apply(lambda x : [hasDict[city] for city
in hasDict if city in x][0]).fillna(np.nan)
Output
Tweets City
0 Whatever #JHB JHB
1 Yes #CPT CPT
But when the df is bigger:
details = {'Tweets':['Whatever #JHB', 'Yes #CPT', 'Hello #PE']}
I get
IndexError: list index out of range
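As far as I can tell, the error comes from the `[0]` rather than from the sizes of the df and the dictionary: for 'Hello #PE' no key in hasDict matches, so the list comprehension produces an empty list and indexing it fails. A minimal sketch of one way around that, assuming a helper I have named extract_city (the name is made up for illustration), is to use next() with np.nan as the default:

```python
import numpy as np
import pandas as pd

hasDict = {'#JHB': 'JHB', '#CPT': 'CPT'}
df = pd.DataFrame({'Tweets': ['Whatever #JHB', 'Yes #CPT', 'Hello #PE']})

def extract_city(tweet):
    # Hypothetical helper: return the city for the first hashtag found
    # in the tweet, or np.nan when no key of hasDict appears in it.
    return next((city for tag, city in hasDict.items() if tag in tweet), np.nan)

df['City'] = df['Tweets'].apply(extract_city)
print(df)
```

With this, the '#PE' row should get NaN instead of raising, and only pandas and numpy are imported.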
The line below seems to work, but I'm trying to understand the regex part: is the string passed to str.extract a regex? And do I always need to import a regex module to use it? (For the assignment I'm not supposed to import anything except pandas and numpy.)
df['City'] = df['Tweets'].str.extract('('+'|'.join(hasDict.keys())+')', expand=False).map(hasDict).fillna(np.nan)
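To see what that line is doing, it may help to split it into steps. The sketch below assumes the same hasDict and a three-row df; the variable names pattern and tags are mine. The string handed to str.extract is treated as a regular expression, and pandas does the matching itself, so no separate import should be needed for this.

```python
import numpy as np
import pandas as pd

hasDict = {'#JHB': 'JHB', '#CPT': 'CPT'}
df = pd.DataFrame({'Tweets': ['Whatever #JHB', 'Yes #CPT', 'Hello #PE']})

# Build the pattern: the keys joined by '|' ("or"), wrapped in one capture group.
pattern = '(' + '|'.join(hasDict.keys()) + ')'
print(pattern)  # (#JHB|#CPT)

# str.extract returns the text captured by the group, or NaN when nothing matches.
tags = df['Tweets'].str.extract(pattern, expand=False)

# map() translates each captured hashtag to its city; NaN stays NaN.
df['City'] = tags.map(hasDict)
print(df)
```

Since unmatched rows are already NaN after str.extract and map, the trailing .fillna(np.nan) looks redundant here. One caveat: keys containing regex special characters (such as '.' or '+') would need escaping before being joined into the pattern.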