Extract country name from text in column to create another column

Question

I have tried different combinations to extract the country names from a column and create a new column containing only the countries. I can do it for selected rows, e.g. df.address[9998], but not for the whole column.

import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)

Any ideas what is going wrong here?

edit:

address is an object in the df and

df.address[:10] looks like this

       Address
0    Turin, Italy        
1    NaN                 
2    Zurich, Switzerland 
3    NaN                 
4    Glyfada, Greece     
5    Frosinone, Italy    
6    Dublin, Ireland     
7    NaN                 
8    Turin, Italy        
8   ...                  
9    Kristiansand, Norway
Name: address, Length: 10, dtype: object

Based on Petar's response, when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or ranges like df.address[:5]), I get an empty Cntr.

    import pycountry
    Cntr = []
    for country in pycountry.countries:
        if country.name in df['address'][1]:
            Cntr.append(country.name)
    Cntr

returns [Italy]

and df.address[2] returns [ ] 
etc.

I have also run df['address'] = df['address'].astype('str')

to make sure that there are no floats or int in the column.

Welcome to StackOverflow. See [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). We cannot effectively help you until you post your MRE code and accurately specify the problem. We should be able to paste your posted code into a text file and reproduce the problem you specified. Your posted code depends upon an undefined data frame, and you haven't demonstrated a problem for us to fix. Don't forget to trace your program (`print` statements are a good start) to check on data types and contents. — Prune, Jan 22 '20 at 18:15
can you show the `df`, a for loop is almost never the solution, a regex might be better — Kenan, Jan 22 '20 at 22:43
Variable and function names should follow the `lower_case_with_underscores` style. I agree with @Kenan, a loop likely isn't necessary here. Also, I would really recommend using `[ ]` for DataFrame column access, instead of the dot/`.`/attribute style. — AMC, Jan 22 '20 at 23:13
Thank you all! I will make sure to use proper naming in my code. I have added the first 10 lines of the feature to make it clearer. If there is anything else I can add, please let me know. Also, @Kenan, I didn't know how to make it work with regex. I tried `df['address_new'] = df['address'].astype(str).str.split().str[1]`, but it did not work well, so I decided to try pycountry... — newpy, Jan 22 '20 at 23:40

score 0 · Answer 1 · answered Jan 22 '20 at 22:33

You were really close. You cannot write the inner loop as `for country.name in df.address`. Instead:

import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)

If this does not work, please supply more information because I am unsure what df.address looks like.
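
One thing that may explain the empty list: in pandas, `x in some_series` tests the index labels, not the values, so the membership check has to run on each address string separately. Below is a minimal sketch of that row-wise check. The `country_names` list and the `find_country` helper are hypothetical stand-ins for illustration (in practice the names could come from `pycountry.countries`), and the sample frame is made up from the addresses shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the names that pycountry.countries would supply.
country_names = ["Italy", "Switzerland", "Greece", "Ireland", "Norway"]

# Made-up sample based on the addresses shown in the question.
df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", "Glyfada, Greece"]}
)


def find_country(address):
    """Return the first known country name found in the address, else NaN."""
    if not isinstance(address, str):
        return np.nan  # missing values (NaN) are floats, so skip them
    for name in country_names:
        if name in address:  # substring test on a single string
            return name
    return np.nan


# Apply the check to every row instead of to the Series as a whole.
df["country"] = df["address"].apply(find_country)
```

A substring test like this can misfire when one country name is contained in another, so it is only a starting point.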

Petar Luketina

Thank you Petar! I have edited the question based on your response. Unfortunately, although I can get individual countries, I cannot get results for the whole column yet – newpy Jan 22 '20 at 23:33

score 0 · Accepted Answer · answered Jan 23 '20 at 00:12

Sample dataframe, with pandas imported as pd and numpy as np:

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})

df[['city', 'country']] = df['address'].str.split(',', expand=True, n=2)

               address     city       country
0         Turin, Italy    Turin         Italy
1                  NaN      NaN           NaN
2  Zurich, Switzerland   Zurich   Switzerland
3                  NaN      NaN           NaN
4      Glyfada, greece  Glyfada        greece

Kenan

I get this error: "Columns must be same length as key". I don't know if it is relevant, but looking at the first 100 values I can see that I have some instances like "65 Αθηνα" or "91 France". I thought that it might somehow be related to "nan" values, so I changed them to "None", but the problem persists. I also checked other related questions on this issue but I haven't found anything useful yet. – newpy Jan 23 '20 at 07:13
OK. I think it works like this `df[['city', 'country']] = df['address'].str.split(',', expand=True, n=1)` but I guess in this case I lose the countries that are in the format "country" instead of "city, country" – newpy Jan 23 '20 at 07:40
I don't think you will; try it out. You can always `fillna` the `country` column with `address` – Kenan Jan 23 '20 at 14:40
I finally used this code: `df[['city', 'or']] = df['Ror'].str.split(',', expand=True, n=1)` followed by `df['or'].fillna('NaN', inplace=True)` – newpy Feb 06 '20 at 15:11
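
The points raised in these comments can be combined into one step. With `n=1` the split produces at most two parts, which is why it avoids the "Columns must be same length as key" error. A possible variation, sketched here under the assumption that the country is always the last comma-separated piece, is to split from the right and keep the final part, so that an entry with no comma (such as a bare "France") is kept as the country instead of being lost. The sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample: "city, country" entries, a bare country, and a missing value.
df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "France", "Zurich, Switzerland"]}
)

# Split once from the right, keep the last piece, and trim the space after the comma.
# Entries without a comma yield a one-element list, so they are kept whole.
# Missing values stay missing.
df["country"] = df["address"].str.rsplit(",", n=1).str[-1].str.strip()
```

Entries such as "91 France" would still need a separate clean-up step, since they contain no comma to split on.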

victoria55 · Answer 3 · 2021-02-23T05:23:22.937

You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.

import numpy as np
import pandas as pd
from dataprep.clean import clean_country
df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
               address address_clean
0         Turin, Italy         Italy
1                  NaN           NaN
2  Zurich, Switzerland   Switzerland
3                  NaN           NaN
4      Glyfada, Greece        Greece
