
I have tried different combinations to extract the country names from a column and create a new column containing only the countries. I can do it for selected rows, e.g. df.address[9998], but not for the whole column.

import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)

Any ideas what is going wrong here?

edit:

address is an object-dtype column in df, and df.address[:10] looks like this:

       Address
0    Turin, Italy        
1    NaN                 
2    Zurich, Switzerland 
3    NaN                 
4    Glyfada, Greece     
5    Frosinone, Italy    
6    Dublin, Ireland     
7    NaN                 
8    Turin, Italy        
8   ...                  
9    Kristiansand, Norway
Name: address, Length: 10, dtype: object

Based on Petar's response, when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or use ranges like df.address[:5]), I get an empty Cntr.

    import pycountry
    Cntr = []
    for country in pycountry.countries:
        if country.name in df['address'][1]:
            Cntr.append(country.name)
    Cntr

returns [Italy], and df.address[2] returns [ ], etc.

I have also run df['address'] = df['address'].astype('str')

to make sure that there are no floats or int in the column.

Kenan
newpy
  • Welcome to StackOverflow. See [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). We cannot effectively help you until you post your MRE code and accurately specify the problem. We should be able to paste your posted code into a text file and reproduce the problem you specified. Your posted code depends upon an undefined data frame, and you haven't demonstrated a problem for us to fix. Don't forget to trace your program (`print` statements are a good start) to check on data types and contents. – Prune Jan 22 '20 at 18:15
  • can you show the `df`, a for loop is almost never the solution, a regex might be better – Kenan Jan 22 '20 at 22:43
  • Variable and function names should follow the `lower_case_with_underscores` style. I agree with @Kenan, a loop likely isn't necessary here. Also, I would really recommend using `[ ]` for DataFrame column access, instead of the dot/`.`/attribute style. – AMC Jan 22 '20 at 23:13
  • Thank you all! I will make sure to use proper naming in my code. I have added the first 10 lines of the feature to make it clearer. If there is anything else I can add, please let me know. Also, @Kenan I didn't know how I could make it work with regex. I tried this df['address_new'] = df['address'].astype(str).str.split().str[1], but it did not end up well, so I decided to try pycountry... – newpy Jan 22 '20 at 23:40
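The regex idea from the comments could look roughly like the sketch below. It is only an illustration: `country_names` is a hypothetical hard-coded stand-in (with pycountry installed, the names could come from `[country.name for country in pycountry.countries]`), and the sample frame is made up from the rows shown in the question.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in list; with pycountry this could instead be
# [country.name for country in pycountry.countries].
country_names = ["Italy", "Switzerland", "Greece", "Ireland", "Norway"]

df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", "Glyfada, Greece", "91 France"]}
)

# Build one alternation pattern. Names are escaped in case they contain regex
# metacharacters, and sorted longest first so a longer name is tried before a
# shorter name it contains.
ordered = sorted(country_names, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b"

# expand=False returns a Series with the first match per row; rows that are
# missing or contain no listed name stay missing.
df["country"] = df["address"].str.extract(pattern, expand=False)
print(df)
```

Rows whose country is not in the list (here "91 France") come back as missing, so the quality of the result depends entirely on how complete the name list is.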

3 Answers


You were really close. You cannot use country.name as the loop variable, as in for country.name in df.address. Instead:

import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)

If this does not work, please supply more information because I am unsure what df.address looks like.

Petar Luketina
  • Thank you Petar! I have edited the question based on your response. Unfortunately, although I can get individual countries, I cannot get results for the whole column yet – newpy Jan 22 '20 at 23:33
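One likely reason the whole-column version comes back empty: on a pandas Series, `in` tests the index labels rather than the values, so `country.name in df.address` (or any slice of it) compares the name against 0, 1, 2, ... The sketch below shows that behaviour and a per-row alternative. `country_names` and `find_country` are hypothetical names used for illustration; the list stands in for what pycountry would supply.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for [country.name for country in pycountry.countries].
country_names = ["Italy", "Switzerland", "Greece"]

addresses = pd.Series(["Turin, Italy", np.nan, "Zurich, Switzerland"], name="address")

# `in` on a Series looks at the index labels (0, 1, 2), not at the values.
print("Italy" in addresses)  # False
print(0 in addresses)        # True


def find_country(address):
    """Return the first listed country name found in one address, else None."""
    if not isinstance(address, str):  # missing values are not strings, skip them
        return None
    for name in country_names:
        if name in address:  # plain substring test on a single string
            return name
    return None


# Run the per-string check on every row instead of on the Series as a whole.
countries = addresses.apply(find_country)
print(countries)
```

The result has one entry per row, so it can be assigned straight back as a new column, e.g. `df['country'] = df['address'].apply(find_country)`.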

Sample dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})

df[['city', 'country']] = df['address'].str.split(',', expand=True, n=2)

               address     city       country
0         Turin, Italy    Turin         Italy
1                  NaN      NaN           NaN
2  Zurich, Switzerland   Zurich   Switzerland
3                  NaN      NaN           NaN
4      Glyfada, greece  Glyfada        greece
Kenan
  • I get this error: "Columns must be same length as key". I don't know if it is relevant, but looking at the first 100 values I can see that I have some instances like "65 Αθηνα" or "91 France". I thought that it might somehow be related to "nan" values, so I changed them to "None", but the problem persists. I also checked other questions related to this issue but I haven't found anything useful yet. – newpy Jan 23 '20 at 07:13
  • OK. I think it works like this `df[['city', 'country']] = df['address'].str.split(',', expand=True, n=1)` but I guess in this case I lose the countries that are in the format "country" instead of "city, country" – newpy Jan 23 '20 at 07:40
  • I don't think you will. Try it out; you can always `fillna` the `country` column with `address` – Kenan Jan 23 '20 at 14:40
  • I finally used this code ''' df[['city', 'or']] = df['Ror'].str.split(',', expand=True, n=1) df['or'].fillna('NaN', inplace=True) ''' – newpy Feb 06 '20 at 15:11
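Putting the comments above together, a split that tolerates extra commas and comma-free addresses might look like the sketch below. The sample rows are invented to mirror the cases mentioned ("91 France", a value with two commas), and it assumes at least one address contains a comma so that the split yields two columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", "91 France", "Rome, Lazio, Italy"]}
)

# rsplit with n=1 splits only at the last comma, so at most two columns come
# back and the "Columns must be same length as key" error is avoided.
parts = df["address"].str.rsplit(",", n=1, expand=True)

# Rows without a comma have nothing in the second column.
has_comma = parts[1].notna()

# Keep a city only where a comma was present; strip the space left after it.
df["city"] = parts[0].where(has_comma).str.strip()

# Fall back to the full address when there was no comma, as suggested above.
df["country"] = parts[1].str.strip().fillna(df["address"])
print(df)
```

Comma-free values such as "91 France" end up unchanged in `country`, so they still need a separate cleanup step if only the country name is wanted.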

You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.

import numpy as np
import pandas as pd
from dataprep.clean import clean_country
df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
               address address_clean
0         Turin, Italy         Italy
1                  NaN           NaN
2  Zurich, Switzerland   Switzerland
3                  NaN           NaN
4      Glyfada, Greece        Greece
victoria55