5

I have some text that may or may not contain a country name. For example:

' Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study'

This is how I extract the country name from it. In my first attempt:

import pycountry

def findCountry(stringText):
    for country in pycountry.countries:
        if country.name.lower() in stringText.lower():
            return country.name
    return None

findCountry("Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study")

Unfortunately, this returns the wrong result, Niger, whereas the correct one is Nigeria. Note that Niger and Nigeria are two different countries.

In my second attempt:

def findCountry(stringText):
    full_list = []
    for country in pycountry.countries:
        if country.name.lower() in stringText.lower():
            # Collect the name (not the Country object) of every match.
            full_list.append(country.name)

    if len(full_list) > 0:
        return full_list

    return None

I get ['Niger', 'Nigeria'] as output, but I can't find a way to get Nigeria as my final output. How can I achieve this?

Note: here I know Nigeria is the correct answer, but later the code will have to choose the final country name itself whenever one is present in the text, and it needs to detect it with very high accuracy.

Talib Daryabi
  • https://stackoverflow.com/questions/48607339/how-to-extract-countries-from-a-text this is what you are looking for I suppose. – Maxima May 31 '21 at 04:59
  • Sort countries by the length of their names, in descending order. – Selcuk May 31 '21 at 05:01
  • @Tangent I am using the same library but different steps. As I already mentioned, I need the single correct answer, whereas I get the wrong one. – Talib Daryabi May 31 '21 at 05:01

4 Answers

7

Always search for longest strings first; this will prevent the kind of error you encountered.

countries = sorted(pycountry.countries, key=lambda x: -len(x.name))
Amadan
  • @Amadan Sorry, I could not understand where and how to use this code. Could you please give a hint? – Talib Daryabi May 31 '21 at 05:19
  • You are iterating over `pycountry.countries`, which is not sorted; iterating over these sorted `countries` instead should give you the correct answer. – Amadan May 31 '21 at 05:33
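To make the suggestion in the comment above concrete, here is a minimal sketch of the longest-first search. It assumes a small hand-written list, `COUNTRY_NAMES`, as a stand-in for the names in `pycountry.countries`; the function name `find_country_longest_first` is hypothetical.

```python
# Stand-in for [c.name for c in pycountry.countries]; an assumption for illustration.
COUNTRY_NAMES = ["Niger", "Nigeria", "India", "Guinea", "Guinea-Bissau"]

def find_country_longest_first(text, names=COUNTRY_NAMES):
    """Return the longest country name that occurs in text, or None."""
    lowered = text.lower()
    # Longer names first, so "Nigeria" is tested before its prefix "Niger".
    for name in sorted(names, key=len, reverse=True):
        if name.lower() in lowered:
            return name
    return None

title = "Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"
print(find_country_longest_first(title))  # Nigeria
```

This is still a substring test, so a word such as "Nigerian" would also return Nigeria.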
2

One regex approach would be to build an alternation containing all target countries to be found. Then, use re.findall on the input text to find any possible matches:

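import re
import pycountry

# Escape each name so characters such as "." or "(" are matched literally.
regex = r'\b(?:' + '|'.join(re.escape(c.name) for c in pycountry.countries) + r')\b'

def findCountry(stringText):
    # Return every country name found in the text, ignoring case.
    return re.findall(regex, stringText, flags=re.IGNORECASE)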
Tim Biegeleisen
  • It returns an empty list. A small change is required to run the program: inside the join method we should write country.name for country in pycountry.countries, as it requires text instead of a Country object. In the final version, when I pass my string to findall, it returns an empty list instead of Nigeria. – Talib Daryabi May 31 '21 at 05:11
  • @TalibDaryabi Check the updated answer and try running the regex search in case insensitive mode. – Tim Biegeleisen May 31 '21 at 05:13
  • it still returns me an empty list. I run the code like this: regex = r'\b(?:' + '|'.join(country.name.lower() for country in pycountry.countries) + ')\b' countries = re.findall(regex, title, flags=re.IGNORECASE) – Talib Daryabi May 31 '21 at 05:16
  • title is the string that has Nigeria in it – Talib Daryabi May 31 '21 at 05:17
  • Lol I have no reading comprehension apparently :D Sorry... – Amadan May 31 '21 at 07:18
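The empty list reported in the comments is consistent with the closing `')\b'` being written without the `r` prefix: in an ordinary Python string literal, `'\b'` is a backspace character, not a regex word boundary. A minimal sketch of the difference, assuming a two-name list in place of `pycountry.countries`:

```python
import re

# Stand-in for the names in pycountry.countries; an assumption for illustration.
names = ["Niger", "Nigeria"]
alternation = "|".join(re.escape(n) for n in names)

broken = r"\b(?:" + alternation + ")\b"   # ")\b" ends with a backspace character
fixed = r"\b(?:" + alternation + r")\b"   # r")\b" ends with a word boundary

title = "Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study"
print(re.findall(broken, title, flags=re.IGNORECASE))  # []
print(re.findall(fixed, title, flags=re.IGNORECASE))   # ['Nigeria']
```

With the word boundary in place, "Niger" no longer matches inside "Nigeria", because the character after "Niger" is a letter.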
2

The problem here is that in tests for occurrence anywhere in the string, so Niger is found inside Nigeria. You could swap the operands on either side of in, but that fixes the Nigeria case only, not the others. You can use == instead, which avoids partial matches in all cases.

def findCountry(stringText):
    for country in pycountry.countries:
        if country.name.lower() == stringText.lower():
            return country.name
    return None
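For reference, a small check of how the two operators behave on the title from the question. The variable names are chosen for illustration only.

```python
title = "Nigeria: Hotspot Network LTD Rural Telephony Feasibility Study".lower()

# `in` is a substring test, so the shorter name also matches.
niger_in_title = "niger" in title
nigeria_in_title = "nigeria" in title
print(niger_in_title, nigeria_in_title)  # True True

# `==` compares whole strings, so it is True only when the text is exactly the name.
title_equals_name = "nigeria" == title
name_equals_name = "nigeria" == "Nigeria".lower()
print(title_equals_name, name_equals_name)  # False True
```

So `==` fits when the input has already been reduced to a candidate name, for example a single token or field, rather than a full title.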
moshfiqrony
0

I got the correct answer like this:

def findCountry(stringText):
    countries = sorted([country.name for country in pycountry.countries] , key=lambda x: -len(x))
    for country in countries:
        if country.lower() in stringText.lower():
            return country
    return None

This follows @Amadan's solution in this question.

Talib Daryabi