
I'm trying to pass a concatenated list of strings as the regular expression to re.findall:

re.findall(regex, string)

But the result is just a list of two tuples filled with empty strings.

re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
# [('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]

Where locations is a list like this:

['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', ...]

A manual test works as expected:

print(re.findall('miami|zika', 'Zika Outbreak Hits Miami'.lower()))
# ['zika', 'miami']

But I don't know what's wrong with concatenating locations to create one big regex. Could its size be the problem? locations holds 24588 elements.

I'm currently creating the locations list from what geonamescache offers as cities and countries:

import geonamescache

gc = geonamescache.GeonamesCache()
countries = [country["name"].lower() for country in list(gc.get_countries().values())]
cities    = [city["name"].lower() for city in list(gc.get_cities().values())]
locations =  countries + cities

The text which I'm working with looks like this:

Zika Outbreak Hits Miami
Could Zika Reach New York City?
First Case of Zika in Miami Beach
Mystery Virus Spreads in Recife, Brazil
Dallas man comes down with case of Zika
Karol Karol
  • Check your locations list. For example, `re.findall("|".join([str(n) for n in range(100000)]+["miami","zika"]), 'Zika Outbreak Hits Miami'.lower())` works without a problem – Maximilian Janisch Dec 03 '19 at 09:44
  • Also, with such a long regex (which you probably use more than once), I'd advise you to compile it once; this may speed up your search quite a bit – LeoE Dec 03 '19 at 09:52
  • Thanks @MaximilianJanisch, that returns `['zika', 'miami']`, working fine it seems. – Karol Karol Dec 03 '19 at 10:57
  • Thanks @LeoE, didn't know that, I'm gonna keep it in mind ;) – Karol Karol Dec 03 '19 at 10:57
  • So I had a look at it and you have quite a lot of special characters, maybe one of them leads to those results? The characters in your list are ! " # $ % & ' , - / 0 1 2 3 4 5 6 7 8 9 ` a b c d e f g h i j k l m n o p q r s t u v w x y z ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý ÿ ā ă ą ć č ď đ ē ė ę ě ğ ĩ ī ĭ ı ľ ł ń ň ō ŏ ő œ ř ś ş š ţ ť ũ ū ŭ ů ź ż ž ơ ư ș ț ə ̄ ̇ ̧ ̱ а б д е ж з и м н о р у ḍ ḏ ḑ ḥ ḩ ḯ ṅ ṭ ẕ ẖ ạ ả ấ ầ ẩ ắ ằ ế ệ ỉ ị ọ ố ộ ờ ủ ỳ ỹ ‘ ’ Maybe someone knows about it? I don't – LeoE Dec 03 '19 at 13:46
  • The thing is, I tried hardcoding some of those country/city names with special characters and it works fine, meaning it doesn't return empty strings as in the example I showed in the question. Should I normalize all those names with special characters? – Karol Karol Dec 03 '19 at 15:03

1 Answer

Take a look at your locations list and look for empty strings or anomalous location names in the list.

For example, this works as expected:

In [1]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba']

In [2]: import re

In [3]: re.findall("|".join(locations), 'Zika Outbreak Hits Miami'.lower())
Out[3]: []

In [4]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[4]: ['switzerland']

And this doesn't, because there is an empty string at the end of the locations list. An empty alternative matches at every position in the text:

In [5]: locations = ['andorra', 'united arab emirates', 'afghanistan', 'antigua and barbuda', 'anguilla', 'albania', 'armenia', 'angola', 'antarctica', 'argentina', 'american samoa', 'austria', 'australia', 'aruba', 'aland islands', 'azerbaijan', 'bosnia and herzegovina', 'barbados', 'bangladesh', 'belgium', 'burkina faso', 'bulgaria', 'bahrain', 'burundi', 'benin', 'saint barthelemy', 'bermuda', 'brunei', 'bolivia', 'bonaire, saint eustatius and saba ', 'brazil', 'bahamas', 'bhutan', 'bouvet island', 'botswana', 'belarus', 'belize', 'canada', 'cocos islands', 'democratic republic of the congo', 'central african republic', 'republic of the congo', 'switzerland', 'ivory coast', 'cook islands', 'chile', 'cameroon', 'china', 'colombia', 'costa rica', 'cuba', '']

In [6]: re.findall("|".join(locations), 'switzerland has lot of mountains'.lower())
Out[6]:
['switzerland',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
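A minimal sketch of how the empty entries could be filtered out before joining. The short `locations` list here is a made-up stand-in for the real one built from geonamescache, and `cleaned` is just an illustrative name:

```python
import re

# Hypothetical sample; the real list is built from geonamescache.
locations = ['switzerland', 'cuba', '', '   ']

# Keep only names that still contain something after stripping whitespace.
cleaned = [name.strip() for name in locations if name.strip()]

# With no empty alternative, the pattern can no longer match the empty string.
matches = re.findall("|".join(cleaned), 'switzerland has lot of mountains')
print(matches)  # ['switzerland']
```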

EDIT

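As expected, the special characters in locations are causing the problem. Unescaped parentheses are interpreted as capturing groups, and when a pattern contains several groups re.findall returns tuples of the group contents (mostly empty strings) instead of the matched text. You can use the following code to find the offending entries; it's mostly places like these that interfere with the regular expression: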

In [21]: [l for l in locations if l.find('(') >= 0]
Out[21]:
['zürich (kreis 11) / seebach',
 'zürich (kreis 11) / oerlikon',
 'zürich (kreis 10) / höngg',
 'zürich (kreis 4) / aussersihl',
 'zürich (kreis 10) / wipkingen',
 'zürich (kreis 11) / affoltern',
 'zürich (kreis 2) / wollishofen',
 'zürich (kreis 3) / sihlfeld',
 'zürich (kreis 6) / unterstrass',
 'zürich (kreis 9) / albisrieden',
 'zürich (kreis 9) / altstetten',
 'stadt winterthur (kreis 1)',
 'zürich (kreis 12)',
 'seen (kreis 3)',
 'zürich (kreis 3)',
 'zürich (kreis 11)',
 'zürich (kreis 9)',
 'oberwinterthur (kreis 2)',
 'zürich (kreis 10)',
 'zürich (kreis 2)',
 'zürich (kreis 8)',
 'zürich (kreis 7)',
 'zürich (kreis 6)',
 'wetter (ruhr)',
 'schwedt (oder)',
 'kempten (allgäu)',
 'kelkheim (taunus)',
 'halle (saale)',
 'frankfurt (oder)',
 'brake (unterweser)',
 'v.s.k.valasai (dindigul-dist.)',
 'dainava (kaunas)',
 'miguel alemán (la doce)',
 'jardines de la silla (jardines)',
 'licenciado benito juárez (campo gobierno)',
 'ampliación san mateo (colonia solidaridad)',
 'kalibo (poblacion)',
 'city of milford (balance)',
 'butte-silver bow (balance)']
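To see the effect in isolation, here is a small sketch with just two of the names above plus `miami`. The variable names are illustrative only:

```python
import re

text = 'zika outbreak hits miami'
names = ['frankfurt (oder)', 'halle (saale)', 'miami']

# Unescaped: "(oder)" and "(saale)" become two capturing groups, so findall
# returns one tuple per match holding the groups, both empty for "miami".
unescaped = re.findall("|".join(names), text)
print(unescaped)  # [('', '')]

# Escaped: the parentheses are literal characters, there are no groups,
# and findall returns the matched text itself.
escaped = re.findall("|".join(re.escape(n) for n in names), text)
print(escaped)  # ['miami']
```

With 39 names containing parentheses in the full list, each tuple has 39 entries, which matches the output shown in the question.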

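Create the regex using re.escape to take care of the special characters. Sorting the names by descending length makes the alternation try longer names before shorter ones that are their prefixes. You may also want to match complete words only; otherwise partial words, like brea inside break, will match: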

In [21]: locations_regex = re.compile(r'|'.join([re.escape(l) for l in sorted(locations, key=lambda x:-len(x))]))
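Putting the pieces together, one possible sketch that also handles empty entries and whole-word matching. `build_locations_regex` is a hypothetical helper, not part of `re` or geonamescache, and the sample list is made up. Lookarounds are used instead of `\b` here because `\b` does not behave as a word boundary next to names that end in `)`:

```python
import re

def build_locations_regex(locations):
    """Hypothetical helper: compile one whole-word regex from location names."""
    # Drop empty or whitespace-only names and remove duplicates.
    names = {name.strip() for name in locations if name.strip()}
    # Longest first, so 'miami beach' is tried before 'miami'.
    ordered = sorted(names, key=lambda n: (-len(n), n))
    alternation = "|".join(re.escape(n) for n in ordered)
    # Require that no word character sits directly before or after the match.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

# Made-up sample standing in for the real geonamescache list.
sample = ['miami', 'miami beach', 'brazil', 'brea', '', 'frankfurt (oder)']
locations_regex = build_locations_regex(sample)

print(locations_regex.findall('First Case of Zika in Miami Beach'.lower()))
# ['miami beach']
print(locations_regex.findall('a break in frankfurt (oder)'))
# ['frankfurt (oder)']
```

Compiling once and reusing `locations_regex` for every headline follows the advice from the comments under the question.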
Sharath