
I have a dataset where the title of one column is "What is your location and time zone?"

This has meant that we have entries like

  1. Denmark, CET
  2. Location is Devon, England, GMT time zone
  3. Australia. Australian Eastern Standard Time. +10h UTC.

and even

  1. My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
  2. For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

Is there any way to extract the city, country and time zone from this?

I was thinking of creating an array (from an open-source dataset) with all the country names (including short forms), plus city names and time zones. Then, if any word in the dataset matches a city, country, time zone or short form, it would be filled into a new column in the same dataset and counted.
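
Roughly, I had something like this in mind (just a sketch, assuming the responses live in a pandas DataFrame and using tiny hard-coded lookup sets; the real lists would be loaded from an open dataset):

import pandas as pd

# Tiny hard-coded lookup sets -- the real ones would come from an open dataset
COUNTRIES = {'denmark', 'england', 'australia', 'united kingdom', 'united states'}
CITIES = {'devon', 'eugene', 'seoul', 'london', 'boston'}
TIMEZONES = {'cet', 'gmt', 'edt', 'utc', 'pacific'}

def match_terms(text, vocabulary):
    # Return every known term that appears in the free-text answer
    # (simple substring match; word-boundary matching would be more robust)
    text = text.lower()
    return [term for term in vocabulary if term in text]

df = pd.DataFrame({'answer': ['Denmark, CET',
                              'Location is Devon, England, GMT time zone']})

df['countries'] = df['answer'].apply(lambda t: match_terms(t, COUNTRIES))
df['cities'] = df['answer'].apply(lambda t: match_terms(t, CITIES))
df['timezones'] = df['answer'].apply(lambda t: match_terms(t, TIMEZONES))
df['n_matches'] = (df['countries'].apply(len)
                   + df['cities'].apply(len)
                   + df['timezones'].apply(len))
print(df)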

Is this practical?

=========== REPLY BASED ON NLTK ANSWER ============

Running the same code as alecxe, I get:

Traceback (most recent call last):
  File "E:\SBTF\ntlk_test.py", line 19, in <module>
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>

1 Answer


I would use what Natural Language Processing and nltk have to offer to extract the entities.

Here is an example (heavily based on this gist) which tokenizes and POS-tags each line from a file, chunks the tagged sentences, and recursively looks for NE (named entity) labels in every chunk. More explanation here:

import nltk

def extract_entity_names(t):
    # Recursively collect the text of every subtree labelled 'NE'
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            # Join the tokens of this named-entity chunk into a single string
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            # Not a named entity itself -- recurse into the children
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        # Split each line into sentences, then tokenize and POS-tag them
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        # binary=True labels every named entity simply as 'NE'
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))

        print(entities)

For the sample.txt containing:

Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

It prints:

['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']

The output is not ideal, but might be a good start for you.
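
If you then need to separate the places from the time zones, one rough option is a small hand-made set of time-zone names (just a sketch -- this set is my own assumption and would need extending, e.g. from the tz database):

# Hand-made set of time-zone names/abbreviations -- extend as needed
TIMEZONE_TERMS = {'CET', 'GMT', 'EDT', 'UTC', 'Pacific',
                  'Australian Eastern Standard Time'}

def split_entities(entities):
    # Split the extracted entities into place names and time zones
    places = [e for e in entities if e not in TIMEZONE_TERMS]
    zones = [e for e in entities if e in TIMEZONE_TERMS]
    return places, zones

print(split_entities(['Denmark', 'CET']))  # (['Denmark'], ['CET'])
print(split_entities(['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']))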

    @Racialz `nltk` is often surprising! I am far from being an expert at NLP, but tried to add some more explanation and links to the further reading. Thanks for asking about the details! – alecxe Mar 28 '16 at 03:19
  • Brilliant. I didn't know about NLTK -I will experiment on this and then (hopefully) accept the answer :-) – GeorgeC Mar 28 '16 at 04:07
  • @alecxe I tried to run the code exactly as you have it after installing the library and its databases. I get {{raise URLError('unknown url type: %s' % type)}} in urllib2.py but I am not sure why this is even called! Any ideas on how I can get your code to work? The traceback is in my edited question. – GeorgeC Mar 28 '16 at 22:51
  • @GeorgeC looks like this is your problem: http://stackoverflow.com/questions/35827859/python-nltk-pos-tag-throws-urlerror. Check it out. – alecxe Mar 29 '16 at 13:07
  • doesn't work for `10906 woodley ava granada hills CA` addresses like this – Abhinav Anand Dec 17 '17 at 07:17
  • Quick tip for anyone trying this with a large amount of text, see https://stackoverflow.com/questions/33676526/pos-tagger-is-incredibly-slow - creating PerceptronTagger once at the start is orders of magnitude quicker – barryhunter Mar 17 '18 at 13:07
  • @alecxe Great answer, but in my case it is detecting names also, even words like `delete, height, weight`, is there a way to ignore those words? Thanks – Jeril Apr 26 '18 at 09:21
  • this seems to only work if the text is in English. Any idea for non English text? – catris25 May 11 '21 at 04:05