I am working on a project where I need to extract the locations mentioned in a given text file. I tried the Named Entity Recognition example given here; the code snippet is below. However, it outputs all three entity types: names, locations, and organizations. Is there a way to extract only the locations using Python?

import nltk

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))

        print(entities)

1 Answer

You will need a Named Entity Recognition (NER) step that labels each entity with its type. As called in your code (with binary=True), NLTK's chunker only marks chunks as named entities (NE); it does not tell you which kind of entity each one is.
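
Before training anything, it may be worth trying NLTK's chunker with binary=False: in that mode it labels each chunk with an entity type (such as PERSON, ORGANIZATION, GPE, or LOCATION) instead of the generic NE. The following is a minimal sketch that adapts the function from the question. The names extract_locations, locations_in_text, and LOCATION_LABELS are made up for illustration, and treating GPE plus LOCATION as "locations" is an assumption you may want to adjust.

```python
# Assumption: "locations" means GPE (geo-political entities such as cities
# and countries) plus LOCATION (e.g. rivers, mountains). Adjust as needed.
LOCATION_LABELS = {"GPE", "LOCATION"}


def extract_locations(t, wanted=LOCATION_LABELS):
    """Recursively collect the text of chunks whose label is in `wanted`."""
    locations = []
    # Leaves of the chunk tree are plain (word, tag) tuples without label().
    if hasattr(t, "label"):
        if t.label() in wanted:
            locations.append(" ".join(child[0] for child in t))
        else:
            for child in t:
                locations.extend(extract_locations(child, wanted))
    return locations


def locations_in_text(text):
    """Run the question's pipeline, but with typed labels (binary=False)."""
    import nltk  # imported here so extract_locations has no hard dependency

    sentences = nltk.sent_tokenize(text)
    tagged = [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]
    found = []
    for tree in nltk.ne_chunk_sents(tagged, binary=False):
        found.extend(extract_locations(tree))
    return found
```

Calling locations_in_text(line) for each line of sample.txt would replace the loop body from the question. This assumes the NLTK data packages for the tokenizer, tagger, and chunker are already downloaded, and the built-in chunker may still miss or mislabel some places.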

If you're looking for a quicker solution, I would recommend the geotext package:

from geotext import GeoText

sentence = "my foreigner New York Canberra Sydney Australia, Japan, Fujimoto Godfather Avatar"
places = GeoText(sentence)

# Python 3 print calls; each attribute is a list of matched names
print(places.countries)
print(places.cities)
usernamenotfound