3

Hi, I'm trying to extract the full company name from a text description of the company using bert-base-NER. I'm also open to other methods, but I couldn't find one. The model tags the ORG entities correctly, but it tags them per word/token, so I can't extract the full company name without concatenating the pieces myself.

Is there an easier way or model to do this?

Here is my code:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# text1 holds the company description string (defined elsewhere)
ner_results = nlp(text1)
print(ner_results)

Here is my output for one text string:

[{'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11}, {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87}]
Dana

2 Answers

2

I faced a similar issue and solved it with a different model, "xlm-roberta-large-finetuned-conll03-english", which worked much better for me than the one you're using. Combined with aggregation_strategy="simple" in the pipeline, it returns the complete organization name rather than the broken pieces. Feel free to test the code below, which extracts the list of organization names from a PDF or Word document. If you find it useful, please accept the answer by clicking the tick button.

from transformers import pipeline
from pdfminer.high_level import extract_text
import docx2txt
import time

start = time.time()
model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
# aggregation_strategy="simple" merges sub-word tokens and B-/I- tags into whole entities
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)


def text_extraction(file):
    """Extract text from a PDF or Word (.docx) file."""
    if file.endswith(".pdf"):
        return extract_text(file)
    resume_text = docx2txt.process(file)
    if resume_text:
        return resume_text.replace('\t', ' ')
    return None


# Organisation names extraction
def org_name(file):
    # Extract the complete text of the document
    extracted_text = text_extraction(file)
    if not extracted_text:
        return []
    entities = token_classifier(extracted_text)
    # Keep only the entities whose "entity_group" is "ORG"
    orgs = [item["word"] for item in entities if item["entity_group"] == "ORG"]
    # Remove duplicates and empty strings
    final = [name for name in set(orgs) if name]
    print(final)
    return final


org_name("your file name")

end = time.time()

print("The time of execution of above program is :", round((end - start), 2))
Vivek Menon M
0

Alternatively, you can keep your original code and model and set aggregation_strategy="simple" on the pipeline:

In your code, add it as an extra parameter: nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

This was suggested in the post "How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?"
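
If you would rather keep the raw per-token output, the grouping can also be approximated by hand. The snippet below is only a minimal sketch of that idea, not the library's actual implementation: merge_entities is a hypothetical helper name, and it assumes each entity starts with a B- tag, continues with I- tags of the same label, and that "##" marks a word-piece continuation, as in the output shown in the question.

```python
def merge_entities(tokens):
    """Group per-token NER results (B-/I- tags) into whole entities.

    Hypothetical helper: a rough stand-in for what an aggregating
    pipeline returns, built only from the fields in the raw output.
    """
    entities = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        word = tok["word"]
        is_piece = word.startswith("##")
        text = word[2:] if is_piece else word
        last = entities[-1] if entities else None

        if last is not None and prefix == "I" and last["entity_group"] == label:
            # Continuation: glue word pieces directly, otherwise add a space
            # only when the character offsets show a gap in the source text.
            gap = not is_piece and tok["start"] > last["end"]
            last["word"] += (" " if gap else "") + text
            last["end"] = tok["end"]
        else:
            # A B- tag (or a label change) starts a new entity.
            entities.append({
                "entity_group": label,
                "word": text,
                "start": tok["start"],
                "end": tok["end"],
            })
    return entities


# Raw per-token output copied from the question (scores omitted).
ner_results = [
    {"entity": "B-ORG", "word": "Orion", "start": 0, "end": 5},
    {"entity": "I-ORG", "word": "Metal", "start": 6, "end": 11},
    {"entity": "I-ORG", "word": "##s", "start": 11, "end": 12},
    {"entity": "I-ORG", "word": "Limited", "start": 13, "end": 20},
    {"entity": "B-LOC", "word": "Australia", "start": 78, "end": 87},
]

entities = merge_entities(ner_results)
orgs = [e["word"] for e in entities if e["entity_group"] == "ORG"]
print(orgs)  # ['Orion Metals Limited']
```

In practice the pipeline parameter is the simpler route; a manual merge like this is mainly useful if you need custom rules for joining tokens.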

Jeru Luke
DDA