
I am trying to convert a dataset from spaCy's NER format to Flair's format using this code:

import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy v2; in v3 this is spacy.training.offsets_to_biluo_tags

nlp = spacy.load("en_core_web_md")

ents = TRAIN_DATA

with open("flair_ner.txt","w") as f:
    for sent,tags in ents:
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
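        for word,tag in zip(doc, biluo):
            if word.is_space:
                continue  # skip newline/whitespace tokens, which would break the one-token-per-line format
            f.write(f"{word.text} {tag}\n")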
        f.write("\n")

I am getting an overlapping-entities error:

ValueError: [E103] Trying to set conflicting doc.ents: '(1155, 1199, 'Email Address')' and '(1143, 1240, 'Links')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

Here is the example:

[('Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Programming Languages: C, C++, Java, .net, php.\n• Web Designing: HTML, XML\n• Operating Systems: Windows […] Windows Server 2003, Linux.\n• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.\n\nhttps://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN',
  {'entities': [(1155, 1199, 'Email Address'),
    (1143, 1240, 'Links'),
    (743, 1141, 'Skills'),
    (729, 733, 'Graduation Year'),
    (706, 728, 'Location'),
    (675, 703, 'College Name'),
    (631, 673, 'Degree'),
    (625, 630, 'Graduation Year'),
    (614, 623, 'College Name'),
    (606, 612, 'Degree'),
    (458, 479, 'Location'),
    (438, 454, 'Companies worked at'),
    (104, 148, 'Email Address'),
    (62, 68, 'Location'),
    (0, 14, 'Name')]}),
gph
BAKYAC
  • The error message specifies why there's a problem: a token can only be part of one named entity. – Sofie VL Dec 07 '20 at 09:47
  • How can I resolve this error across the whole dataset? – BAKYAC Dec 07 '20 at 09:55
  • You'll have to write a script that filters/merges/cleans your annotations before you feed them to spaCy. In general, overlapping entities are usually indicative of an inconsistent annotation scheme. In this specific case, it seems weird that something would be a LINK and an EMAIL at the same time, for instance. – Sofie VL Dec 07 '20 at 12:19
  • Is there an easy-to-use labeling tool for the Flair NER format, so that even the HR staff can help me with the task? – BAKYAC Dec 08 '20 at 08:22
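
To see how widespread the problem is before cleaning anything, the dataset can be scanned for overlapping character spans without involving spaCy at all. The following is only a sketch: `find_overlaps` is a hypothetical helper name, and the one-item `TRAIN_DATA` is a stand-in that reuses the offsets from the error message with dummy text. It compares character offsets, whereas spaCy's check is per token, so spans that do not align with token boundaries could still conflict without being reported here.

```python
from itertools import combinations

def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that share at least one character."""
    overlaps = []
    for a, b in combinations(sorted(entities), 2):
        # After sorting, a starts no later than b, so they overlap
        # exactly when b starts before a ends (end offsets are exclusive).
        if b[0] < a[1]:
            overlaps.append((a, b))
    return overlaps

# Hypothetical stand-in for the real TRAIN_DATA (dummy text, offsets from the error message).
TRAIN_DATA = [
    ("x" * 1240, {"entities": [(1155, 1199, "Email Address"),
                               (1143, 1240, "Links"),
                               (0, 14, "Name")]}),
]

for i, (text, annotations) in enumerate(TRAIN_DATA):
    for a, b in find_overlaps(annotations["entities"]):
        print(f"example {i}: {a} overlaps {b}")
```

Running this over the full dataset gives a list of conflicting pairs per example, which helps decide whether the annotation scheme itself needs revisiting or whether a simple filter is enough.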

1 Answer


From Prodigy/spaCy support:

> The entity recognizer is constrained to predict only non-overlapping, non-nested spans. The training data should obey the same constraint. If you like, you could have two sentences with the different annotations in your data. I’m not sure whether this would hurt or help your performance, though.

The error message shows that the 'Email Address' span (1155 to 1199) lies entirely inside the 'Links' span (1143 to 1240), so the two overlap. You need to resolve overlapping annotations like these before your conversion code will run.
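
One possible way to do that, sketched below, is to keep the longest span whenever two annotations overlap and drop the rest. This is an assumption about what you want, not a rule from spaCy: `drop_overlapping` and `clean_dataset` are hypothetical helper names, and preferring the longest span would keep 'Links' and discard the nested 'Email Address' in your example. If some labels matter more than others, sort by a label priority instead.

```python
def drop_overlapping(entities):
    """Greedily keep the longest spans; discard any span overlapping one already kept."""
    kept = []
    # Longest span first; ties are broken by the earlier start offset.
    for start, end, label in sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0])):
        # End offsets are exclusive, so touching spans do not count as overlapping.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

def clean_dataset(data):
    """Return a copy of spaCy-style training data with overlapping entities removed."""
    return [
        (text, {**annotations, "entities": drop_overlapping(annotations["entities"])})
        for text, annotations in data
    ]

# The two conflicting spans from the error message, plus one unrelated span.
example = [(1155, 1199, "Email Address"), (1143, 1240, "Links"), (0, 14, "Name")]
print(drop_overlapping(example))
```

You would then iterate over `clean_dataset(TRAIN_DATA)` instead of `TRAIN_DATA` in the conversion loop. If you already have `Span` objects rather than raw offsets, spaCy's `spacy.util.filter_spans` applies a similar longest-span-first rule.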

iEriii