
Sample of training data (input.json); the full JSON file has only 100 resumes.

{"content": "Resume 1 text in french","annotation":[{"label":["diplomes"],"points":[{"start":1233,"end":1423,"text":"1995-1996 : Lycée  Dar Essalam                                                                     Rabat     \n                        Baccalauréat scientifique option sciences Expérimentales "}]},{"label":["diplomes"],"points":[{"start":1012,"end":1226,"text":"1996-1998 : Faculté des Sciences                                                                          Rabat \n                  C.E.U.S (Certificat des Etudes universitaires Supérieurs) option physique et chimie "}]},{"label":["diplomes"],"points":[{"start":812,"end":1004,"text":"1999-2000 : Faculté des Sciences                                                                           Rabat \n                            Licence es sciences physique  option électronique  "}]},{"label":["diplomes"],"points":[{"start":589,"end":805,"text":"2002-2004 : Faculté des Sciences                                                                           Rabat  \nDESA ((Diplôme des Etudes Supérieures Approfondies)  en informatique   \n\ntélécommunication multimédia "}]},{"label":["diplomes"],"points":[{"start":365,"end":582,"text":"2014-2017 : Institut National des Postes et Télécommunications INPT                 Rabat                                           \n                             Thèse de doctorat en informatique et télécommunication  "}]},{"label":["adresse"],"points":[{"start":122,"end":157,"text":"Rue 34 n 17 Hay Errachad Rabat Maroc"}]}],"extras":null,"metadata":{"first_done_at":1586140561000,"last_updated_at":1586140561000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}


{"content": "Resume 2 text in french","annotation":[{"label":["diplomes"],"points":[{"start":1251,"end":1345,"text":"Lycée Oued El Makhazine - Meknès \n\n- Bachelier mention très bien \n- Option : Sciences physiques"}]},{"label":["diplomes"],"points":[{"start":1122,"end":1231,"text":"Classes préparatoires Moulay Youssef - Rabat \n\n- Admis au Concours National Commun CNC \n- Option : PCSI - PSI "}]},{"label":["diplomes"],"points":[{"start":907,"end":1101,"text":"Institut National des Postes et Télécommunications INPT - Rabat \n\n- Ingénieur d’État en Télécommunications et technologies de l’information \n- Option : MTE Management des Télécoms de l’entreprise"}]},{"label":["adresse"],"points":[{"start":79,"end":133,"text":"94, Hay El Izdihar, Avenue El Massira, Ouislane, MEKNES"}]}],"extras":null,"metadata":{"first_done_at":1586126476000,"last_updated_at":1586325851000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}


{"content": "Resume 3 text in french","annotation":[{"label":["adresse"],"points":[{"start":2757,"end":2804,"text":"N141 Av. El Hansali Agharass \nBouargane \nAgadir "}]},{"label":["diplomes"],"points":[{"start":262,"end":369,"text":"2009-2010 :  Baccalauréat Scientifique, option : Sciences Physiques au Lycée Qualifiant \nIBN MAJJA à Agadir."}]},{"label":["diplomes"],"points":[{"start":125,"end":259,"text":"2010-2016 :  Diplôme d’Ingénieur d’Etat, option : Génie Informatique, à l’Ecole  \nNationale des Sciences Appliquées d’Agadir (ENSAA).  "}]}],"extras":null,"metadata":{"first_done_at":1586141779000,"last_updated_at":1586141779000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}


{"content": "Resume 4 text in french","annotation":[{"label":["diplomes"],"points":[{"start":505,"end":611,"text":"2012 Baccalauréat Sciences Expérimentales option Sciences Physiques, Lycée Hassan Bno \nTabit, Ouled Abbou. "}]},{"label":["diplomes"],"points":[{"start":375,"end":499,"text":"2012–2015 Diplôme de licence en Informatique et Gestion Industrielle, IGI, Faculté des sciences \net Techniques, Settat, LST. "}]},{"label":["diplomes"],"points":[{"start":272,"end":367,"text":"2015–2017 Master Spécialité BioInformatique et Systèmes Complexes, BISC, ENSA , Tanger, \n\nBac+5."}]},{"label":["adresse"],"points":[{"start":15,"end":71,"text":"246 Hay Pam Eljadid OULED ABBOU  \n26450 BERRECHID, Maroc "}]}],"extras":null,"metadata":{"first_done_at":1586127374000,"last_updated_at":1586327010000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}


{"content": "Resume 5 text in french","annotation":null,"extras":null,"metadata":{"first_done_at":1586139511000,"last_updated_at":1586139511000,"sec_taken":0,"last_updated_by":"wP21IMXff9TFSNLNp5v0fxbycFX2","status":"done","evaluation":"NONE"}}

Code that transforms this JSON data to spaCy format:


import json
import pickle

input_file = "input.json"
output_file = "output.json"

training_data = []

# each line of input.json is one resume exported by the annotation tool
with open(input_file, 'r', encoding="utf8") as f:
    lines = f.readlines()

for line in lines:
    data = json.loads(line)
    text = data['content']
    entities = []
    # some resumes (e.g. Resume 5 above) have "annotation": null
    for annotation in data['annotation'] or []:
        point = annotation['points'][0]
        labels = annotation['label']
        if not isinstance(labels, list):
            labels = [labels]

        for label in labels:
            # the tool's 'end' offset is inclusive; spaCy expects an exclusive end
            entities.append((point['start'], point['end'] + 1, label))

    training_data.append((text, {"entities": entities}))

# despite the .json extension, the output is a pickle of the spaCy-format tuples
with open(output_file, 'wb') as fp:
    pickle.dump(training_data, fp)
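
Before training, a quick sanity check (not part of my original script) is to verify that each converted slice of the text matches what the annotation tool recorded. This is a minimal sketch, assuming the offsets in input.json were produced against the full resume text:

import json

with open("input.json", 'r', encoding="utf8") as f:
    for i, line in enumerate(f, start=1):
        data = json.loads(line)
        text = data['content']
        for annotation in data['annotation'] or []:
            point = annotation['points'][0]
            # the tool's 'end' is inclusive, hence the + 1
            if text[point['start']:point['end'] + 1] != point['text']:
                print(f"Resume {i}: offsets do not match the annotated text "
                      f"for label {annotation['label']}")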

Code for training the spaCy model:

import random
import datetime as dt
from pathlib import Path

import spacy


def train_spacy():
    TRAIN_DATA = training_data
    nlp = spacy.load('fr_core_news_md')  # load the pretrained French pipeline (it already has an 'ner' pipe)
    # nlp.create_pipe works for built-ins that are registered with spaCy;
    # only needed when starting from a blank model without an 'ner' pipe:
    # if 'ner' not in nlp.pipe_names:
    #     ner = nlp.create_pipe('ner')
    #     nlp.add_pipe(ner, last=True)

    ner = nlp.get_pipe("ner")

    # add the new entity labels to the NER component
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of the other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(20):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    # drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(itn, dt.datetime.now(), losses)

    output_dir = "new-model"
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = "addr_edu"  # rename the model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)


train_spacy()
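
A quick sanity check after this run (a sketch, not in my original code) is to confirm the new labels actually made it into the saved pipeline's NER component:

import spacy

nlp = spacy.load("new-model")
print(nlp.pipe_names)              # should include 'ner'
print(nlp.get_pipe("ner").labels)  # should include 'adresse' and 'diplomes'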

When I test the model, this is what happens:

import spacy
nlp = spacy.load("new-model")

doc = nlp("Text of a Resume already trained on")
print(doc.ents)
# It prints out this ()

doc = nlp("Text of a Resume not trained on")
print(doc.ents)
# It prints out this ()

What I expect it to give me are the adresse (address) and diplomes (academic degree) entities present in the text.

Edit 1

The sample data (input.json) at the very top is part of the data I get after annotating resumes on a text annotation platform.

I then have to transform it to spaCy format so I can feed it to the model for training.

This is what a resume with annotations looks like when I give it to the model:

training_data = [(
    'Dr.XXXXXX XXXXXXX                                  \n\n \nEmail  : XXXXXXXXXXXXXXXXXXXXXXXX \n\nGSM   : XXXXXXXXXX \n\nAdresse : XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \n \n\n \n\nETAT CIVIL \n \n\nSituation de famille : célibataire  \n\nNationalité              : Marocaine \n\nNé le                        : 10 février 1983 \n\nLieu de naissance   : XXXXXXXXXXXXXXXX \n\n \n FORMATION \n\n• 2014-2017 : Institut National des Postes et Télécommunications INPT                 Rabat                                           \n                             Thèse de doctorat en informatique et télécommunication  \n \n\n• 2002-2004 : Faculté des Sciences                                                                           Rabat  \nDESA ((Diplôme des Etudes Supérieures Approfondies)  en informatique   \n\ntélécommunication multimédia \n \n\n• 1999-2000 : Faculté des Sciences                                                                           Rabat \n                            Licence es sciences physique  option électronique  \n \n\n•  1996-1998 : Faculté des Sciences                                                                          Rabat \n                  C.E.U.S (Certificat des Etudes universitaires Supérieurs) option physique et chimie \n \n\n• 1995-1996 : Lycée  Dar Essalam                                                                     Rabat     \n                        Baccalauréat scientifique option sciences Expérimentales \n\nSTAGE  DE FORMATION \n\n• Du 03/03/2004  au 17/09/2004 : Stage de Projet de Fin d’Etudes à l’ INPT  pour  \nl’obtention du  DESA                (Diplôme des Etudes Supérieures Approfondies). \n\n                                  Sujet : AGENT RMON DANS LA GESTION DE RESEAUX. \n\n• Du 03/06/2002  au 17/01/2003: Stage de Projet de Fin d’année à INPT \n  Sujet : Mécanisme d’Authentification Kerbéros Dans un Réseau Sans fils sous Redhat. \n\nPUBLICATION  \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "New \n\nstrategy to optimize the performance of epidemic routing protocol." International Journal \n\nof Computer Applications, vol. 92, N.7, 2014.  \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "New \n\nStrategy to optimize the Performance of Spray and wait Routing Protocol." International \n\nJournal of Wireless and Mobile Networks v.6, N.2, 2014. \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "Impact of \n\nmobility models on Supp-Tran optimized DTN Spray and Wait routing." International \n\njournal of Mobile Network Communications & Telematics ( IJMNCT), Vol.4, N.2, April \n\n2014. \n\n✓ M. Ababou, R. Elkouch, M. Bellafkih and N. Ababou, "AntProPHET: A new routing \n\nprotocol for delay tolerant networks," Proceedings of 2014 Mediterranean Microwave \n\nSymposium (MMS2014), Marrakech, 2014, IEEE. \n\nmailto:XXXXXXXXXXXXXXXXXXXXXXXX\n\n\n✓ Ababou, Mohamed, et al. "BeeAntDTN: A nature inspired routing protocol for delay \n\ntolerant networks." Proceedings of 2014 Mediterranean Microwave Symposium \n\n(MMS2014). IEEE, 2014. \n\n✓ Ababou, Mohamed, et al. "ACDTN: A new routing protocol for delay tolerant networks \n\nbased on ant colony." Information Technology: Towards New Smart World (NSITNSW), \n\n2015 5th National Symposium on. IEEE, 2015. \n\n✓ Ababou, Mohamed, et al. "Energy-efficient routing in Delay-Tolerant Networks." RFID \n\nAnd Adaptive Wireless Sensor Networks (RAWSN), 2015 Third International Workshop \n\non. IEEE, 2015. 
\n\n✓ Ababou, Mohamed, et al. "Energy efficient and effect of mobility on ACDTN routing \n\nprotocol based on ant colony." Electrical and Information Technologies (ICEIT), 2015 \n\nInternational Conference on. IEEE, 2015. \n\n✓ Mohamed, Ababou et al. "Fuzzy ant colony based routing protocol for delay tolerant \n\nnetwork." 10th International Conference on Intelligent Systems: Theories and Applications \n\n(SITA). IEEE, 2015. \n\nARTICLES EN COURS DE PUBLICATION \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou.”Dynamic \n\nUtility-Based Buffer Management Strategy for Delay-tolerant Networks. “International \n\nJournal of Ad Hoc and Ubiquitous Computing, 2017. ‘accepté par la revue’ \n\n✓ Ababou, Mohamed, Rachid Elkouch, and Mostafa Bellafkih and Nabil Ababou. "Energy \n\nefficient routing protocol for delay tolerant network based on fuzzy logic and ant colony." \n\nInternational Journal of Intelligent Systems and Applications (IJISA), 2017. ‘accepté par la \n\nrevue’ \n\nCONNAISSANCES EN INFORMATIQUE \n\n  \n\nLANGUES \n\nArabe,  Français, anglais. \n\nLOISIRS ET INTERETS PERSONNELS \n\n \n\nVoyages, Photographie, Sport (tennis de table, footing), bénévolat. \n\nSystèmes :  UNIX, DOS, Windows  \n\nLangages :  Séquentiels ( C, Assembleur), Requêtes (SQL), WEB (HTML, PHP, MySQL, \n\nJavaScript), Objets (C++, DOTNET,JAVA) , I.A. (Lisp, Prolog) \n\nLogiciels :  Open ERP (Enterprise Resource Planning), AutoCAD, MATLAB, Visual \n\nBasic, Dreamweaver MX. \n\nDivers :  Bases de données, ONE (Opportunistic Network Environment), NS3,  \n\nArchitecture réseaux,Merise,... \n\n',
    {'entities': [(1233, 1424, 'diplomes'), (1012, 1227, 'diplomes'), (812, 1005, 'diplomes'), (589, 806, 'diplomes'), (365, 583, 'diplomes'), (122, 158, 'adresse')]}
)]
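
One thing worth checking on this example is whether every entity span lines up with spaCy's token boundaries: a span that starts or ends in the middle of a token (or inside a run of whitespace) gets a '-' (missing) tag and contributes no training signal. A minimal sketch, assuming spaCy v2.x:

import spacy
from spacy.gold import biluo_tags_from_offsets  # spacy.training.offsets_to_biluo_tags in v3

nlp = spacy.blank('fr')
for text, annotations in training_data:
    doc = nlp.make_doc(text)
    tags = biluo_tags_from_offsets(doc, annotations['entities'])
    # '-' marks tokens covered by a span whose boundaries do not match the tokenization
    if '-' in tags:
        print("This resume has entity spans that are misaligned with the tokenization.")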

I agree it's better to train the model on just one resume and test with that same resume to see if it learns.

I've changed the code; the difference is that I now train a blank model.

def train_spacy():
    TRAIN_DATA = training_data
    nlp = spacy.blank('fr')          # start from a blank French pipeline
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
    ner = nlp.get_pipe("ner")

    # add the new entity labels to the NER component
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    optimizer = nlp.begin_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update(
                [text],  # batch of texts
                [annotations],  # batch of annotations
                drop=0.1,  # dropout - make it harder to memorise data
                sgd=optimizer,  # callable to update weights
                losses=losses
            )
        print(itn, dt.datetime.now(), losses)

    return nlp
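
With this version the trained pipeline is returned rather than written to disk, so the test further down can be reproduced roughly like this (a sketch, using the names defined above):

nlp = train_spacy()
doc = nlp(training_data[0][0])      # the same resume used for training
for ent in doc.ents:
    print(ent.label_, repr(ent.text[:60]))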

Here are the losses I get during training: [screenshot of training losses]

Here is the test, run on the same resume used for training: [screenshot of test output]

The good thing is that I no longer get the empty tuple; the model actually recognized something correctly, in this case the "adresse" entity.

But it won't recognize the "diplomes" entity, of which there are 5 in this resume, even though the model was trained on it.
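
One thing I still need to rule out is the whitespace around the boundaries of the annotated spans. Below is a minimal preprocessing sketch (a hypothetical helper, not yet part of my pipeline) that trims leading and trailing whitespace from every entity span before training:

def trim_entity_spans(data):
    """Trim leading/trailing whitespace from every entity span so the offsets
    line up with token boundaries (hypothetical helper)."""
    cleaned = []
    for text, annotations in data:
        entities = []
        for start, end, label in annotations['entities']:
            while start < end and text[start].isspace():
                start += 1
            while end > start and text[end - 1].isspace():
                end -= 1
            entities.append((start, end, label))
        cleaned.append((text, {'entities': entities}))
    return cleaned

training_data = trim_entity_spans(training_data)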

  • I think you want to do `nlp = French` instead of loading `'fr_core_news_md'`, which already has an NER component with labels you won't be interested in. – Sofie VL Apr 09 '20 at 08:06
  • Further, I think your data may be formatted wrongly, but it's difficult to tell as you blinded the texts. What I would propose is that you create one single artificial sentence with annotations, that you can share here, and run your training pipeline on that one example only. The loss should decrease, and at the end the prediction on that example should be perfect. You can take inspiration from this unit test that does exactly that: https://github.com/explosion/spaCy/blob/develop/spacy/tests/parser/test_ner.py#L280 – Sofie VL Apr 09 '20 at 08:10
  • @SofieVL Thanks, I made the changes; they are under "edit 1". It's getting better: it recognized an entity, but not all of the entities present in the resume, even though the test is on the same resume it's trained on. I tried to explain it under "edit 1", what do you think? –  Apr 09 '20 at 19:03
  • Thanks for the code snippet! That's very helpful. I can reproduce your results - diplomes are indeed not recognized correctly. I think this may be due to all the whitespace characters inside the entities. Is there any chance you could clean that up in preprocessing? – Sofie VL Apr 30 '20 at 09:10
  • @MedAchraf I am experiencing the same problem, did you figure out how to solve it? PS: I am also working on French CVs / resumes. Also, which tool did you use to create your train/test dataset? – BAKYAC Nov 29 '20 at 15:40
