I am trying to POS-tag French text using the Hugging Face Transformers library. In English I was able to do so. Given a sentence such as:

The weather is really great. So let us go for a walk.

the result is:

    token   feature
0   The     DET
1   weather NOUN
2   is      AUX
3   really  ADV
4   great   ADJ
5   .       PUNCT
6   So      ADV
7   let     VERB
8   us      PRON
9   go      VERB
10  for     ADP
11  a       DET
12  walk    NOUN
13  .       PUNCT

Does anyone have an idea how a similar thing could be achieved for French?

This is the code I used for the English version in a Jupyter notebook:

!git clone https://github.com/bhoov/spacyface.git
!python -m spacy download en_core_web_sm

from transformers import pipeline
import numpy as np
import pandas as pd

nlp = pipeline('feature-extraction')
sequence = "The weather is really great. So let us go for a walk."
result = nlp(sequence)
# Display the shape of the embeddings.
# In this case there are 16 tokens and the embedding size is 768
np.array(result).shape

import sys
sys.path.append('spacyface')

from spacyface.aligner import BertAligner

alnr = BertAligner.from_pretrained("bert-base-cased")
tokens = alnr.meta_tokenize(sequence)
token_data = [{'token': tok.token, 'feature': tok.pos} for tok in tokens]
pd.DataFrame(token_data)

The output of this notebook is above.

gil.fernandes
  • Currently I'm not really sure what you want to know. Do you want to adjust `result` or your pandas dataframe related code to be able to handle French? – cronoik Jul 07 '20 at 20:17
  • @cronoik yes, the intent is to perform feature extraction using French text. The pandas dataframe is actually a distraction here. Given a sentence like "Aujourd'hui, il fait vraiment beau" I want to recognize "fait" as a verb, "beau" as an adjective and so on. I have tried the model Camembert (https://huggingface.co/models?search=camembert) but the results were not that good. – gil.fernandes Jul 08 '20 at 08:03
  • Maybe I'm wrong, but I wouldn't call that feature extraction. I would call it POS tagging, which requires a `TokenClassificationPipeline`. As far as I know, Hugging Face doesn't have a [pretrained model](https://huggingface.co/models?filter=french) for that task, but you can fine-tune a CamemBERT model with [run_ner](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py). – cronoik Jul 08 '20 at 08:22
  • @cronoik sorry about using the wrong terminology (I am new to this area and the NLP jargon is somewhat confusing to me). The code you have above in the question seems to work well in terms of POS tagging for English. I just wanted the same thing for French. I have used a small library that aligns Hugging Face Transformers model tokenizations with linguistic metadata provided by spaCy: https://github.com/bhoov/spacyface But somehow this library does not support languages other than English. – gil.fernandes Jul 08 '20 at 08:58
  • Well, the problem is that you always need a trained model for that. CamemBERT is trained, but not for the downstream task of POS tagging. Therefore you need to fine-tune it yourself. In case you are not bound to Hugging Face, you can look for French POS tagging. There are plenty of solutions available (for example [link](https://stackoverflow.com/questions/44468300/how-to-pos-tag-a-french-sentence)), but I can't tell you anything about their performance. – cronoik Jul 08 '20 at 09:03
  • @cronoik many thanks for that hint and also helping me to use the right terminology. I have updated the question and also will look at alternative solutions. Still I would like to know how you can perform this task with Hugging Face Transformers, so the question is still valid to me. – gil.fernandes Jul 08 '20 at 09:13

1 Answer

We ended up training a model for POS tagging (part-of-speech tagging) with the Hugging Face Transformers library. The resulting model is available here:

https://huggingface.co/gilf/french-postag-model?text=En+Turquie%2C+Recep+Tayyip+Erdogan+ordonne+la+reconversion+de+Sainte-Sophie+en+mosqu%C3%A9e

You can see how it assigns POS tags on the model page linked above. If you have the Hugging Face Transformers library installed, you can try it out in a Jupyter notebook with this code:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("gilf/french-postag-model")
model = AutoModelForTokenClassification.from_pretrained("gilf/french-postag-model")

nlp_token_class = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp_token_class('En Turquie, Recep Tayyip Erdogan ordonne la reconversion de Sainte-Sophie en mosquée')

This is the result on the console:

[{'entity_group': 'PONCT', 'score': 0.11994100362062454, 'word': '[CLS]'},
 {'entity_group': 'P', 'score': 0.9999570250511169, 'word': 'En'},
 {'entity_group': 'NPP', 'score': 0.9998692870140076, 'word': 'Turquie'},
 {'entity_group': 'PONCT', 'score': 0.9999769330024719, 'word': ','},
 {'entity_group': 'NPP', 'score': 0.9996993020176888, 'word': 'Recep Tayyip Erdogan'},
 {'entity_group': 'V', 'score': 0.9997997283935547, 'word': 'ordonne'},
 {'entity_group': 'DET', 'score': 0.9999586343765259, 'word': 'la'},
 {'entity_group': 'NC', 'score': 0.9999251365661621, 'word': 'reconversion'},
 {'entity_group': 'P', 'score': 0.9999709129333496, 'word': 'de'},
 {'entity_group': 'NPP', 'score': 0.9985082149505615, 'word': 'Sainte'},
 {'entity_group': 'PONCT', 'score': 0.9999614357948303, 'word': '-'},
 {'entity_group': 'NPP', 'score': 0.9461128115653992, 'word': 'Sophie'},
 {'entity_group': 'P', 'score': 0.9999079704284668, 'word': 'en'},
 {'entity_group': 'NC', 'score': 0.8998225331306458, 'word': 'mosquée [SEP]'}]
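
Note that the result still contains the special markers `[CLS]` and `[SEP]` (the last entry is `'mosquée [SEP]'`). If you want a token/feature table like the English one in the question, something along these lines might work. It is only a minimal sketch: `to_token_rows` and `SPECIAL_TOKENS` are hypothetical names made up for illustration, and the sample list is a shortened copy of the output above, so nothing here needs the model to be downloaded.

```python
# Shortened copy of the pipeline output shown above, used as sample input.
sample_output = [
    {'entity_group': 'PONCT', 'score': 0.11994100362062454, 'word': '[CLS]'},
    {'entity_group': 'P', 'score': 0.9999570250511169, 'word': 'En'},
    {'entity_group': 'NPP', 'score': 0.9998692870140076, 'word': 'Turquie'},
    {'entity_group': 'NC', 'score': 0.8998225331306458, 'word': 'mosquée [SEP]'},
]

# Assumption: only the two BERT-style markers seen in the output need removing.
SPECIAL_TOKENS = ('[CLS]', '[SEP]')


def to_token_rows(entities):
    """Hypothetical helper: convert pipeline entries into token/feature rows."""
    rows = []
    for entity in entities:
        word = entity['word']
        # Remove special markers wherever they appear in the grouped word.
        for special in SPECIAL_TOKENS:
            word = word.replace(special, '')
        word = word.strip()
        # Skip entries that consisted only of a special marker.
        if word:
            rows.append({'token': word, 'feature': entity['entity_group']})
    return rows


rows = to_token_rows(sample_output)
for row in rows:
    print(f"{row['token']:<10}{row['feature']}")
```

Passing `rows` to `pd.DataFrame(rows)` should then give a table with `token` and `feature` columns, in the same layout as the English example. The tag names (`P`, `NPP`, `NC`, ...) are whatever the model emits; the sketch does not map them to the tag set used in the English output.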
gil.fernandes