I am trying to do part-of-speech (POS) tagging for French using the Hugging Face Transformers library. For English I was able to do so. Given a sentence such as:
The weather is really great. So let us go for a walk.
the result is:
    token    feature
0   The      DET
1   weather  NOUN
2   is       AUX
3   really   ADV
4   great    ADJ
5   .        PUNCT
6   So       ADV
7   let      VERB
8   us       PRON
9   go       VERB
10  for      ADP
11  a        DET
12  walk     NOUN
13  .        PUNCT
Does anyone know how the same thing could be achieved for French?
This is the code I used for the English version in a Jupyter notebook:
!git clone https://github.com/bhoov/spacyface.git
!python -m spacy download en_core_web_sm
from transformers import pipeline
import numpy as np
import pandas as pd
nlp = pipeline('feature-extraction')
sequence = "The weather is really great. So let us go for a walk."
result = nlp(sequence)
# Display the shape of the embeddings computed for the sequence.
# In this case there are 16 tokens and the embedding size is 768.
np.array(result).shape
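# Make the cloned spacyface repository importable
import sys
sys.path.append('spacyface')
from spacyface.aligner import BertAligner
# Align spaCy's linguistic metadata with the BERT tokenization
alnr = BertAligner.from_pretrained("bert-base-cased")
tokens = alnr.meta_tokenize(sequence)
# Collect each token with its POS tag and display the result as a table
token_data = [{'token': tok.token, 'feature': tok.pos} for tok in tokens]
pd.DataFrame(token_data)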
The output of this notebook is above.
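As a rough illustration of the alignment idea behind the token/feature table (this is not spacyface's actual implementation), the sketch below copies a word-level POS tag onto every sub-word piece of that word. The `toy_wordpieces` splitter, the `align_tags` helper, and the hand-written French tags are all assumptions made up for this example. A real setup would presumably take the tags from a French tagger and the pieces from the tokenizer of a French or multilingual model.

```python
# Toy illustration only: propagate word-level POS tags to sub-word pieces.
# The tags and the splitting rule are made up for this example.

def toy_wordpieces(word, max_len=4):
    """Hypothetical stand-in for a sub-word tokenizer: split a word into
    chunks of at most `max_len` characters, marking continuations with '##'."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]

def align_tags(tagged_words):
    """Give every sub-word piece the POS tag of the word it came from."""
    rows = []
    for word, pos in tagged_words:
        for piece in toy_wordpieces(word):
            rows.append({"token": piece, "feature": pos})
    return rows

# Hand-written French example; the tags are assumed, not produced by a model.
tagged = [("Le", "DET"), ("temps", "NOUN"), ("est", "AUX"),
          ("magnifique", "ADJ"), (".", "PUNCT")]
rows = align_tags(tagged)
for row in rows:
    print(row["token"], row["feature"])
```

The resulting list of dictionaries has the same `token`/`feature` keys as `token_data` in the notebook, so it could be passed to `pd.DataFrame` in the same way.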