I've got a problem when tokenizing text with the Moses tokenizer. Accented characters such as 'é' or 'è' are being split apart during tokenization: the tokenizer emits the base letter and the accent mark as separate tokens.
Steps:
--> Read the text from a .docx file
--> Tokenize the text with the Moses tokenizer
```python
from docx import Document
from sacremoses import MosesTokenizer

file_docx = Document("file.docx")  # path to the source document
tokenizer = MosesTokenizer(lang='fr')

for i in file_docx.paragraphs:
    text = i.text
    tok = tokenizer.tokenize(text)
    print(text)
    print(tok)
```
Result:
J'atteste que j'étais présent pour toute la procédure.
['J', '\\'', 'atteste', 'que', 'j', '\\'', 'e', '́', 'tais', 'pre', '́', 'sent', 'pour', 'toute', 'la', 'proce', '́', 'dure', '.']
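The pattern in the output (a bare 'e' followed by a stray '́' mark) suggests the text extracted from the .docx file is in decomposed Unicode form (NFD), where 'é' is stored as the letter 'e' plus a separate combining acute accent (U+0301), so the tokenizer treats the combining mark as its own character. If that is the cause, normalizing the string to NFC before tokenizing should restore the precomposed characters. A minimal sketch using only the standard library (the sample string here mimics the decomposed docx text):

```python
import unicodedata

# "étais" as it likely comes out of the .docx: decomposed (NFD) form,
# i.e. 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "j'e\u0301tais pre\u0301sent"

# Recompose to NFC so each accented letter is a single code point.
composed = unicodedata.normalize("NFC", decomposed)

# In NFD, "é" is two code points; in NFC it is one.
print(len("e\u0301"))                                  # 2
print(len(unicodedata.normalize("NFC", "e\u0301")))    # 1
print(composed)
```

Applied to the loop above, that would be `tok = tokenizer.tokenize(unicodedata.normalize("NFC", text))`.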