
I've got a problem when tokenizing text with the Moses tokenizer. The tokenizer treats accented characters such as 'é' or 'è' as spaces and special characters when tokenizing.

Steps:

  1. Read text from a .docx file
  2. Tokenize the text with the Moses tokenizer

    from docx import Document
    from nltk.tokenize.moses import MosesTokenizer

    file_docx = Document('document.docx')  # placeholder path to the .docx file
    tokenizer = MosesTokenizer(lang='fr')  # Moses language codes are lowercase

    for i in file_docx.paragraphs:
        text = i.text
        tok = tokenizer.tokenize(text)
        print(text)
        print(tok)
    

Results: J'atteste que j'étais présent pour toute la procédure.

['J', '\\'', 'atteste', 'que', 'j', '\\'', 'e', '́', 'tais', 'pre', '́', 'sent', 'pour', 'toute', 'la', 'proce', '́', 'dure', '.']
  • The default encoding that the MosesTokenizer in NLTK expects is `utf-8`. Is there a way to set encoding for `docx.Document`? It looks like it's reading some `latin-1` encoding and then feeding it to the MosesTokenizer that expects `utf8`. – alvas Feb 07 '18 at 03:24
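The separated marks in the output ('e', '́') suggest the text python-docx returns is in decomposed Unicode form (NFD), where 'é' is stored as 'e' plus a combining acute accent (U+0301), which the tokenizer then splits apart. A possible workaround (a sketch, not a confirmed fix, since it depends on how the document was authored) is to normalize each paragraph to NFC before tokenizing:

```python
import unicodedata

# Text as it may come out of the .docx file: NFD, with combining accents
decomposed = "j'e\u0301tais pre\u0301sent"

# Recompose to NFC so 'e' + U+0301 becomes the single code point 'é'
composed = unicodedata.normalize('NFC', decomposed)

print(composed)  # j'étais présent, with precomposed accents
print(len(decomposed) > len(composed))  # NFC string has fewer code points
```

With NFC input, the tokenizer would see 'étais' and 'présent' as single words instead of splitting them at the combining marks; the normalized string can then be passed to `tokenizer.tokenize(composed)`.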
