
I've got a problem when tokenizing text with the Moses tokenizer. The tokenizer treats accented characters such as 'é' or 'è' as spaces and special characters when tokenizing.

Steps:

  1. Read text from a .docx file
  2. Tokenize the text with the Moses tokenizer

    from docx import Document
    from nltk.tokenize.moses import MosesTokenizer

    file_docx = Document('document.docx')  # placeholder path to the .docx file
    tokenizer = MosesTokenizer(lang='fr')  # Moses language codes are lowercase

    for i in file_docx.paragraphs:
        text = i.text
        tok = tokenizer.tokenize(text)
        print(text)
        print(tok)
    

Results: J'atteste que j'étais présent pour toute la procédure.

['J', '\\'', 'atteste', 'que', 'j', '\\'', 'e', '́', 'tais', 'pre', '́', 'sent', 'pour', 'toute', 'la', 'proce', '́', 'dure', '.']
  • The default encoding that the MosesTokenizer in NLTK expects is `utf-8`. Is there a way to set encoding for `docx.Document`? It looks like it's reading some `latin-1` encoding and then feeding it to the MosesTokenizer that expects `utf8`. – alvas Feb 07 '18 at 03:24
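The separated marks in the output ('e', '́') suggest the text python-docx returns is in decomposed Unicode form (NFD), where 'é' is stored as 'e' plus a combining acute accent (U+0301), which the tokenizer then splits apart. A possible workaround (a sketch, not a confirmed fix, since it depends on how the document was authored) is to normalize each paragraph to NFC before tokenizing:

```python
import unicodedata

# Text as it may come out of the .docx file: NFD, with combining accents
decomposed = "j'e\u0301tais pre\u0301sent"

# Recompose to NFC so 'e' + U+0301 becomes the single code point 'é'
composed = unicodedata.normalize('NFC', decomposed)

print(composed)  # j'étais présent, with precomposed accents
print(len(decomposed) > len(composed))  # NFC string has fewer code points
```

With NFC input, the tokenizer would see 'étais' and 'présent' as single words instead of splitting them at the combining marks; the normalized string can then be passed to `tokenizer.tokenize(composed)`.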
