Reading text that has parentheses

Question

I am trying to read a .docx file into Python. The file is organized into two tables (it's messy), one with Chinese characters and the other with English. However, when I read the text from these tables, some of the text in parentheses and brackets does not show up.

I read the text from the .docx file as follows:

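from docx import Document

doc = Document('2003 PPC for corpus.docx')

# First row of the first table: cell 0 holds the Chinese text, cell 1 the English text
chinese_text = doc.tables[0].rows[0].cells[0].text
print(chinese_text)
# Encoding to UTF-8 makes this print as a bytes literal (b"...")
english_text = doc.tables[0].rows[0].cells[1].text.encode('utf-8')
print(english_text)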

These print statements then show:

［］女士们，先生们，

and

b"Good morning ladies and gentlemen, we are very honor

My question is: why am I not reading the characters inside the square brackets in the Chinese text? And why am I not reading the "(3)" at the start of the English text?

The image of the table isn't clear. Post a sample document containing the table. — Ilayaraja, Jan 23 '18 at 13:49
In my experience, I've found that the python-docx package is not 100% functional. While it works for most documents, it fails to capture some text, especially if those documents were sourced from a template, copied and pasted from other documents or sources, or a combination of both. The conclusion was to basically parse the document into XML and try to work from there. — Scratch'N'Purr, Jan 23 '18 at 13:58
@Scratch'N'Purr, I've been looking at some possible libraries to parse a .docx as XML, but I haven't found one that seems to work. I understand that a .docx is essentially XML, but the Python xml library doesn't let me import it: "ParseError: not well-formed (invalid token): line 1, column 2" — JahKnows, Jan 23 '18 at 14:34
