
I am trying to read a .docx file into Python. The file is organized into two tables (it's messy), one with Chinese characters and the other with English. However, when I read the text from these tables, some of the text inside brackets and parentheses does not show up.

[Image of the table]

I read the text from the .docx file as follows:

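from docx import Document

doc = Document('2003 PPC for corpus.docx')

# First row of the first table: cell 0 holds the Chinese text, cell 1 the English text
chinese_text = doc.tables[0].rows[0].cells[0].text
print(chinese_text)
# Encoding to UTF-8 returns bytes, so this prints a b"..." literal
english_text = doc.tables[0].rows[0].cells[1].text.encode('utf-8')
print(english_text)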

These print statements then show

[]女士们,先生们,

and

b"Good morning ladies and gentlemen, we are very honor


My question is: why am I not reading the characters inside the square brackets in the Chinese text? And why am I not reading the "(3)" at the start of the English text?

JahKnows
  • The image of the table isn't clear. Post a sample document containing the table. – Ilayaraja Jan 23 '18 at 13:49
  • In my experience, I've found that the python-docx package is not 100% functional. While it works for most documents, it fails to capture some text, especially if those documents were sourced from a template, copied and pasted from other documents or sources, or a combination of both. The conclusion was to basically parse the document into XML and try to work from there. – Scratch'N'Purr Jan 23 '18 at 13:58
  • @Scratch'N'Purr, I've been looking at some possible libraries to parse a .docx as XML, but I haven't found one that seems to work. I understand that a .docx is essentially XML, but the Python xml library doesn't let me import it: "ParseError: not well-formed (invalid token): line 1, column 2" – JahKnows Jan 23 '18 at 14:34
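
On the ParseError in the last comment: a .docx file is a ZIP archive rather than a single XML file, so an XML parser cannot read it directly. The main document body is stored in the archive member word/document.xml. Below is a minimal sketch, using only the standard library, that reads one table row from that XML and collects every w:t text node in each cell, including text nested inside wrapper elements such as w:ins (tracked insertions) or w:hyperlink. The function name cell_texts and its parameters are hypothetical, and whether such wrappers are really the cause of the missing text in this particular file is an assumption that can't be confirmed without the document.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def cell_texts(docx_file, table_index=0, row_index=0):
    """Return the text of each cell in one row of a top-level table.

    Hypothetical helper: it gathers every w:t node below each cell, so text
    wrapped in elements such as w:ins or w:hyperlink is included as well.
    Text marked as deleted (w:delText) is not collected.
    """
    # A .docx is a ZIP package; the body lives in word/document.xml
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(W + "body")
    table = body.findall(W + "tbl")[table_index]  # direct children only
    row = table.findall(W + "tr")[row_index]

    texts = []
    for cell in row.findall(W + "tc"):
        # One string per paragraph, joined with newlines
        paragraphs = [
            "".join(t.text or "" for t in p.iter(W + "t"))
            for p in cell.iter(W + "p")
        ]
        texts.append("\n".join(paragraphs))
    return texts
```

Calling cell_texts('2003 PPC for corpus.docx') and comparing the result with the python-docx output above would show whether the bracketed text is present in the XML but sitting inside nested elements.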

0 Answers