Docx content and formatting extraction in python

Question

I am trying to parse a docx file and take specific elements based on whether or not a certain word is bolded. If this is the text in the document:

Foo: Hello

Boo: Blah Blah

•Blah

Choo: Hello

I would want to scan, line by line, and take all the text after the bolded word until the next bolded word.

As of right now I am using an XML parser that parses based on newline characters. I cannot find anything in the Zipfile or the individual lines that would give me metadata like that.

Is it possible to do this?

You are not looking for "file parsing in Python with Formatting" but rather for "Docx content and formatting extraction in python" or something similar. Did you look at [python-docx](https://github.com/mikemaccana/python-docx/)? — niko, Jun 29 '12 at 16:07

score 0 · Answer 1 · answered Oct 20 '13 at 10:13

I'd use a higher-level library that supports reading docx files rather than parsing the XML document.

One library that looks up to the task is python-docx.

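If you're using Jython, Apache POI is another option. Note that its XWPF component is the one that reads .docx files; HWPF handles the older binary .doc format.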
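As a rough sketch of the python-docx route, the snippet below walks the runs of each paragraph and collects the non-bold text that follows each bold label. It assumes the current `python-docx` package (`Document`, `paragraph.runs`, `run.bold`, `run.text`), whose API may differ from the older version linked in the comment above. The helper names `iter_runs` and `group_by_bold` are made up for illustration. Treating an inherited (`None`) bold setting as "not bold" is a simplification, so labels that are bold only through a paragraph or character style would be missed.

```python
def iter_runs(path):
    """Yield (text, is_bold) pairs for every run in the document body."""
    # Assumption: the current python-docx package (pip install python-docx).
    # Imported lazily so group_by_bold() can be tried without it installed.
    from docx import Document

    for paragraph in Document(path).paragraphs:
        for run in paragraph.runs:
            # run.bold is True, False, or None (None = inherited from a style).
            # This sketch treats None as "not bold".
            yield run.text, bool(run.bold)
        # Mark the paragraph boundary so the collected text keeps its line breaks.
        yield "\n", False


def group_by_bold(runs):
    """Return [(label, body), ...]: each bold label with the text after it."""
    sections = []      # [label, body] pairs in document order
    in_label = False   # True while consecutive bold runs are being merged
    for text, is_bold in runs:
        if is_bold and (text.strip() or in_label):
            if in_label:
                # Word often splits one bold label across several runs.
                sections[-1][0] += text
            else:
                sections.append([text, ""])
            in_label = True
        else:
            # Text before the first bold label has no owner and is dropped.
            if sections:
                sections[-1][1] += text
            in_label = False
    return [(label.strip(), body.strip()) for label, body in sections]
```

With a real file this would be called as `group_by_bold(iter_runs("example.docx"))`, where `example.docx` is a placeholder path. Splitting the work into two functions keeps the grouping logic independent of the library, so it can be checked with plain `(text, is_bold)` tuples.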

IceArdor
