I am trying to parse a docx file and extract specific elements based on whether or not a certain word is bolded. If this is the text in the document:

Foo: Hello

Boo: Blah Blah

•Blah

•Blah

Choo: Hello

I would want to scan, line by line, and take all the text after the bolded word until the next bolded word.

Right now I am using an XML parser that splits on newline characters. I cannot find anything in the Zipfile or in the individual lines that would give me formatting metadata like that.

Is it possible to do this?

Matt
    You are not looking for "file parsing in Python with Formatting" but rather for "Docx content and formatting extraction in python" or something similar. Did you look at [python-docx](https://github.com/mikemaccana/python-docx/)? – niko Jun 29 '12 at 16:07

1 Answer

I'd use a higher-level library that supports reading docx files rather than parsing the XML document.

One library that looks up to the task is python-docx.

If you're using Jython, Apache POI is another option. Its XWPF component reads .docx files; HWPF only covers the older binary .doc format.
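If you would rather stay with the raw XML, the bold flag is there, just not on lines: WordprocessingML stores text in runs (`<w:r>`), and a directly bolded run carries a `<w:b/>` element inside its `<w:rPr>`. Below is a minimal sketch using only the standard library. The function names `is_bold` and `sections_by_bold` are my own, and the sketch assumes the bold comes from direct formatting (not inherited from a paragraph or character style) and that the runs sit directly under the paragraph, so treat it as a starting point rather than a complete parser.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W_VAL = f"{{{W_NS}}}val"


def is_bold(run):
    """Return True if the run has direct bold formatting (<w:b/> in <w:rPr>)."""
    b = run.find("w:rPr/w:b", NS)
    if b is None:
        return False
    # <w:b w:val="false"/> (or "0"/"off") explicitly switches bold off.
    return b.get(W_VAL, "true") not in ("0", "false", "off")


def sections_by_bold(docx_file):
    """Return (bold_label, following_text) pairs in document order.

    docx_file may be a path or a binary file-like object.
    Text that appears before the first bold run is ignored.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    sections = []       # list of [label, body] pairs
    prev_bold = False
    for para in root.iter(f"{{{W_NS}}}p"):
        for run in para.findall("w:r", NS):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            if not text:
                continue
            bold = is_bold(run)
            if bold and prev_bold and sections:
                # Word often splits one bold word over several runs.
                sections[-1][0] += text
            elif bold:
                sections.append([text, ""])
            elif sections:
                sections[-1][1] += text
            prev_bold = bold
        # End of paragraph: keep the line break and stop merging labels.
        if sections:
            sections[-1][1] += "\n"
        prev_bold = False

    return [(label.strip(), body.strip()) for label, body in sections]
```

Calling `sections_by_bold("report.docx")` (the file name is just a placeholder) on a document laid out like the one in the question would give one pair per bold label, with the body running up to the next bold run. Note that automatic list bullets come from the numbering definitions, not from the text, so they will not show up in the body.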

IceArdor