Why python-docx?

Question

import docx
import collections
listofnames = list()
filename = 'Missing_Assignments.docx'
filehandle = docx.Document(filename)
studentinfo = filehandle.paragraphs

for student in studentinfo: 
    if len(student.text) > 1 or len(student.text) > 20:
        listofnames.append(student.text)

for name in listofnames: 
    if name.startswith('Assignment'):
        listofnames.remove(name)
    

counts = collections.Counter(listofnames)
counts = dict(counts)

filehandle.add_paragraph('\n')

for name,count in counts.items(): 
    filehandle.add_paragraph(name + ' ' + str(count))
    filehandle.save(filename)

print('Complete!')

More of a learning/efficiency question...if this is not generally considered appropriate please let me know what forums may be more suitable.

Question is, why do I have to use docx? I'm used to creating a simple handle like:

filehandle = open(filename)

And being able to iterate through a file this way. I was receiving all kinds of UNICODE errors before using python-docx libraries. Just seems slightly more complicated because I have to use their verbage as opposed to directly iterating through each line of text like I normally would.

Also, does anyone know of a way break off the counting function shown here? I want to count the amount of times a name appears for various missing assignments but only for that period. Other periods may have students with the same name so this would complicate the counting?

“*why do I have to use docx?*” Certainly nobody is forcing you to, but have you tried creating a simple file handle for a Word-formatted document and seeing what the results are? Hint: `docx` files are largely unworkable until parsed from their proprietary format. — esqew, May 05 '21 at 10:53
Yeah, when I tried I got a UNICODE error. I've always worked with .txt in tutorials as I'm still learning. I don't mind using docx, I just wanted to understand more about why it was necessary from a computer science standpoint. Thanks! — James Norris, May 05 '21 at 11:06

score 1 · Accepted Answer · answered May 05 '21 at 11:01

You should use python-docx when you have a docx file.

You can open a simple handle to parse a plain text file, but docx is not a plain text format.

It is actually a ZIP archive containing XML files. You can read more about that here: https://docs.fileformat.com/word-processing/docx/

You can create your own parser for that, the standard is actually open, but there are interoperability glitches. You can read more about it here: https://brattahlid.wordpress.com/2012/05/08/is-docx-really-an-open-standard/

To summarize, python-docx takes off the burden of parsing the file format for you.

I see. This was exactly what I was looking for, but I struggle to find this info from simple google searches (surprising to me). Thanks for the in depth explanation, really helps! — James Norris, May 05 '21 at 11:10

score 0 · Answer 2 · answered May 05 '21 at 10:58

A docx file is actually a zip file. Try renaming it to xyz.zip and unzipping it. You'll find a number of files in a number of folders, most of which are XML files. These XML files have a specific format, created by MS for docx files.

You could try to do all that directly in python (or whatever language you want) including what all the XML attributes and elements mean and how the different files relate to each other, or you could use a library which already knows that.

As to your second question you have presented no data which would allow me to even guess, so I won't.

Why python-docx?

2 Answers2