I need to convert a Word document into HTML and then save it to a .txt file with lines no longer than 100 characters (a later process won't pick up more than 255 characters unless they are on separate lines).

So far, I've managed to convert the .docx file into HTML and write that string to a .txt file (a better solution is welcome). However, I can't figure out how to split the output into lines. Is there a built-in function that can do this?

import mammoth

with open(r'C:\Users\uXXXXXX\Downloads\Test_Script.docx', "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value # The generated HTML
    messages = result.messages # Any messages, such as warnings during conversion
    
with open(r'C:\Users\uXXXXXX\Downloads\Output.txt', 'w') as text_file:
    text_file.write(html)
Javi Torre

1 Answer

In that case, you can just insert a newline after every 100 characters:

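html = "..."  # the HTML string produced by the conversion
i = 100  # index at which the next newline is inserted
while i < len(html):
    # Insert a newline so that each line holds at most 100 characters
    html = html[:i] + "\n" + html[i:]
    i += 101  # skip 100 characters plus the newline just inserted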
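
The loop above rebuilds the whole string on every insertion, and it can cut in the middle of a word or tag. As a rough sketch of two alternatives using only the standard library: slicing into fixed-size pieces, or `textwrap.wrap`, which breaks at whitespace. The helper names `split_fixed` and `wrap_at_spaces` and the sample string are made up for illustration; in your script, `html` would be `result.value`.

```python
import textwrap


def split_fixed(text, width=100):
    """Cut text into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def wrap_at_spaces(text, width=100):
    """Break text at whitespace so that no line exceeds `width` characters."""
    return textwrap.wrap(text, width=width)


# Hypothetical sample standing in for the converted HTML string
html = "<p>" + "word " * 60 + "</p>"

fixed_lines = split_fixed(html)
wrapped_lines = wrap_at_spaces(html)

# Text ready to be passed to text_file.write(...)
output_text = "\n".join(wrapped_lines)
```

Breaking at whitespace is probably the safer choice for HTML, since a newline that replaces a space does not change how the text renders, whereas a newline inserted mid-word shows up as a space. Note that `textwrap.wrap` collapses runs of whitespace by default, so it may not be suitable if the HTML contains `<pre>` blocks.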
Captain Trojan
  • Thanks. This may need a separate question, but is it possible to preserve the Word color formatting when converting the document to HTML? – Javi Torre Mar 12 '21 at 16:23
  • @JaviTorre Possibly. A .docx file is a zip archive of XML files (Office Open XML, published as ECMA-376 and ISO/IEC 29500), so with a proper library, yes. You might want to ask an additional question for this. I just know how to insert characters into strings in python :D – Captain Trojan Mar 12 '21 at 16:46