0

I have some HTML formatted text I've got with BeautifulSoup. I'd like to convert all italic (tag i), bold (b) and links (a href) to Word format via docx run command.

I can make a paragraph:

p = document.add_paragraph('text')

I can ADD next sequence as bold/italic:

p.add_run('bold').bold = True
p.add_run('italic.').italic = True

Intuitively, I could find all particular tags (ie. soup.find_all('i')) and then watch indices and then concatenate partial strings...

...but maybe there's a better, more elegant way?

I don't want libraries or solutions that just convert a html page to word and save them. I want a little more control.

I got nowhere with a dictionary. Here is the code and visual wrong (from code) and right (desired) result:

from docx import Document
import os
from bs4 import BeautifulSoup

html = '<a href="http://someurl.you">hi, I am link</a> this is some nice regular text. <i> oooh, but I am italic</i> ' \
        ' or I can be <b>bold</b> '\
        ' or even <i><b>bold and italic</b></i>'

def get_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    tags = {}
    tags["i"] = soup.find_all("i")
    tags["b"] = soup.find_all("b")

    return tags


def make_test_word():
    document = Document()

    document.add_heading('Demo HTML', 0)

    soup = BeautifulSoup(html, "html.parser")

    p = document.add_paragraph(html)

    # p.add_run('bold').bold = True
    # p.add_run(' and some ')
    # p.add_run('italic.').italic = True

    file_name="demo_html.docx"
    document.save(file_name)
    os.startfile(file_name)


make_test_word()

enter image description here

Jongware
  • 22,200
  • 8
  • 54
  • 100
okkko
  • 1,010
  • 1
  • 13
  • 22

2 Answers2

1

I just wrote a bit of code to convert the text from a tkinter Text widget over to a word document, including any bold tags that the user can add. This isn't a complete solution for you, but it may help you to start toward a working solution. I think you're going to have to do some regex work to get the hyperlinks transferred to the word document. Stacked formatting tags may also get tricky. I hope this helps:

from docx import Document

html = 'HTML string <b>here</b>.'

html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]
doc = Document()
p = doc.add_paragraph()
for run in html:
    if run.startswith('<b>'):
        run = run.lstrip('<b>')
        runner = p.add_run(run)
        runner.bold = True
    elif run.startswith('</b>'):
        run = run.lstrip('</b>')
        runner = p.add_run(run)
    else:
        p.add_run(run)
doc.save('test.docx')

I came back to it and made it possible to parse out multiple formatting tags. This will keep a tally of what formatting tags are in play in a list. At each tag, a new run is created, and formatting for the run is set by the current tags in play.

from docx import Document
import re
import docx
from docx.shared import Pt
from docx.enum.dml import MSO_THEME_COLOR_INDEX

def add_hyperlink(paragraph, text, url):
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id, )

    # Create a w:r element and a new w:rPr element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')

    # Join all the xml elements together add add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)

    # Create a new Run object and add the hyperlink into it
    r = paragraph.add_run ()
    r._r.append (hyperlink)

    # A workaround for the lack of a hyperlink style (doesn't go purple after using the link)
    # Delete this if using a template that has the hyperlink style in it
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True

    return hyperlink

html = '<H1>I want to</H1> <u>convert HTML <a href="http://www.google.com">to docx</a> in <b>bold and <i>bold italic</i></b>.</u>'

html = html.split('<')
html = [html[0]] + ['<'+l for l in html[1:]]
tags = []
doc = Document()
p = doc.add_paragraph()
for run in html:
    tag_change = re.match('(?:<)(.*?)(?:>)', run)
    if tag_change != None:
        tag_strip = tag_change.group(0)
        tag_change = tag_change.group(1)
        if tag_change.startswith('/'):
            if tag_change.startswith('/a'):
                tag_change = next(tag for tag in tags if tag.startswith('a '))
            tag_change = tag_change.strip('/')
            tags.remove(tag_change)
        else:
            tags.append(tag_change)
    else:
        tag_strip = ''
    hyperlink = [tag for tag in tags if tag.startswith('a ')]
    if run.startswith('<'):
        run = run.replace(tag_strip, '')
        if hyperlink:
            hyperlink = hyperlink[0]
            hyperlink = re.match('.*?(?:href=")(.*?)(?:").*?', hyperlink).group(1)
            add_hyperlink(p, run, hyperlink)
        else:
            runner = p.add_run(run)
            if 'b' in tags:
                runner.bold = True
            if 'u' in tags:
                runner.underline = True
            if 'i' in tags:
                runner.italic = True
            if 'H1' in tags:
                runner.font.size = Pt(24)
    else:
        p.add_run(run)
doc.save('test.docx')

Hyperlink function thanks to this question. My concern here is that you will need to manually code for every HTML tag that you want to carry over to the docx. I imagine that could be a large number. I've given some examples of tags you may want to account for.

jimmiesrustled
  • 691
  • 5
  • 6
0

Alternatively, you can just save your html code as a string and do:

from htmldocx import HtmlToDocx

new_parser = HtmlToDocx()
new_parser.parse_html_file("html_filename", "docx_filename")
#Files extensions not needed, but tolerated
Synthase
  • 5,849
  • 2
  • 12
  • 34