How to identify superscript and/or subscript in text in Python

Question

I have a document from which I need to extract the strings formatted as superscript or subscript, using Python. I have explored the docx library, which lets me add superscript and subscript text, but I am wondering how I am supposed to extract such strings. I have already googled but could not find a good solution.

from docx import Document
document = Document()

p = document.add_paragraph('Normal text with ')

super_text = p.add_run('superscript text')
super_text.font.superscript = True

p.add_run(' and ')

sub_text = p.add_run('subscript text')
sub_text.font.subscript = True

document.save('test.docx')

Yes. I have used [link](https://stackoverflow.com/questions/40747398/how-to-add-text-in-superscript-or-subscript-with-python-docx) for creating document with superscript. But I have not found anything for identifying superscript. — abhijeet, Oct 24 '17 at 18:29

score 0 · Answer 1 · answered Mar 20 '18 at 08:25

A docx file is a zip archive, and the document body is stored inside it as XML in word/document.xml. You can read that XML first, then use a regex to capture the superscript and subscript values.

Here's an example:

import re
import sys
import zipfile


def get_docx_xml(path):
    """Take the path of a docx file, return word/document.xml as a str."""
    with zipfile.ZipFile(path) as document:
        # read() returns bytes; decode so the str regexes below can match
        return document.read('word/document.xml').decode('utf-8')


def get_superscript_subscript(xml_content):
    """Return a dict mapping 'superscript' and 'subscript' to lists of matches."""
    # Each match is a tuple: (formatted text, first word of the following run).
    # The pattern assumes a <w:lang> element follows <w:vertAlign> in the run
    # properties, so runs without it are not matched.
    pattern = (r'<w:vertAlign w:val="%s"/><w:lang w:val="[\S\s]*?"/></w:rPr>'
               r'<w:t>([\S]+)</w:t></w:r>[\s\S]*?'
               r'<w:t xml:space="preserve">([\s]*[\S]*)[\s\S]*?</w:t></w:r>')
    superscript = re.findall(pattern % 'superscript', xml_content)
    subscript = re.findall(pattern % 'subscript', xml_content)
    return {"superscript": superscript, "subscript": subscript}


if __name__ == '__main__':
    xml_content = get_docx_xml(sys.argv[1])  # path to the docx file
    superscripts_subscripts = get_superscript_subscript(xml_content)
    print(superscripts_subscripts)

The output is a dictionary whose values are lists of tuples. In each tuple, the first item is the superscript/subscript text and the second item is the first word that follows it:

{'subscript': [('28', ')'), ('28', ' score'), ('28', ' were'), ('28', ' sum'), ('28', ' and'), ('28', ' score'), ('28', ')')], 'superscript': [('28', ')'), ('28', ' score'), ('28', ' were'), ('28', ' sum'), ('28', ' and'), ('28', ' score'), ('28', ')')]}
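The regex above depends on the exact order of elements inside each run, so it can miss runs that are laid out differently. As a rough alternative sketch (not part of the original answer), the same word/document.xml can be parsed with the standard library's ElementTree, checking each run for a w:vertAlign element. The function name find_vert_align_runs is made up for illustration, and the sketch assumes the formatting is applied directly on the run rather than through a named style.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}


def find_vert_align_runs(path):
    """Hypothetical helper: collect the text of superscript/subscript runs.

    Only direct run formatting (<w:rPr><w:vertAlign .../>) is inspected;
    formatting inherited from a style is not detected.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    found = {'superscript': [], 'subscript': []}
    for run in root.iter('{%s}r' % W):
        align = run.find('w:rPr/w:vertAlign', NS)
        if align is None:
            continue  # run has no vertical alignment set
        kind = align.get('{%s}val' % W)
        if kind in found:
            # A run may hold several <w:t> elements; join their text
            text = ''.join(t.text or '' for t in run.findall('w:t', NS))
            found[kind].append(text)
    return found
```

Calling find_vert_align_runs('test.docx') on the file saved in the question returns a dictionary with the same two keys as above, each mapping to a list of run texts.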
