Why does lxml cut out one piece of an XML file?

Question

I am using the pyspellchecker spell-checking library to post-correct the OCR output of a French text.

I use lxml to extract only the raw text from a TEI XML file, so that I can apply the spell checker afterwards.

The corrections are applied without a problem, but lxml seems to drop everything that follows a nested tag (here, after <hi rend="i">emblèmes</hi>), so that this passage:

qui, par le moyen des <hi rend="i">emblèmes</hi>,
explique ou repréfente la doétrine
des anciens temps fur les diverfes
opérations de la nature, fur les différents
états de la vie humaine, fur
les vertus &amp; fur les vices , fur
les sorts heureux ou malheureux.
Ainfi, par exemple, des montagnes
sous terre fignifîent l’humilité, &amp; la
difpolîtion ou la longueur de différentes
lignes combinées fervent à exprimer
les effets de cette vertu ( i).</p><p rend="small">(i) Notice de l’Y-king, par M. Vifdeîau,
à la fin de la traduction du Chufcing.</p>

when parsed and transformed into a .txt file, becomes:

qui par le moyen des
emblèmes
(i) notice de l’y-king par m vifdeîau à la fin de la traduction du chancing

So the whole passage from "explique ou repréfente" to "( i)." is missing.

How can I restore it?

Python code:

import os, re, glob, csv
from spellchecker import SpellChecker
from lxml import etree
from collections import Counter 

# ignore hidden files in the directory with the input XML files (e.g. '._5419000_r.xml') 
def listdir_nohidden(path):
    return glob.glob(os.path.join(path, '*'))

# specify the input files to be corrected
directory_in = listdir_nohidden("./sample_in/")

# parse each input file and strip the .xml extension from its name
for file_in in directory_in:
    tree = etree.parse(file_in)
    root = tree.getroot()
    file_in = os.path.basename(file_in)
    file_in = os.path.splitext(file_in)[0]
    # print(file_in) # 5419000_r, test
    
    # create new .txt files on which the corrections will be applied
    file_out = '{}'.format(file_in)+'.txt'
    # print(file_out) # 5419000_r.txt, test.txt
    directory_out = os.path.join("./sample_out/", file_out)
  
    # create new .csv files with the errors, corrections and error frequencies
    corr_out = os.path.join('./csv/', file_in+'.csv')
     
    # remove special characters
    car_spec = ['■', '•', '%', '*', '#', '+', '^', '\\', '$', '>', '<', '£', '{', '}'] 
    
    # generate a .csv sheet
    with open(directory_out, 'w') as f, open(corr_out, 'w') as fout:
        writer = csv.writer(fout)
        writer.writerow(["Erreur"'\t' "Correction"'\t' "Fréquence"'\t']) 
        
        # remove the XML tags in order to get the text only
        for elem in root.iter('*'):
            if elem.text is not None:
                text = elem.text.strip()
                if text: 
                    for c in car_spec:
                        text = text.replace(c,'')
                    
                    # preprocessing
                    text = re.sub('&', 'et', text) 
                    text = re.sub('« \n', '', text) # concatenate the words separated by the hyphen, represented as a quotation mark 
                                                    
                    text = re.sub(" +", " ", text)  # reduce the multiple spaces into one simple space
                    text = text.lower() # lowercase the text 
                    text = text.replace("\n", " ") # so that each line starts from the very beginning, and not after a space
                                                   
                    # replace straight apostrophes with typographic ones to avoid parsing problems
                    text = text.replace("'", "’") 
                    
                    # delete the space next to certain punctuation marks
                    text = text.replace(' ,', ',') 
                    text = text.replace(' .','. ')
                    text = text.replace(' :',':')
                    text = text.replace(' ;',';')
                    text = text.replace(' !','!')
                    text = re.sub(r'\s\?', '?', text)
                    text = text.replace(' "','"')
                    text = text.replace('( ','(')
                    text = text.replace(' )',')') 
                    text = text.replace(' –','-')
                    
                    # replace the remaining en dashes with a hyphen
                    text = text.replace('–', '-') 
                    
                    # remove the punctuation marks at the end of a token because
                    # they prevent the corrector from correcting the sequence
                    # 'token + punctuation mark', even if the token is indeed written incorrectly
                    # e.g.: 'jeuneffe,' (with comma) > 'jeuneffe' (incorrect)
                    # instead of 'jeuneffe' (without comma) > 'jeunesse' (correct)
                    text = re.sub(r'(?<=\w)[,;:?!.]', '', text)
                    
                    # define the French spell checker 
                    # pyspellchecker
                    spell = SpellChecker(language='fr')

                    # tokenise the text with str.split() (e.g.: 'l’empire')
                    # because pyspellchecker's own tokeniser splits badly (e.g.: 'l', 'empire')
                    token_list = text.split()

                    for t in token_list:
                        # correct neither the tokens with an apostrophe (e.g.: l’empire, d’art, s’étend...)
                        # nor those in parentheses (e.g.: (1716-1790))
                        r1 = re.findall(r"(l’\w+|l’\w+-\w+|d’\w+|d’\w+|qu’\w+|c’\w+|n’\w+|j’\w+|lorfqu’\w+|eft|\w+.*?\)|\(.*?.\)|\(.*$)", t)
                        spell.word_frequency.load_words(r1)
                        a = spell.known(r1)  # words such as 'l’empire', 'd’art', 's’étend' are not
                                             # in the dictionary of correct words
                        
                    # correct the tokens in the .txt file
                    # extract the errors, their frequencies and their corrections in a .csv
                    misspelled = spell.unknown(token_list)
                 
                    for m in misspelled:
                        corrected = spell.correction(m)
                        if m in token_list:
                            m_freq = token_list.count(m)
                            # print(m_freq)
                        # print(m, corrected, str(m_freq))
                        text = text.replace(m, corrected)
                        # f.write(c.replace('clafliques', 'classiques'))

                        fout.write(m+'\t' + corrected+'\t' + str(m_freq)+' \n')
                    # print(text)
                    f.write(text + "\n")

Input XML:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="fr" n="5419000" xml:id="cb30263946g">
  <teiHeader>
<fileDesc>
<titleStmt>
<title>Les livres classiques de l'empire de la Chine</title>
<author role="Auteur du texte" key="11909957">Confucius (0551?-0479? av. J.-C.)</author>
<respStmt>
  <resp key="40">Annotateur</resp>
  <name key="12176450">Pluquet, François-André-Adrien (1716-1790)</name>
</respStmt>
<respStmt>
  <resp key="680">Traducteur</resp>
  <name key="16653645">Noël, François (1651-1729)</name>
</respStmt>
</titleStmt>
<publicationStmt>
<publisher>TGB (BnF – OBVIL)</publisher>
</publicationStmt>
<seriesStmt>
<title level="s">Les livres classiques de l'empire de la Chine</title>
<title level="a">Tome 2</title>
<biblScope unit="volumes" n="6"/>
<idno>cb30263946g</idno>
</seriesStmt>
<sourceDesc>
<bibl>
<idno>http://gallica.bnf.fr/ark:/12148/bpt6k54190001</idno>
<publisher>Barrois aîné et Barrois jeune</publisher>
<date when="1784">1784</date>
</bibl>
</sourceDesc>
</fileDesc>
</teiHeader>
  <text>
    <body><pb xml:id="PAG_00000001" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f1.image"/>
<pb xml:id="PAG_00000002" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f2.image"/>
<pb xml:id="PAG_00000003" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f3.image"/>
<pb xml:id="PAG_00000004" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f4.image"/><div><head>Livres classiques</head><p rend="left">
DE L’EMPIRE .
</p></div><div><head>De la chine.</head><pb xml:id="PAG_00000005" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f5.image"/>
<pb xml:id="PAG_00000006" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f6.image"/>
<pb xml:id="PAG_00000007" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f7.image"/>
<pb xml:id="PAG_00000008" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f8.image"/></div><div><head>Observations</head><p rend="left small">SUR</p><p rend="center small">LES LIVRES CLASSIQUES</p><p rend="center small">DE L’EMPIRE</p><p rend="center small">DE LA CHINE.</p><p rend="small">.LES Chinois ont deux sortes de
livres clafliques ou canoniques : les
Kings, ou les livres canoniques du
premier ordre ; &amp; les Ssée-chu, ou
livres canoniques dusecond ordre.</p><p rend="small">Les Kings sont au nombre de
cinq ; l’Y-king, le Chu-king,lc
Chi-king, le Tchun-tfiou &amp; le Lild.</p><p rend="left small">L’Y-king remonte à la plus haute
<hi rend="i">Tome II. a</hi></p><p rend="left"><hi rend="i">'\</hi>
<pb xml:id="PAG_00000009" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f9.image"/>ij O B S E K.VATI ON S.</p><p rend="small">antiquité ; on l’attribue en grande
partie à Fo - hi : c’eft un ouvrage
qui, par le moyen des <hi rend="i">emblèmes</hi>,
explique ou repréfente la doétrine
des anciens temps fur les diverfes
opérations de la nature, fur les différents
états de la vie humaine, fur
les vertus &amp; fur les vices , fur
les sorts heureux ou malheureux.
Ainfi, par exemple, des montagnes
sous terre fignifîent l’humilité, &amp; la
difpolîtion ou la longueur de différentes
lignes combinées fervent à exprimer
les effets de cette vertu ( i).</p><p rend="small">(i) Notice de l’Y-king, par M. Vifdeîau,
à la fin de la traduction du Chufcing.</p>
</div></body>
  </text>
</TEI>

The output text:

les livres classiques de l’empire de la chine
confucius (0551-0479 av j-c)
innovateur
paquet françois-andré-adrien (1716-1790)
traducteur
noël françois (1651-1729)
rgb (bnf- obvil)
les livres classiques de l’empire de la chine
tome 2
cb30263946g
http//gallicabnffr/ark/12148/bpt6k54190001
barrons aîné et barrons jeune
1784
livres classiques
de l’empire 
de la chine
observations
sur
les livres classiques
de l’empire
de la chine
les chinois ont deux sortes de livres classiques ou canonique les kings ou les livres canonique du premier ordre et les ssée-chu ou livres canonique second ordre
les kings sont au nombre de cinq l’y-king le chu-kinglc thinking le tchun-tfiou et le lily
l’y-king remonte à la plus haute
tome ii a
a
antiquité on l’attribue en grande partie à fo - hi c’eft un ouvrage qui par le moyen des
emblèmes
(i) notice de l’y-king par m vifdeîau à la fin de la traduction du chancing

The text you want is in the `.tail` attribute of the element that precedes it. It's not cut out at all; your code is just ignoring it when it looks at nothing but `.text`. — Charles Duffy, Jul 26 '21 at 14:41
BTW, in the future, please try to make minimal reproducible examples more minimal. The shortest code and sample data strictly necessary to reproduce this problem, when run without changes, are much shorter than what's given here. — Charles Duffy, Jul 26 '21 at 14:45
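
To illustrate the point made in the comments, here is a minimal sketch of how lxml splits mixed content between `.text` and `.tail`. The XML string is a trimmed-down stand-in for the paragraph in the question, and `paragraph_text` is a hypothetical helper name, not something from the original code.

```python
from lxml import etree

# A trimmed-down stand-in for the TEI paragraph in the question.
xml = (
    '<p rend="small">qui, par le moyen des <hi rend="i">emblèmes</hi>,\n'
    'explique ou repréfente la doétrine</p>'
)
p = etree.fromstring(xml)
hi = p[0]

# .text holds only the text before the first child element; the text that
# follows a child element is stored on that child as .tail.
print(repr(p.text))   # 'qui, par le moyen des '
print(repr(hi.text))  # 'emblèmes'
print(repr(hi.tail))  # ',\nexplique ou repréfente la doétrine'


def paragraph_text(elem):
    """Return all text inside elem, including the tails of nested elements."""
    return "".join(elem.itertext())


full = paragraph_text(p)
print(full)
```

In the question's loop, one possible adaptation (an assumption, not tested against the full file) is to iterate over block-level elements only, for example `root.iter('{http://www.tei-c.org/ns/1.0}p')`, and call `"".join(elem.itertext())` on each, rather than reading `elem.text` for every element. Calling `itertext()` on every element returned by `root.iter('*')` would emit nested text more than once.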
