Python -- How to split headers/chapters into separate files automatically

Question

I'm converting text directly to epub and I'm having a problem automatically splitting the HTML book file into separate header/chapter files. At the moment, the code below partially works but only creates every other chapter file. So half the header/chapter files are missing from the output. Here is the code:

def splitHeaderstoFiles(fpath):

    infp = open(fpath, 'rt', encoding=('utf-8'))
    for line in infp:

        # format and split headers to files
        if '<h1' in line:

            #-----------format header file names and other stuff ------------#

            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding=('utf-8')) as outfp:

                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp = outfp.write(line)

                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue

    infp.close()

The problem occurs in the second for loop at the bottom of the code, where I look for the next h1 tag to stop the split. I cannot use seek() or tell() to move back one line so that the program finds the next header/chapter on the next iteration. Apparently Python does not allow these calls on a text file while a for loop is iterating over it (the loop calls iter() and next() on the file implicitly). I just get a 'can't do non-zero cur-relative seeks' error.

I've also tried a while line != ' ' loop combined with readline(), which gives the same error as above.

Does anyone know an easy way to split HTML headers/chapters of varying lengths into separate files in Python? Are there any special Python modules (such as pickle) that could help make this task easier?

I'm using Python 3.4

My grateful thanks in advance for any solutions to this problem...

For parsing html you can use libraries such as Beautiful Soup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ — Saeid, Nov 23 '15 at 00:06
Thanks for that advice. I'll try that module and get back to you.. — Bill Thompson, Nov 23 '15 at 00:34
https://github.com/vporton/htmlsplit splits an XHTML file into chapters and generates ToC (not Python however) — porton, May 25 '16 at 23:20

score 2 · Answer 1 · answered Nov 23 '15 at 00:38

I ran into a similar problem a while ago; here is a simplified solution:

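from itertools import count

chapter_number = count(1)
# Text mode ('wt') because the lines read below are str, not bytes.
output_file = open('000-intro.html', 'wt', encoding='utf-8')

with open('index.html', 'rt', encoding='utf-8') as input_file:
    for line in input_file:
        if '<h1' in line:
            # A new chapter starts: close the previous file, open the next.
            output_file.close()
            output_file = open('{:03}-chapter.html'.format(next(chapter_number)), 'wt', encoding='utf-8')
        output_file.write(line)

output_file.close()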

In this approach, the block of text leading up to the first h1 tag is written into 000-intro.html, the first chapter is written into 001-chapter.html, and so on. Please modify it to taste.

The solution is a simple one: Upon encountering the h1 tag, close the last output file and open a new one.
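For the case raised in the question, where each chapter file is named after its heading, the same close-and-reopen idea might be sketched as below. The function name `split_chapters`, the `NN_Title.html` naming pattern, and the tag-stripping regex are assumptions made for illustration, not part of the original answer.

```python
import os
import re

TAG_RE = re.compile(r'<[^>]*>')  # crude tag stripper; assumes simple headings


def split_chapters(src_path, out_dir):
    """Write each <h1> section of src_path to its own file in out_dir.

    Returns the list of files written, in order. Text before the first
    <h1> goes to 00_intro.html (a hypothetical naming scheme).
    """
    written = []
    chapter = 0
    out = open(os.path.join(out_dir, '00_intro.html'), 'wt', encoding='utf-8')
    written.append(out.name)
    try:
        with open(src_path, 'rt', encoding='utf-8') as src:
            for line in src:
                if '<h1' in line:
                    # New chapter: close the previous file, open the next one.
                    out.close()
                    chapter += 1
                    title = '_'.join(TAG_RE.sub('', line).split())
                    name = '{:02d}_{}.html'.format(chapter, title)
                    out = open(os.path.join(out_dir, name), 'wt', encoding='utf-8')
                    written.append(out.name)
                # The <h1> line itself lands in the newly opened file.
                out.write(line)
    finally:
        out.close()
    return written
```

Because there is only one loop over the input, no line is ever read twice or skipped, so nothing needs to be rewound.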

Hai Vu....An interesting solution and my thanks to you for that. I'll try it and get back to you. I must add that I was really disappointed when I found out that you could not use seek() or tell() in a for loop to rewind the fp in python. Really surprised me. I may also have some problems with your solution(I'm learning python at the moment) because I heavily format the chapter header names before writing to the output file. But this isn't so bad as the rewind problem itself. Thanks again for your compact and original solution. — Bill Thompson, Nov 23 '15 at 01:24

score 0 · Answer 2 · edited May 23 '17 at 12:07

You are looping over your input file twice, which is likely causing your problems:

for line in infp:
    ...
    with open(path, 'wt', encoding=('utf-8')) as outfp:            
        ...
        for line in infp:
            ...

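The file object is its own iterator, so both for loops advance the same position in the file. When the inner loop reaches the next <h1> line and breaks, that line has already been consumed, so the outer loop never sees it; this is why every other chapter is skipped.

You might try transforming your for loops into while loops with explicit readline() calls, so that it is clear where each line is read: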

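while True:
    line = infp.readline()
    if not line:  # readline() returns '' at end of file
        break
    if '<h1' in line:
        with open(...) as outfp:
            while True:
                line = infp.readline()
                # stop at end of file or at the next heading; note that
                # the <h1> line read here is lost to the outer loop
                if not line or '<h1' in line:
                    break
                outfp.write(line)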
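A small experiment, using io.StringIO as an assumed stand-in for the real book file, shows how the nested for loops in the question behave. The sample content is made up for illustration.

```python
import io

# A stand-in for the opened book file (hypothetical content).
infp = io.StringIO('<h1>A</h1>\na\n<h1>B</h1>\nb\n<h1>C</h1>\nc\n')

# A file object is its own iterator, so nested loops share one position.
assert iter(infp) is infp

found = []
for line in infp:
    if '<h1' in line:
        found.append(line.strip())
        for line in infp:          # continues from the same position
            if '<h1' in line:      # this heading is consumed here ...
                break              # ... and the outer loop never sees it

print(found)  # ['<h1>A</h1>', '<h1>C</h1>']
```

Chapter B is never detected by the outer loop, which matches the "every other chapter" symptom described in the question.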

Alternatively, you may wish to use an HTML parser (e.g., BeautifulSoup). Then you can do something like what is described here: https://stackoverflow.com/a/8735688/65295.
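If a third-party parser feels heavy, the standard library's html.parser can at least pick out the heading text more robustly than a substring test. The sketch below swaps in html.parser for BeautifulSoup (an assumption, made only to avoid an extra dependency), and the class name `H1Collector` is hypothetical.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collect the text content of every <h1> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True
            self.titles.append('')  # start a new, empty title

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # Text inside nested tags such as <em> is still reported here.
        if self.in_h1:
            self.titles[-1] += data


collector = H1Collector()
collector.feed('<h1 class="x">One <em>A</em></h1><p>text</p><h1>Two</h1>')
print(collector.titles)  # ['One A', 'Two']
```

This only extracts titles; splitting the body text between headings would still need one of the loop-based approaches.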

Update from comment - essentially, read the entire file all at once so you can freely move back or forward as necessary. This probably won't be a performance issue unless you have a really really big file (or very little memory).

lines = infp.readlines() # read the entire file
i = 0
while i < len(lines):
    if '<h1' in lines[i]:
        with open(...) as outfp:
            outfp.write(lines[i])  # the heading line itself
            j = i + 1
            while j < len(lines):
                if '<h1' in lines[j]:
                    break
                outfp.write(lines[j])
                j += 1
        # line j has an <h1> (or j is past the end); set i to j so we
        # detect it at the top of the next loop iteration.
        i = j
    else:
        i += 1
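Once the whole file is in a list, the same grouping can also be sketched with slicing instead of manual index bookkeeping. The helper name `group_chapters` is hypothetical; like the loop above, it ignores any lines before the first <h1>.

```python
def group_chapters(lines):
    """Split a list of lines into chapters, each starting at an <h1> line."""
    # Indices of every heading line.
    starts = [i for i, line in enumerate(lines) if '<h1' in line]
    # Each chapter ends where the next one starts; the last ends at EOF.
    ends = starts[1:] + [len(lines)]
    return [lines[s:e] for s, e in zip(starts, ends)]


sample = ['preface\n', '<h1>A</h1>\n', 'a\n', '<h1>B</h1>\n']
print(group_chapters(sample))
# [['<h1>A</h1>\n', 'a\n'], ['<h1>B</h1>\n']]
```

Each returned sublist can then be written to its own file with writelines().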

answered Nov 23 '15 at 00:12

Seth

Thanks for your suggestion Seth, but I'm afraid that your code -- using while with infp.readlines() -- won't work because it does not rewind the file infp pointer back by one line. I need to be able to do this so that the next h1 html tag can be found on the next iteration at the top of the code. So this problem is not an html parsing problem -- it's a file pointer rewind problem. How can I rewind the infp back by one line in python when infp.seek() or infp.tell() don't work? That's the question that needs to be answered. – Bill Thompson Nov 23 '15 at 01:10
I see what you mean. You might just read the entire file into memory first - see update. – Seth Nov 23 '15 at 01:20
Seth...Thanks for the update -- it looks workable. I'll try slurping up the file into memory first as you suggest. I'll get back to you. – Bill Thompson Nov 23 '15 at 03:43
Hi Seth..Just tested your update code and it's not working. Here is what I did: – Bill Thompson Nov 23 '15 at 04:22
Hi @Seth..Just tested your update code and its not working. Here is what I got: print(inpath) --> inpath was there infp = open(inpath, 'rt') lines = infp.readlines() # read the entire file print(str(len(lines))) --> str(lines) = 0 while i < len(lines): --> #if len(lines) = 0 then no while loop iterations occurs if '
– Bill Thompson Nov 23 '15 at 04:29

Bill Thompson · Answer 3 · 2015-11-25T03:19:17.200

I eventually found the answer to the above problem. The code below does a lot more than just get the file header. It also loads two parallel lists, one with the formatted file names (with extension) and one with the pure header names, so that I can use these lists to fill in the title and formatted filename in these html files within a while loop in one hit. The code now works well and is shown below.

import os
import re
import shutil
from string import capwords

# addMainHeaders(), logmsg27() and DEBUG_FLAG are not shown here.

def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    file_path_names = []
    pure_header_names = []

    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding=('utf-8')) as infp:

        for line in infp:

            if '<h1' in line:
                # strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header

                # Add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header

                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1

                # create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding=('utf-8'))
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1

                # Add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True

            # add header bodytext
            elif write_bodytext == True:
                outfp.write(line)

    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0

    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding=('utf-8'))
        infp = open(file_path_names[i], 'rt', encoding=('utf-8'))

        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>', '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)

        # add the html tail (line still holds the last line of the file)
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write('  </body>' + '\n</html>')

        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1

    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)

    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)

    # returned list values are also used
    # later to create epub opf and ncx files
    return(file_path_names, pure_header_names)
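The file-name formatting step in the function above could be pulled out into a small helper, which makes it easier to test on its own. This is only a sketch: the name `chapter_filename` and its parameters are hypothetical, and `'{:02d}'` is used as an assumed equivalent of the manual "add 0 for count below 10" branch.

```python
import re
from string import capwords


def chapter_filename(h1_line, number, ext='.xhtml'):
    """Return (file_name, title) for one <h1> line.

    Strips html tags, capitalises each word, zero-pads the chapter
    number to two digits and replaces spaces with underscores.
    """
    title = capwords(re.sub(r'<.*?>', '', h1_line))
    file_name = '{:02d}_{}{}'.format(number, title.replace(' ', '_'), ext)
    return file_name, title


print(chapter_filename('<h1 class="c">the first  chapter</h1>\n', 3))
# ('03_The_First_Chapter.xhtml', 'The First Chapter')
```

capwords() also collapses repeated spaces and drops the trailing newline, so the result is safe to use as part of a file name.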

@Hai Vu and @Seth -- Thanks for all your help.
