parse, find chapters, and write out as separate files

Question

I am having difficulty getting the right code to parse the chapters out of this ebook and then write each of the 27 chapters to its own text file. The farthest I've gotten is to print "CHAPTER-1.txt". I don't want to hard-code anything, and I'm unsure where I've missed the mark.

infile = open('dracula.txt', 'r')

readlines = infile.readlines()

toc_list = readlines[74:185]

toc_text_lines = []
for line in toc_list:
    if len(line) > 1:
        stripped_line = line.strip()
        toc_text_lines.append(stripped_line)

#print(len(toc_text_lines))

chaptitles = []
for text_lines in toc_text_lines:
    split_text_line = text_lines.split()
    if split_text_line[-1].isdigit():
        chaptitles.append(text_lines)

#print(len(chaptitles))
print(chaptitles)

infile.close()

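import re

with open('dracula.txt') as f:
    book = f.readlines()

chapters_names_list = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.']

while book:
    line = book.pop(0)
    # A heading is a line containing "CHAPTER" that is followed by a blank line
    if "CHAPTER" in line and book and book.pop(0) == '\n':
        for title in chapters_names_list:
            # Writes only the heading line so far, not the chapter text
            with open("{}.txt".format(title), 'w') as out:
                out.write(line)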

score 0 · Answer 1 · answered Nov 05 '19 at 02:36

I think you could benefit from generators. If one of the ebooks is too big to fit into memory, reading it all at once with readlines() will cause problems.

What you can do is build a sort of data processing pipeline. First, look for the file (ebook.txt) in the filesystem, keeping in mind that all the functions should be as general as possible. Once we have the filename, we open it and yield one line at a time. Finally, we scan each line for 'CHAPTER I.', 'CHAPTER II.', etc.

import os
import re
import fnmatch

def find_files(pattern, path):
    """
    Find all the filenames under path that match a shell wildcard
    pattern, so you avoid hardcoding a filename such as 'dracula.txt'.
    """
    for root, dirs, files in os.walk(path):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def file_opener(filenames):
    """
    Open a sequence of filenames one at a time
    and make sure to close the file once we are done 
    scanning its content.
    """
    for filename in filenames:
        if filename.endswith('.txt'):
            # Yield inside the check so non-.txt files are skipped entirely
            with open(filename, 'rt') as f:
                yield f

def chain_generators(iterators):
    """
    Chain a sequence of iterators together
    """
    for it in iterators:
        # Look up yield from if you're unsure what it does
        yield from it

def grep(pattern, lines):
    """
    Yield the lines in which a regex pattern is found, e.g. 'CHAPTER I'
    """
    pat = re.compile(pattern)
    for line in lines:
        if pat.search(line):
            yield line

# A simple way to use these functions together

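filenames = find_files('dracula*', 'Path/to/files')
files = file_opener(filenames)
lines = chain_generators(files)
# The dot is escaped so it matches a literal period; an unescaped
# 'CHAPTER I.' would also match 'CHAPTER II.', 'CHAPTER IV.', etc.
each_line = grep(r'CHAPTER I\.', lines)
for match in each_line:
    print(match)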

You can build on top of this implementation to accomplish what you're trying to do.
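
As one possible way to build on it, here is a minimal sketch that streams lines and starts a new output file whenever a heading line appears. The helper name split_chapters, the heading pattern, and the CHAPTER-<n>.txt naming are assumptions for illustration, not part of the code above. The pattern assumes each heading sits on its own line as "CHAPTER" plus a Roman numeral, so adjust it to the actual text.

```python
import os
import re

# Assumed heading format: a line holding only "CHAPTER" plus a Roman numeral,
# with an optional trailing period. Adjust to match the real file.
HEADING = re.compile(r'^CHAPTER [IVXLC]+\.?\s*$')

def split_chapters(lines, out_dir='.'):
    """
    Write each chapter to its own file: CHAPTER-1.txt, CHAPTER-2.txt, ...
    Lines before the first heading are skipped.
    Returns the list of filenames written.
    """
    written = []
    out = None
    try:
        for line in lines:
            if HEADING.match(line):
                # A new chapter starts: close the previous file, open the next
                if out is not None:
                    out.close()
                name = os.path.join(
                    out_dir, 'CHAPTER-{}.txt'.format(len(written) + 1))
                out = open(name, 'w')
                written.append(name)
            if out is not None:
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Any iterable of lines works as input, so you can pass it an open file or the lines generator from the pipeline. Only one line is held in memory at a time, and no chapter names are hardcoded.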

Let me know if this helped.

os is a module that provides functions for interacting with the operating system. — armrasec, Nov 05 '19 at 02:42
i'm supposed to print these out into text files and i'll be honest, i haven't seen a lot of the code that you are using..:( — isniffbooks, Nov 05 '19 at 02:55
for range in book(28): print("CHAPTER-".txt) *i added this to my code and got a TypeError: list is not callable — isniffbooks, Nov 05 '19 at 02:55
Where did you add that piece of code? In the last loop, for match in each_line:? — armrasec, Nov 05 '19 at 04:55
yes, i ran it on python...just not sure where i'm effing up. — isniffbooks, Nov 05 '19 at 22:13
lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.', 'CHAPTER IV.', 'CHAPTER V.', 'CHAPTER VI.', 'CHAPTER VII.', 'CHAPTER VIII.', 'CHAPTER IX.', 'CHAPTER X.', 'CHAPTER XI.', 'CHAPTER XII.', 'CHAPTER XIII.', 'CHAPTER XIV.', 'CHAPTER XV.', 'CHAPTER XVI.', 'CHAPTER XVII.', 'CHAPTER XVIII.', 'CHAPTER XIX.', 'CHAPTER XX.', 'CHAPTER XXI.', 'CHAPTER XXII.', 'CHAPTER XXIII.', 'CHAPTER XXIV.', 'CHAPTER XXV', 'CHAPTER XXVI.', 'CHAPTER XXVII.'] chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:27] chapter = list(zip(lst, chap)) for c in chapter: print(''.join(c)) — isniffbooks, Nov 05 '19 at 22:40
so i added this and it prints out the chapters, but just the names and not the rest of the content. i want to avoid hardcoding but this is a struggle. — isniffbooks, Nov 05 '19 at 22:40
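
On the hardcoding point in the last comments: one way to avoid the hand-typed list is to put the heading pattern in a capturing group, so that re.split returns the headings together with the text between them. This is only a sketch. The names chapters_from_text and write_chapters and the heading pattern are assumptions, and re.split needs a single string (f.read()), not the list returned by readlines().

```python
import os
import re

def chapters_from_text(text):
    """
    Return a list of (heading, body) pairs found in text.
    Assumes a heading is 'CHAPTER' plus a Roman numeral alone on a line.
    """
    # The parentheses make re.split keep each matched heading in the result:
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = re.split(r'^(CHAPTER [IVXLC]+\.?)[ \t]*$', text,
                     flags=re.MULTILINE)
    headings = parts[1::2]
    bodies = parts[2::2]
    return list(zip(headings, bodies))

def write_chapters(text, out_dir='.'):
    """Write every chapter to CHAPTER-<n>.txt and return the filenames."""
    names = []
    for number, (heading, body) in enumerate(chapters_from_text(text), start=1):
        name = os.path.join(out_dir, 'CHAPTER-{}.txt'.format(number))
        with open(name, 'w') as out:
            out.write(heading + body)
        names.append(name)
    return names
```

To use it, read the whole book with f.read() and pass the string to write_chapters. If the table of contents uses the same heading format, those entries will match too, so slice the text to start after the contents first.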
