I'm trying to find a way of making this process work pythonically, or at all. Basically, I have a really long text file that is split into lines. Every so many lines there is one that is mainly uppercase, which should roughly be the title of that particular section. Ideally, I'd want each title and everything after it (up to the next title) to go into a text file, using the title as the name of the file. This would have to happen 3039 times in this case, as that is how many titles there are. My process so far is this: I created a function that takes a line of text and tells me whether it's mostly uppercase.

import numpy as np

def mostly_uppercase(text):
    threshold = 0.7
    # An empty line has no characters to average, so it cannot be a title.
    if not text:
        return False
    # Fraction of characters that are uppercase (True counts as 1, False as 0).
    upper_percentage = np.mean([character.isupper() for character in text])
    return bool(upper_percentage >= threshold)

Afterwards, I used a counter to record the index of each headline line, and then paired up consecutive indices to slice out the articles:

counter = 0

headline_indices = []

for line in page_text:
    if mostly_uppercase(line):
        print(line)
        headline_indices.append(counter)
    counter+=1

headlines_with_articles = []
headline_indices_expanded = [0] + headline_indices + [len(page_text)]  # slice ends are exclusive, so no -1

for first, second in list(zip(headline_indices_expanded, headline_indices_expanded[1:])):
    article_text = (page_text[first:second])
    headlines_with_articles.append(article_text)

All of that seems to be working fine as far as I can tell. But when I try to write the pieces I want to files, all I manage to do is write the same text into every one of the txt files.

for i in range(100):
    out_pathname = '/sharedfolder/temp_directory/' + 'new_file_' + str(i) + '.txt'
    with open(out_pathname, 'w') as fo:
        fo.write(articles_filtered[2])

Edit: This got me halfway there. Now, I just need a way of naming each file with the first line.

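for i, text in enumerate(articles_filtered):
    # Note the '/' after the directory name; without it the files land in /sharedfolder.
    with open('/sharedfolder/temp_directory/' + str(i + 1) + '.txt', 'w') as fo:
        fo.write(str(text))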
Jortiz

1 Answer

One conventional way of processing a single input file involves using a Python with statement and a for loop, in the following way. I have also adapted a good answer from someone else for counting uppercase characters, to get the fraction you need.

def mostly_upper(text):
    threshold = 0.7
    # A blank line cannot be a title, and would otherwise divide by zero.
    if not text:
        return False
    ## adapted from https://stackoverflow.com/a/18129868/131187
    upper_count = sum(1 for c in text if c.isupper())
    return upper_count/len(text) >= threshold

first = True
out_file = None
with open('some_uppers.txt') as some_uppers:
    for line in some_uppers:
        line = line.rstrip()
        if first or mostly_upper(line):
            first = False
            if out_file: out_file.close()
            out_file = open(line+'.txt', 'w')
        print(line, file=out_file)
if out_file: out_file.close()  # out_file is still None if the input was empty

In the loop, we read each line and ask whether it's mostly uppercase. If it is, we close the file that was being used for the previous collection of lines and open a new file for the next collection, using the contents of the current line as its title. Every line, including the title line itself, is then written to whichever file is currently open.
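The same open-on-title logic can also be sketched without touching the file system, which makes it easier to check on a small sample. This is only an illustration under a couple of assumptions: `split_sections` is a hypothetical helper name, the lines are already in a list, and (unlike the loop above) the title is kept separate from the body rather than repeated as the first line.

```python
def mostly_upper(text, threshold=0.7):
    """Return True if at least `threshold` of the characters are uppercase."""
    if not text:  # blank lines cannot be titles and would divide by zero
        return False
    upper_count = sum(1 for c in text if c.isupper())
    return upper_count / len(text) >= threshold


def split_sections(lines):
    """Group lines into (title, body_lines) pairs (hypothetical helper).

    The first line always starts a section, even if it is not mostly uppercase.
    """
    sections = []
    for line in lines:
        line = line.rstrip()
        if not sections or mostly_upper(line):
            # A title line (or the very first line) starts a new section.
            sections.append((line, []))
        else:
            # Any other line belongs to the most recent section.
            sections[-1][1].append(line)
    return sections


sample = ["FIRST TITLE", "some text", "", "SECOND TITLE", "more text"]
sections = split_sections(sample)
for title, body in sections:
    print(title, len(body))
```

Each pair can then be written to its own file, which keeps the splitting and the writing as two separate steps.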

I allow for the possibility that the first line might not be a title. In this case the code creates a file with the contents of the first line as its name, and proceeds to write everything it finds to that file until it does find a title line.
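Because the file name comes straight from a line of text, a title containing characters such as `/` or `:` may not be accepted by the file system. One possible way to guard against that is sketched below. `safe_filename` and `write_section` are hypothetical helpers, not part of the code above, and the replacement rule (anything other than word characters, dots, and dashes becomes `_`) is just one conservative choice.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title, fallback="untitled"):
    """Turn a title line into a conservative file name (hypothetical helper)."""
    # Replace each run of characters that are not word characters
    # (letters, digits, underscore), dots, or dashes with a single "_".
    cleaned = re.sub(r"[^\w.-]+", "_", title).strip("._")
    return (cleaned or fallback) + ".txt"


def write_section(directory, title, lines):
    """Write the title followed by its lines to a file named after the title."""
    path = Path(directory) / safe_filename(title)
    with open(path, "w", encoding="utf-8") as out_file:
        for line in [title, *lines]:
            print(line, file=out_file)
    return path


# Demonstration in a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    written = write_section(tmp, "NEWS: RAIN/SNOW EXPECTED", ["It may rain."])
    written_name = written.name
    written_text = written.read_text(encoding="utf-8")
```

Note that two titles which clean up to the same name would overwrite each other, so with thousands of sections it may be worth appending a counter to the name as well.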

Bill Bell