Selective text using Python

Question

I am a beginner in Python and am using it for my master's thesis, so I don't know much yet. I have a bunch of annual reports (10-Ks, as txt files) and I want to select all the text between "ITEM1." and "ITEM2.". I am using the re package. My problem is that sometimes these 10-Ks contain a section called "ITEM1A.". I want the code to recognize this, stop at "ITEM1A.", and write the text between "ITEM1." and "ITEM1A." to the output. In the code below, I tried to make it stop at "ITEM1A.", but it does not; it continues further because "ITEM1A." appears multiple times throughout the file. It would be ideal to make it stop at the first one it sees. The code is the following:

import os
import re

#path to where 10k are
saved_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/saved files/"

#path to where to save the txt with the selected text between ITEM 1 and ITEM 2
selected_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/10k_select/"

#get a list of all the items in that specific folder and put it in a variable
list_txt = os.listdir(saved_path)


for text in list_txt:
    file_path = saved_path+text
    file = open(file_path,"r+", encoding="utf-8")
    file_read = file.read()
    # looking between ITEM 1 and ITEM 2
    res = re.search(r'(ITEM[\s\S]*1\.[\w\W]*)(ITEM+[\s\S]*1A\.)', file_read)
    item_text_section = res.group(1)
    saved_file = open(selected_path + text, "w+", encoding="utf-8")     # save the file with the complete names
    saved_file.write(item_text_section)                                 # write to the new text files with the selected text
    saved_file.close()                                                  # close the file
    print(text)                                                         #show the progress
    file.close()

If anyone has any suggestions on how to tackle this, it would be great. Thank you!

Could you maybe post an (anonymized) sample of the data? That would help us. — axm__, Sep 29 '18 at 11:11
I also attached a full annual report on this website: https://ufile.io/4vge5 . Hope it helps — Adrian, Sep 29 '18 at 11:24

score 5 · Accepted Answer · answered Sep 29 '18 at 11:24

Try the following regex:

ITEM1\.([\s\S]*?)ITEM1A\.

Adding the question mark after the * makes the quantifier non-greedy (lazy), so the match stops at the first occurrence of ITEM1A. instead of running on to the last one.
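To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy patterns. The sample string is made up for illustration and is not taken from a real 10-K. The helper extract_item1 is a hypothetical name, and its fallback to "ITEM2." when "ITEM1A." is absent is an assumption based on what the question describes, not part of the regex above.

```python
import re

# Hypothetical sample text standing in for a 10-K; real filings will differ.
sample = (
    "ITEM1. Business overview. "
    "ITEM1A. Risk factors. See ITEM1A. for details. "
    "ITEM2. Properties."
)

# Greedy: [\s\S]* grabs as much as possible, so the match ends
# at the LAST "ITEM1A." in the text.
greedy = re.search(r'ITEM1\.([\s\S]*)ITEM1A\.', sample)

# Non-greedy: [\s\S]*? grabs as little as possible, so the match ends
# at the FIRST "ITEM1A." in the text.
lazy = re.search(r'ITEM1\.([\s\S]*?)ITEM1A\.', sample)


def extract_item1(text):
    """Return the text after "ITEM1." up to the first "ITEM1A." or "ITEM2.".

    Returns None when no match is found, so the caller can skip the file
    instead of failing on res.group(1). (Assumed behavior, for illustration.)
    """
    match = re.search(r'ITEM1\.([\s\S]*?)(?:ITEM1A\.|ITEM2\.)', text)
    return match.group(1) if match else None


print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
print(repr(extract_item1("ITEM1. Business. ITEM2. Properties.")))
```

In the loop from the question, checking the result for None before calling group(1) avoids an AttributeError on files where the headings are missing. If the filings write the heading with whitespace (as the pattern in the question seems to allow for), something like ITEM\s*1\. may be needed in place of ITEM1\.; that depends on the actual files.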

ARR

@Adrian do not forget to accept Ahmad's answer and also, next time, try to edit your answer to clarify instead of 'answering' the question :) – axm__ Sep 29 '18 at 11:51

@axm__ yes I will. Sorry for the confusion. This was my first post. Will keep in mind for the future! Thank you again! – Adrian Sep 29 '18 at 12:01
