I am a beginner in Python and I am using it for my master's thesis, so I don't know that much. I have a bunch of annual reports (10-K filings, in txt format) and I want to select all the text between "ITEM1." and "ITEM2.". I am using the re package. My problem is that sometimes, in those 10-Ks, there is a section called "ITEM1A.". I want the code to recognize this, stop at "ITEM1A.", and write the text between "ITEM1." and "ITEM1A." to the output. In the code below, I tried to make it stop at "ITEM1A.", but it does not; it continues further because "ITEM1A." appears multiple times throughout the file. It would be ideal to make it stop at the first one it sees. The code is the following:

import os
import re

#path to where 10k are
saved_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/saved files/"

#path to where to save the txt with the selected text between ITEM 1 and ITEM 2
selected_path = "C:/Users/Adrian PC/Desktop/Thesis stuff/10k abbot/python/Multiple 10k/10k_select/"

#get a list of all the items in that specific folder and put it in a variable
list_txt = os.listdir(saved_path)


for text in list_txt:
    file_path = saved_path+text
    file = open(file_path,"r+", encoding="utf-8")
    file_read = file.read()
    # looking between ITEM 1 and ITEM 2
    res = re.search(r'(ITEM[\s\S]*1\.[\w\W]*)(ITEM+[\s\S]*1A\.)', file_read)
    item_text_section = res.group(1)
    saved_file = open(selected_path + text, "w+", encoding="utf-8")     # save the file with the complete names
    saved_file.write(item_text_section)                                 # write to the new text files with the selected text
    saved_file.close()                                                  # close the file
    print(text)                                                         #show the progress
    file.close()

If anyone has any suggestions on how to tackle this, it would be great. Thank you!

Adrian

1 Answer

Try the following regex:

ITEM1\.([\s\S]*?)ITEM1A\.

Adding the question mark makes the quantifier non-greedy (lazy), so the match stops at the first occurrence of "ITEM1A." instead of running on to the last one.
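
As a rough illustration of how this could be used from Python, here is a minimal sketch. The sample text and the helper name extract_item1 are made up for the example, and the fallback to "ITEM2." when a file has no "ITEM1A." section is an assumption based on the question, not part of the regex above.

```python
import re

# Hypothetical sample standing in for one 10-K file; "ITEM1A." appears twice.
sample = (
    "ITEM1. Business\n"
    "We make widgets.\n"
    "ITEM1A. Risk Factors\n"
    "See also ITEM1A. above.\n"
    "ITEM2. Properties\n"
)

# The lazy quantifier (*?) grows the match one character at a time, so it
# ends at whichever of "ITEM1A." or "ITEM2." comes first after "ITEM1.".
pattern = re.compile(r"ITEM1\.([\s\S]*?)(?:ITEM1A\.|ITEM2\.)")


def extract_item1(text):
    """Return the text after ITEM1. up to the first ITEM1A. or ITEM2., else None."""
    match = pattern.search(text)
    # re.search returns None when nothing matches, so check before using group().
    return match.group(1) if match else None


print(extract_item1(sample))
```

Checking for None before calling group(1) matters in the loop from the question: a file without these markers would otherwise raise an AttributeError and stop the whole run. If the real files have whitespace between "ITEM" and the number, the pattern would need something like \s* in those spots.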

ARR
  • 1
    @Adrian do not forget to accept Ahmad's answer and also, next time, try to edit your answer to clarify instead of 'answering' the question :) – axm__ Sep 29 '18 at 11:51
  • 2
    @axm__ yes I will. Sorry for the confusion. This was my first post. Will keep in mind for the future! Thank you again! – Adrian Sep 29 '18 at 12:01