I have a .txt file that contains a lot of textual information I need for my research. I'm trying to write a program that does a keyword search (in my case, for the phrase "sold salt") and then writes to a new file, line by line, the text that starts with this phrase and cuts off at some point (I haven't decided where yet). The file is actually a book of digitized 17th-century documents written in old Russian, but schematically the text looks like this:
"sheet_№1
text text text text
text text
text text text text text text sold salt text text text text text sold salt text text text text text text text
text text text text
sheet_№1_reverse
text text sold salt text text text text text text text text text text text text text text"
So the text is really badly structured, and what I want is to have all the salt sale records, with their position in the whole text, in one file for my research.
Sorry for the long introduction, I just wanted to show what I have to deal with.
I first tried writing code with the docx library, but the only way I could get it to work was to underline the needed information in the .docx file and then extract it with code. That is not too bad, but it still takes time.
So I settled on the .txt format, and now I've got this:
key_1 = 'sold'
key_2 = 'salt'
f_old = open("text.txt", encoding='utf-8')
f_result = open("text_result.txt", 'w', encoding='utf-8')
for line in f_old:
    line = line.split()
    if len(line) == 1:
        for elem in range(len(line)):
            f_result.write(line[elem] + '\n')
    else:
        if key_1 in line and key_2 in line:
            for word in range(len(line)):
                if line[word] == key_1 and line[word + 1] == key_2:
                    for elem in line[word: word + 10]:
                        f_result.write(elem + ' ')
                    f_result.write('\n')
f_old.close()
f_result.close()
Based on the example above, it gives me this result:
"sheet_№1
sold salt text text text text text sold salt text
sold salt text text text text text text
sheet_№1_reverse
sold salt text text text text text text text text"
It is not a big deal to cut "sold salt" and other extra information, like at the end of the 2nd line, by hand, because I will have to do that anyway with lines that contain more information than I need. However, are there any ideas on how to cut a line if my keyword shows up in it two or more times?
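To make the question concrete, here is a minimal sketch of the kind of cut I mean, working on a list of words rather than on the file. The function name `cut_records` and the `max_len` parameter are made up for illustration, and the sketch assumes a record should end either where the next "sold salt" begins or after `max_len` words, whichever comes first.

```python
def cut_records(words, key_1="sold", key_2="salt", max_len=10):
    """Return one word list per "sold salt" record found in words.

    Each record starts where key_1 is directly followed by key_2 and stops
    before the next such start, or after max_len words, whichever is first.
    """
    # Indices where the two-word phrase starts; the range stops one short of
    # the end so words[i + 1] never runs past the list.
    starts = [i for i in range(len(words) - 1)
              if words[i] == key_1 and words[i + 1] == key_2]
    records = []
    for n, start in enumerate(starts):
        # End at the next phrase start if there is one, else at the line end.
        next_start = starts[n + 1] if n + 1 < len(starts) else len(words)
        end = min(start + max_len, next_start)
        records.append(words[start:end])
    return records


line = "text text sold salt a b c sold salt d e".split()
for record in cut_records(line):
    print(" ".join(record))
```

On this made-up line it prints `sold salt a b c` and then `sold salt d e`, so the second "sold salt" no longer leaks into the first record.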
My idea is to open text_result not only for writing but also for reading, and then cut the lines like this:
for line in f_result:
    line = line.split()
    if len(line) > 1:
        for word in line[::-1]:
            while line[word] != key_1:
                line.pop([word])
But it doesn't work if I put it in the code like this:
key_1 = 'sold'
key_2 = 'salt'
f_old = open("text.txt", encoding='utf-8')
f_result = open("text_result.txt", 'w+', encoding='utf-8')
for line in f_old:
    line = line.split()
    if len(line) == 1:
        for elem in range(len(line)):
            f_result.write(line[elem] + '\n')
    else:
        if key_1 in line and key_2 in line:
            for word in range(len(line)):
                if line[word] == key_1 and line[word + 1] == key_2:
                    for elem in line[word: word + 7]:
                        f_result.write(elem + ' ')
                    f_result.write('\n')
for line in f_result:
    line = line.split()
    if len(line) > 1:
        for word in line[::-1]:
            while line[word] != key_1:
                line.pop([word])
f_old.close()
f_result.close()
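One thing that may be involved: a file opened with 'w+' keeps a single position for both writing and reading, so right after the writes that position is at the end of the file and a read loop has nothing left to read. Below is a minimal sketch of that behavior, assuming a throwaway temporary file (the path is hypothetical and only used for this demonstration):

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "text_result.txt")

with open(path, "w+", encoding="utf-8") as f_result:
    f_result.write("sold salt text\n")
    f_result.write("sold salt text text\n")

    # The file position is now at the end, so iterating yields nothing.
    lines_without_seek = list(f_result)

    # Rewinding to the start makes the written lines readable.
    f_result.seek(0)
    lines_with_seek = list(f_result)

print(lines_without_seek)
print(lines_with_seek)
```

Even with `seek(0)`, the cutting loop itself would probably still fail: `for word in line[::-1]` yields the words themselves rather than indices, so `line[word]` indexes a list with a string, and `line.pop([word])` passes a list where `pop` expects an integer index. The trimmed lines are also never written back to the file.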
Am I just missing something basic?
Thanks in advance!!!