I have a .txt file that contains a lot of textual information I need for my research. I'm trying to write a program that does a keyword search (in my case, for the phrase "sold salt") and then writes to a new file, line by line, the text that starts with this phrase and cuts off at some point (I haven't decided where yet). The file is actually a book containing digitized 17th-century documents written in Old Russian, but schematically the text looks like this:

"sheet_№1

text text text text

text text

text text text text text text sold salt text text text text text sold salt text text text text text text text

text text text text

sheet_№1_reverse

text text sold salt text text text text text text text text text text text text text text"

So it's a really badly structured text, and what I want is to have all the salt sale records, with their position in the whole text, in one file for my research.

Now, sorry for a long introduction, I just wanted to show what I've got to deal with.

I tried to write the code using the docx library, but it turned out that the only way it can work is if I underline the needed information in the docx file and then extract it with code, which is not too bad, but it still takes time.

So I settled on the txt format, and now I've got this:

key_1 = 'sold'
key_2 = 'salt'

f_old = open("text.txt", encoding='utf-8')
f_result = open("text_result.txt", 'w', encoding='utf-8')

for line in f_old:
    line = line.split()
    if len(line) == 1:
        for elem in range(len(line)):
            f_result.write(line[elem] + '\n')
    else:
        if key_1 in line and key_2 in line:
            for word in range(len(line)):
                if line[word] == key_1 and line[word + 1] == key_2:
                    for elem in line[word: word + 10]:
                        f_result.write(elem + ' ')
                    f_result.write('\n')

f_old.close()
f_result.close()

Based on the example above, it gives me this result:

"sheet_№1

sold salt text text text text text sold salt text

sold salt text text text text text text

sheet_№1_reverse

sold salt text text text text text text text text"

It's not a big deal to cut off by hand the second "sold salt" and the other extra text, as at the end of the 2nd line, because I will have to do that anyway for lines that contain more information than I need. However, are there any ideas for how to cut a line when my keyword shows up in it two or more times?

My idea is to open text_result not only for writing but also for reading, and then cut the lines like this:

for line in f_result:
    line = line.split()
    if len(line) > 1:
        for word in line[::-1]:
            while line[word] != key_1:
                line.pop([word])

But it doesn't work if I put it in the code like this:

key_1 = 'sold'
key_2 = 'salt'
f_old = open("text.txt", encoding='utf-8')
f_result = open("text_result.txt", 'w+', encoding='utf-8')

for line in f_old:
    line = line.split()
    if len(line) == 1:
        for elem in range(len(line)):
            f_result.write(line[elem] + '\n')
    else:
        if key_1 in line and key_2 in line:
            for word in range(len(line)):
                if line[word] == key_1 and line[word + 1] == key_2:
                    for elem in line[word: word + 7]:
                        f_result.write(elem + ' ')
                    f_result.write('\n')

for line in f_result:
    line = line.split()
    if len(line) > 1:
        for word in line[::-1]:
            while line[word] != key_1:
                line.pop([word])

f_old.close()
f_result.close()

Am I just missing some basic thing?

Thanks in advance!!!

  • How would you like the actual result to look? – Marko Oct 25 '20 at 15:12
  • @Marko, yes, you've got my idea. Thanks for your attention and the useful answer! I'm also glad you suggested trying "enumerate", since I'm a beginner in Python and in programming! =) – Artyom Husak Oct 25 '20 at 15:53

1 Answer

So, based on the information you have provided, I suppose you want to stop writing when you reach another "sold salt" and then continue writing from there. This means that while writing you just need one more check (like the one you already do) that the words going to the new file are not "sold salt"; if they are, break out of the inner loop. It would look like this:

for line in f_old:
    # Rebinding the loop variable inside the loop is confusing,
    # so I would recommend simply creating a new variable.
    line_words = line.split()
    if len(line_words) == 1:
        # No need for a for loop here: we already know there is only one element.
        f_result.write(line_words[0] + '\n')
    else:
        # We access the word + 1 element, so stop one short of the end
        # to avoid out-of-range indices.
        for word in range(len(line_words) - 1):
            if line_words[word] == key_1 and line_words[word + 1] == key_2:
                for i in range(len(line_words[word: word + 10])):
                    # Stop at the next "sold salt". Comparing a slice instead of
                    # indexing word + i + 1 is safe on the last word of the line.
                    if i != 0 and line_words[word + i: word + i + 2] == [key_1, key_2]:
                        break

                    f_result.write(line_words[word + i] + ' ')
                f_result.write('\n')

f_old.close()
f_result.close()

I would also recommend using enumerate, which gives you each word together with its index, so you only need the index to reach the element next to the one you are checking. I think it gives cleaner code.
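As a rough illustration of the enumerate suggestion, here is a minimal sketch that works on a list of lines in memory rather than on the files. The function name extract_records, the max_words parameter and the sample data are made up for this example; the limit of 10 words and the treatment of one-word lines as sheet headers follow the code in the question.

```python
def extract_records(lines, key_1="sold", key_2="salt", max_words=10):
    """Return sheet headers and 'sold salt' records, one string per output line."""
    results = []
    for line in lines:
        words = line.split()
        if len(words) == 1:
            # A one-word line is assumed to be a sheet header; keep it as-is.
            results.append(words[0])
            continue
        # enumerate yields (index, word); skip the last word because we look at i + 1.
        starts = [
            i for i, word in enumerate(words[:-1])
            if word == key_1 and words[i + 1] == key_2
        ]
        # Pair every start with the next start, or with the end of the line.
        for start, next_start in zip(starts, starts[1:] + [len(words)]):
            end = min(start + max_words, next_start)
            results.append(" ".join(words[start:end]))
    return results


# Hypothetical sample in the same shape as the text in the question.
sample = [
    "sheet_№1",
    "text text",
    "text sold salt a b c sold salt d e",
    "sheet_№1_reverse",
    "text text sold salt f",
]
for record in extract_records(sample):
    print(record)
```

Each record here ends either after max_words words or just before the next "sold salt", whichever comes first, so no trailing cleanup pass over the result file should be needed.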

Marko