I want to extract the text between {textblock_content} and {/textblock_content}.

With the script below, only the first line of introtext.txt is extracted and written to a newly created text file. I don't know why the script does not also extract the other lines of introtext.txt.

f = open("introtext.txt")
r = open("textcontent.txt", "w")
for l in f.readlines():
    if "{textblock_content}" in l:
        pos_text_begin = l.find("{textblock_content}") + 19
        pos_text_end = l.find("{/textblock_content}")
        text = l[pos_text_begin:pos_text_end]
        r.write(text)

f.close()
r.close()

How can I solve this problem?

Vickel
Sauer Lu
  • This code is looking line by line - meaning the begin and end text must be on a single line. Is that the intent? Or could there be newlines between these two sentinels? You don't check for a return of -1 on that second find. Perhaps that is involved. This code does process each line and assuming each line has the beginning and ending text, they should work. Although you may want `r.write(text + "\n")` – tdelaney Dec 16 '22 at 00:06

2 Answers

When you call file.readlines(), the file pointer reaches the end of the file, and any further call returns an empty list. If you change your code to something like one of the snippets below, it should work properly:

f = open("introtext.txt")
r = open("textcontent.txt", "w")
f_lines = f.readlines()
for l in f_lines:
    if "{textblock_content}" in l:
        pos_text_begin = l.find("{textblock_content}") + 19
        pos_text_end = l.find("{/textblock_content}")
        text = l[pos_text_begin:pos_text_end]
        r.write(text)

f.close()
r.close()

You can also implement it with a context manager, as in the snippet below:

with open("textcontent.txt", "w") as r:
    with open("introtext.txt") as f:
        for line in f:
            # Use the loop variable `line` consistently
            if "{textblock_content}" in line:
                pos_text_begin = line.find("{textblock_content}") + 19
                pos_text_end = line.find("{/textblock_content}")
                text = line[pos_text_begin:pos_text_end]
                r.write(text)
Javad
  • Thank you very much for your help. With the first one, the script stops after the first line and does not reach the end ({/textblock_content}). – Sauer Lu Dec 16 '22 at 00:48
  • The 2nd one gives me no output. But thanks a lot sir. – Sauer Lu Dec 16 '22 at 00:51

Your code actually works fine, assuming the begin and end tags are both on the same line. But I think this is not what you wanted. It can't read multiple blocks on one line, and it can't read a block that starts and ends on different lines.
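A small sketch of why the line-by-line approach goes wrong when the tags sit on different lines. The sample lines and the helper name `extract_from_line` are made up for illustration. `str.find` returns -1 when the end tag is missing, so the slice silently stops one character before the end of the line, and lines without the begin tag are skipped entirely:

```python
BEGIN = "{textblock_content}"
END = "{/textblock_content}"


def extract_from_line(line: str) -> str:
    """Replicate the per-line logic from the question for a single line."""
    start = line.find(BEGIN) + len(BEGIN)
    end = line.find(END)  # -1 if the end tag is not on this line
    return line[start:end]


# Hypothetical input: one block spread over three lines.
lines = [
    "{textblock_content}first line\n",
    "second line\n",
    "third line{/textblock_content}\n",
]

# As in the question, only lines containing the begin tag are processed.
extracted = [extract_from_line(line) for line in lines if BEGIN in line]
print(extracted)  # ['first line']
```

Only the first line survives, which matches the behaviour described in the question.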

First, take a look at the object returned by the open function. You can use its read method to access the whole text. Also take a look at with statements; they make working with files easier and safer. To rewrite your code so that it reads everything between {textblock_content} and {/textblock_content}, we can write something like this:

def get_all_tags_content(
    text: str,
    tag_begin: str = "{textblock_content}",
    tag_end: str = "{/textblock_content}"
) -> list[str]:

    useful_text = text
    ans = []

    # Heavy loop, needs some optimization
    # Runs in O(len(text) ** 2); we can do better
    while tag_begin in useful_text:
        useful_text = useful_text.split(tag_begin, 1)[1]
        if tag_end not in useful_text:
            break
        block_content, useful_text = useful_text.split(tag_end, 1)
        ans.append(block_content)
    return ans


with open("introtext.txt", "r") as f:
    with open("textcontent.txt", "w+") as r:
        r.write(str(get_all_tags_content(f.read())))

Writing this function efficiently, so that it can handle really big files, is left to you. In this implementation the remaining text is copied every time a content block is found. That is unnecessary and slows the program down (imagine millions of lines of the form {textblock_content}"hello world"{/textblock_content}: for every line we copy the whole remaining text before continuing). A single loop over the text that tracks positions avoids the copying. Try to solve it yourself.
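One possible way to avoid the copying, as a sketch rather than a definitive solution: keep a position into the original string and pass it as the start argument of `str.find`, so only the extracted blocks are ever sliced out. The function name `find_blocks` is hypothetical.

```python
def find_blocks(
    text: str,
    tag_begin: str = "{textblock_content}",
    tag_end: str = "{/textblock_content}",
) -> list[str]:
    """Return the content of every tag_begin ... tag_end block in text."""
    blocks = []
    pos = 0  # position from which the next search starts
    while True:
        start = text.find(tag_begin, pos)
        if start == -1:
            break  # no further begin tag
        start += len(tag_begin)
        end = text.find(tag_end, start)
        if end == -1:
            break  # begin tag without a matching end tag
        blocks.append(text[start:end])
        pos = end + len(tag_end)  # continue after the end tag
    return blocks


sample = (
    "intro {textblock_content}first\nblock{/textblock_content} "
    "middle {textblock_content}second{/textblock_content} outro"
)
print(find_blocks(sample))  # ['first\nblock', 'second']
```

Each character of the text is scanned a bounded number of times, so the running time grows roughly linearly with the length of the input for fixed tags.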

  • Thanks a lot. This seems to work. Thanks all for your amazing support guys! :) – Sauer Lu Dec 16 '22 at 00:59
  • What if there are more {textblock_content} {/textblock_content} tags in one .txt file. Is it possible to extract the text of the 2nd {textblock_content} {/textblock_content} too please? – Sauer Lu Dec 16 '22 at 01:02
  • We are extracting all of them. This code just writes them all out as a list. Take a look at the output file; it should contain something like `['content of first block', 'second', etc...]` – Egor shevchenko Dec 16 '22 at 01:31
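If the output file should hold one block per line rather than the text form of a Python list, the extracted blocks can be joined before writing. This is a minimal sketch: the helper name `blocks_to_lines` is hypothetical, and it assumes the blocks have already been collected into a list of strings.

```python
def blocks_to_lines(blocks: list[str]) -> str:
    """Join extracted blocks so that each block starts on its own line."""
    if not blocks:
        return ""
    # strip() removes stray whitespace around each block
    return "\n".join(block.strip() for block in blocks) + "\n"


# Hypothetical list of already extracted blocks.
blocks = ["content of first block", " second "]
output = blocks_to_lines(blocks)
print(output, end="")

# Writing it out would then look like this:
# with open("textcontent.txt", "w") as r:
#     r.write(output)
```

Note that a block which itself contains newlines will still span several lines in the output.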