Sample Text File:

1. some text here
2. more text here
more text here
more text here
more text here
3. more text here
more text here
more text here
more text here
4. more text here
more text here
more text here
more text here
5. more text here
more text here
more text here
more text here
6. last text here
more text here
more text here
more text here

1. new text here
more text here
more text here
2. some more text
more text here
3. a bit more text
more text here
4. ok this is enough text.

1. nawww heres a bit more text.
more text here
more text here
2. okay this is the final text.
more text here
more text here
3. just to be sure this is last.
more text here
1. etc

This is a sample text from what I have, but this is a lot shorter.

I have this python code as a start:

with open("text.txt") as txt_file:
    lines = txt_file.readlines()
    for line in lines:
        if line.startswith('1.'):
            print(line)

But I am stuck: I have no idea how to write all the lines from one 1. up to the next 1. into a separate file.

I'm assuming that I'd need some sort of for loop inside the last if statement I have there, but I'm not sure how to go about doing that.

Here is an example of the result I expect:

If a line starts with 1., write that line and everything after it into a new text file until the next line that starts with 1., then start the whole process over again until there is no more text. So for the sample text above I should have 4 files.

In this case, file number 1 would have all the text from paragraphs 1-6.

1. some text here
2. more text here
more text here
more text here
more text here
3. more text here
more text here
more text here
more text here
4. more text here
more text here
more text here
more text here
5. more text here
more text here
more text here
more text here
6. last text here
more text here
more text here
more text here

File number 2 would have all the text from the second 1. in the sample text file, paragraphs 1-4.

1. new text here
more text here
more text here
2. some more text
more text here
3. a bit more text
more text here
4. ok this is enough text.

File number 3 would have all the text from the third 1. in the sample text file, paragraphs 1-3.

1. nawww heres a bit more text.
more text here
more text here
2. okay this is the final text.
more text here
more text here
3. just to be sure this is last.
more text here

And so on...

I hope I'm explaining this right and in a way that makes sense.

JareBear

4 Answers

One simple approach would be to split the file at each line that starts with 1.:

import re
with open("text.txt") as txt_file:
    content = txt_file.read()
    chunks = []
    for match in re.split(r"(?=^1\.)", content, flags=re.MULTILINE):
        if match:
            chunks.append(match)

Now you have a list of texts each starting with 1. that you can iterate over and save to individual files.
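
As a rough sketch of that last step, the chunks could be written out like this. The helper names `split_on_ones` and `write_chunks`, and the `part-N.txt` naming scheme, are assumptions made for illustration; they are not part of the original answer.

```python
import re
from pathlib import Path


def split_on_ones(content):
    """Split text before every line that starts with '1.' (needs Python 3.7+)."""
    pieces = re.split(r"(?=^1\.)", content, flags=re.MULTILINE)
    # The split can produce an empty leading piece; drop it.
    return [piece for piece in pieces if piece]


def write_chunks(chunks, out_dir=".", prefix="part"):
    """Write each chunk to its own file (hypothetical naming: part-1.txt, part-2.txt, ...)."""
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        path = Path(out_dir) / f"{prefix}-{number}.txt"
        path.write_text(chunk)
        paths.append(path)
    return paths
```

For the file in the question, this would be called as `write_chunks(split_on_ones(open("text.txt").read()))`.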

Tim Pietzcker
  • I'm not sure if I am using your code wrong, but when I print `chunks` I get one element in a list, should I be expecting 4 elements? Because there are `4` `ones` in the `text.txt` file. – JareBear Mar 10 '20 at 16:48
  • Hmm, which Python version are you using? You need at least 3.7 – Tim Pietzcker Mar 10 '20 at 16:49
  • I am using `3.7.5`. I think it's the `(?=^1\.)`? Because doesn't that just add every line to the `chunk`? I changed it to `(?=1\.)` and I got 4 lists – JareBear Mar 10 '20 at 16:52
  • Hm, the multiline flag was supposed to take care of that. Without the ^, the split will occur on any 1., not just at the beginning of a line. – Tim Pietzcker Mar 10 '20 at 17:32
  • Well hey, I'm just saying that when I remove that `^` everything works as I expected: I get 4 lists and all of them have the right stuff inside them. Whereas with the `^` I just had one element in the list with basically the whole text file – JareBear Mar 10 '20 at 17:40
  • I’ve made the flags parameter explicit, does it work now? If not, the only explanation I can think of is that there is whitespace before the 1. – Tim Pietzcker Mar 10 '20 at 17:42

Here's another solution. You can tweak this as you see fit, but I found the index of every line that starts with 1. and then wrote the lines between those indexes to new files.

with open('test.txt') as f:
    lines = f.readlines()

# Index of every line that begins a new group
ones_index = []
for idx, line in enumerate(lines):
    if line.startswith('1.'):
        ones_index.append(idx)

# Sentinel: one past the last line, so the final group is written too
ones_index[len(lines):] = [len(lines)]

for i in range(len(ones_index)-1):
    start = ones_index[i]
    stop = ones_index[i+1]
    with open('newfile-{}.txt'.format(i), 'w') as g:
        # readlines() keeps each trailing newline, so join without adding one
        g.write(''.join(lines[start:stop]))

Edit: I just realized this didn't handle the very last range of lines at first. Added a new line to fix this.
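
To illustrate why that extra index matters, here is a small sketch that only computes the (start, stop) pairs. The function name `group_ranges` and its `marker` parameter are made up for this example. Without the final `len(lines)` entry, the last group would have no stop index and would be skipped.

```python
def group_ranges(lines, marker="1."):
    """Return (start, stop) index pairs, one per group beginning with `marker`."""
    starts = [idx for idx, line in enumerate(lines) if line.startswith(marker)]
    # Each group stops where the next one starts; the last one stops at the
    # end of the list, which plays the role of the sentinel index.
    stops = starts[1:] + [len(lines)]
    return list(zip(starts, stops))


sample = ["1. a\n", "b\n", "1. c\n"]
print(group_ranges(sample))  # [(0, 2), (2, 3)]
```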

Jon Behnken
  • This works really well. But when it comes to the last `1.` in the file, it doesn't write it. That isn't a big deal really, because the actual file that I want to organize like this has over 900 `1.`s, and if I only have to edit one by myself, it's no big deal. – JareBear Mar 10 '20 at 16:39
  • @JareBear See my edit. I added the line `ones_index[len(lines):] = [len(lines)]` to fix that. Essentially, you add an index to the list of indexes that represents the last line, and then write the lines ranging from the last occurring `1.` to the end of the file. – Jon Behnken Mar 10 '20 at 16:46
  • Works flawlessly! Thank you so much! You just saved me hours if not weeks from extracting text by hand :D . – JareBear Mar 10 '20 at 16:49

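Create a counter n = 0 and increase it every time a line starts with 1., so that n is always the number of the group the current line belongs to:

from collections import defaultdict

n = 0                        # number of the current group
groups = defaultdict(list)   # group number -> list of its lines
with open("text.txt") as txt_file:
    for line in txt_file:
        if line.startswith("1."):
            n += 1           # a new group starts here
        if n:                # skip anything before the first "1."
            groups[n].append(line)

If you want, you can use a dict like this to save the lines of each group as a list. Then you can create a CSV file with the pandas library. I hope this helps, if I understood the question correctly.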

  • I'm not trying to create a `.csv` file. This isn't exactly what the question was about, But I thank you very much for your contribution! The question has been answered above. – JareBear Mar 10 '20 at 18:21

If you want to avoid reading the whole file into memory, you could write a generator that collects groups line by line as they come from the file and yields each group once it is complete. Something like:

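def splitgroups(text):
    lines = None  # stays None until the first "1." line is seen
    for line in text:
        if line.startswith("1."):
            if lines is not None:
                yield lines  # the previous group is complete
            lines = line
        elif lines is not None:
            lines += line
        # anything before the first "1." is ignored
    if lines is not None:
        yield lines  # the last group in the file

filepath = "text.txt"  # the input file from the question
with open(filepath) as text:
    # iterate over groups rather than lines
    # and do what you want with each chunk:
    for group in splitgroups(text):
        print("*********")
        print(group)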
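
If the goal is one output file per group, a variation on the same idea is to write each line straight to the current output file, so that not even a whole group is held in memory. This is only a sketch: the function name `split_to_files` and the `group-N.txt` output pattern are assumptions for illustration.

```python
def split_to_files(in_path, out_pattern="group-{}.txt", marker="1."):
    """Stream in_path line by line, starting a new output file at every `marker` line.

    Returns the number of files written.
    """
    count = 0
    out = None
    try:
        with open(in_path) as src:
            for line in src:
                if line.startswith(marker):
                    if out is not None:
                        out.close()  # finish the previous group
                    count += 1
                    out = open(out_pattern.format(count), "w")
                if out is not None:  # lines before the first marker are skipped
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return count
```

Calling `split_to_files("text.txt")` on the sample from the question should produce one file per `1.` line.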
Mark
  • May I ask: let's say the text file has over 10,000 lines. Would this approach be more efficient? The file I am actually working with has thousands and thousands of lines. I'm just curious. – JareBear Mar 10 '20 at 18:19
  • @JareBear it won't necessarily be faster, but it will be more memory efficient than reading the entire file into memory and then parsing it. It's one of the main reasons Python encourages the use of iterators and generators. – Mark Mar 10 '20 at 19:03