How to split a file using a string as an identifier with Python?

Question

I have a huge text file and need to split it into several files. The text file contains an identifier that marks where to split. Here is what part of the text file looks like:

Comp MOFVersion 10.1
Copyright 1997-2006. All rights reserved.
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...

exit 
--------------------- 
list volume 
list partition 
exit
--------------------- 

Volume 0 is the selected volume.

Disk ###  Status         Size     Free     Dyn  Gpt
--------  -------------  -------  -------  ---  ---
* Disk 0    Online          238 GB   136 GB        *

-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....
SERVICE_NAME: vds 
    TYPE               : 10  WIN32_OWN_PROCESS  
    STATE              : 1  STOPPED 
    WIN32_EXIT_CODE    : 0  (0x0)
    SERVICE_EXIT_CODE  : 0  (0x0)
    CHECKPOINT         : 0x0
    WAIT_HINT          : 0x0
--------------------- 
*exit /b 0 
File not found - *.*
0 File(s) copied

-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
--------------------- 
*exit /b 0 

11/19/2021  08:34 AM    <DIR>          .
11/19/2021  08:34 AM    <DIR>          ..
11/19/2021  08:34 AM                 0 SL
               1 File(s)              0 bytes
               2 Dir(s)  80,160,923,648 bytes free

I expect the file to be split at the string "Starting The Process". So for a text file like the example above, the file would be split into 3 files, each with different content. For example:

file1
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...

exit 
--------------------- 
list volume 
list partition 
exit
--------------------- 

Volume 0 is the selected volume.

Disk ###  Status         Size     Free     Dyn  Gpt
--------  -------------  -------  -------  ---  ---
* Disk 0    Online          238 GB   136 GB        *


file2
-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....
SERVICE_NAME: vds 
    TYPE               : 10  WIN32_OWN_PROCESS  
    STATE              : 1  STOPPED 
    WIN32_EXIT_CODE    : 0  (0x0)
    SERVICE_EXIT_CODE  : 0  (0x0)
    CHECKPOINT         : 0x0
    WAIT_HINT          : 0x0
--------------------- 
*exit /b 0 
File not found - *.*
0 File(s) copied

file3
-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
--------------------- 
*exit /b 0 

11/19/2021  08:34 AM    <DIR>          .
11/19/2021  08:34 AM    <DIR>          ..
11/19/2021  08:34 AM                 0 SL
               1 File(s)              0 bytes
               2 Dir(s)  80,160,923,648 bytes free

Here is what I've tried:

logfile = "E:/DATA/result.txt"
with open(logfile, 'r') as text_file:
    lines = text_file.readlines()
    for line in lines:
        if "Starting The Process..." in line:
            print(line)

I am only able to find the lines containing the string, but I don't know how to split the content into 3 parts and write each part to a new file.

Is it possible to do it in Python? Thank you for any advice.

Yes, this should be possible in Python. See the [python regex documentation](https://docs.python.org/3/howto/regex.html), or loop through every line of the file and compare strings, or read all the text (see [how to open files](https://docs.python.org/3/library/functions.html#open)) and use `split` ([python string.split documentation](https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split)). Then use the same file-opening documentation to write the content to new files. – SpaceBurger Nov 22 '22 at 09:43
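
One way to make the "read all the text and split" suggestion concrete is sketched below. It assumes the whole file fits in memory and that every section opens with a line of at least ten dashes directly above the line containing Starting The Process. The names SECTION_START, split_on_marker and write_sections are made up for this illustration and do not come from the thread.

```python
import re

# Zero-width match at the start of a dashed line whose NEXT line contains the marker.
# Assumption: the band is 10 or more dashes, optionally followed by spaces or tabs.
SECTION_START = re.compile(
    r"^(?=-{10,}[ \t]*\r?\n[^\n]*Starting The Process)",
    flags=re.MULTILINE,
)


def split_on_marker(text):
    """Return one string per section; text before the first section is dropped."""
    # Splitting on an empty (zero-width) match requires Python 3.7 or newer.
    parts = SECTION_START.split(text)
    return parts[1:]


def write_sections(sections, prefix="file"):
    """Write each section to prefix1.txt, prefix2.txt, ... (hypothetical naming)."""
    for number, section in enumerate(sections, start=1):
        with open(f"{prefix}{number}.txt", "w") as out:
            out.write(section)
```

Because the pattern only matches a dashed line that is immediately followed by the marker line, the shorter dashed separators inside a section do not start a new file.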

Tim Biegeleisen · Answer 1 · 2022-11-22T10:17:07.033

1

Well if the file is small enough to comfortably fit into memory (say 1GB or less), you could read the entire file into a string and then use re.findall:

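import re

# Read the whole file into memory.
with open('data.txt', 'r') as file:
    data = file.read()

# Each match starts at a dashed band, covers the timestamp line and its closing band,
# and runs up to the next run of 10+ dashes (or the end of the text).
parts = re.findall(r'-{10,}[^-]*\n\w{3} \d{2}\/\d{2}\/\d{4}.*?-{10,}.*?(?=-{10,}|$)', data, flags=re.S)

# Write each match to its own numbered file.
for cnt, part in enumerate(parts, start=1):
    with open('file ' + str(cnt), 'w') as output:
        output.write(part)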
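
If the file is too large to read into a single string, a line-by-line pass is one alternative. The sketch below is not part of the answer above: it keys on the marker text rather than on the dashes, and holds back one line so that the dashed band above each marker line goes into the new section. The names MARKER, is_dash_band, iter_sections and split_file, and the file1.txt style of output names, are assumptions made for this illustration.

```python
from pathlib import Path

MARKER = "Starting The Process"


def is_dash_band(line):
    """True for a line made only of dashes (ignoring surrounding whitespace)."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"-"}


def iter_sections(lines, marker=MARKER):
    """Yield one section at a time; each starts at the band above a marker line."""
    current = None  # lines of the section being built; None before the first marker
    held = None     # previous line, held back until we know which section owns it
    for line in lines:
        if marker in line:
            band = held if held is not None and is_dash_band(held) else None
            if current is not None:
                if band is None and held is not None:
                    current.append(held)
                yield "".join(current)
            current = [band, line] if band is not None else [line]
            held = None
        else:
            if held is not None and current is not None:
                current.append(held)
            held = line
    if current is not None:
        if held is not None:
            current.append(held)
        yield "".join(current)


def split_file(src, out_dir="."):
    """Write each section of src to out_dir/file1.txt, file2.txt, ...; return the count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(src, "r") as fh:
        for count, section in enumerate(iter_sections(fh), start=1):
            (out_dir / f"file{count}.txt").write_text(section)
    return count
```

Only one section is kept in memory at a time, and anything before the first marker line (such as the copyright header) is discarded.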

edited Nov 22 '22 at 10:17

answered Nov 22 '22 at 09:38

Tim Biegeleisen


Thank you @Tim Biegeleisen, but this splits the file on the `--------------------- ` identifier. I expect the file to be split on the string `Starting The Process...`, because the file contains some `--------------------- ` lines that should not start a new file. – Cheries Nov 22 '22 at 10:06
Try my updated answer which uses the `-----` bands in a more specific way, requiring a single line only in between them. – Tim Biegeleisen Nov 22 '22 at 10:17
I tried it, but it still splits at unexpected places :( – Cheries Nov 22 '22 at 10:29
Then you need to edit your question and reveal the data which breaks the trend of what you posted. I can't answer to that which I cannot see. – Tim Biegeleisen Nov 22 '22 at 10:49
@Tim Biegeleisen I updated my question. I appreciate your response. Thanks – Cheries Nov 23 '22 at 02:01

ChaoS Adm · Answer 2 · 2022-11-24T03:40:56.387

0

An alternative solution if the dashes in the file are of fixed length could be:

with open('file.txt', 'r') as f:
    # Split on the 50-dash band that surrounds each timestamp line.
    split_text = f.read().split('--------------------------------------------------')
split_text.pop(0)  # To remove the Copyright message at the start

# Elements now alternate between a timestamp line and its content, so join them in pairs.
for i in range(0, len(split_text) - 1, 2):
    with open(f'file{i // 2}.txt', 'w') as temp:
        temp_txt = ''.join(split_text[i:i+2])
        temp.write(temp_txt)

Essentially, I am just splitting on the basis of those dashes and joining every consecutive element. This way you keep the info about the timestamp with the content in each file.
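
To see what the pairing does, here is a small self-contained run on a made-up sample. It assumes, as the answer does, that the 50-dash band appears only around the timestamp lines; the sample text and the variable names BAND, sample, pieces and sections are invented for this illustration.

```python
BAND = "-" * 50  # assumed: this exact band appears only around the timestamp lines

sample = (
    "Copyright header\n"
    + BAND + " \nMon 11/19/2022 - Starting The Process...\n" + BAND + " \n"
    + "first body\n" + "-" * 21 + " \nstill first body\n"
    + BAND + " \nTue 11/20/2022 - Starting The Process...\n" + BAND + " \n"
    + "second body\n"
)

pieces = sample.split(BAND)
pieces.pop(0)  # drop the header before the first band

# pieces now alternates: timestamp line, body, timestamp line, body, ...
sections = ["".join(pieces[i:i + 2]) for i in range(0, len(pieces) - 1, 2)]

for number, section in enumerate(sections):
    print(f"--- file{number}.txt ---")
    print(section)
```

Two things follow from this design: `split` removes the 50-dash bands themselves, so they do not appear in the output files, and any other run of 50 or more dashes in the real log would shift the pairing and produce extra or misaligned files.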

edited Nov 24 '22 at 03:40

answered Nov 22 '22 at 09:54

ChaoS Adm


Hi @ChaoS Adm, thanks for your answer. I have already updated my question and my expected result. – Cheries Nov 24 '22 at 02:26
@Cheries the code above is still working if I use the file you provided as reference since the length of the dashes around "Starting the Process" is distinct from the rest of the content. I slightly edited the code. Try if this works! – ChaoS Adm Nov 24 '22 at 03:41
I tried it, but when I use that dash line to split the content, it also splits content that has no "Starting The Process..." line, and I end up with more than 3 output files. There should be 3 output files, because only 3 dash lines go with Starting The Process... – Cheries Nov 24 '22 at 06:04
For me, each file looks like this (which seems to be what you are trying to do): Mon 11/19/2022 8:34:22.35 - Starting The Process... There are a lot of content here ... exit --------------------- list volume list partition exit --------------------- Volume 0 is the selected volume. Disk ### Status Size Free Dyn Gpt -------- ------------- ------- ------- --- --- * Disk 0 Online 238 GB 136 GB * – ChaoS Adm Nov 25 '22 at 07:24
@Cheries kindly share the output after running the exact script above (including the for loop after the split) – ChaoS Adm Nov 25 '22 at 07:32
