4

Let's say I have this file:

1
17:02,111
Problem report related to
router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data

I want this output:

1
17:02,111
Problem report related to router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk now due to compromised data

I've been trying in bash and got close to a solution, but I don't know how to do this in Python.

Thank you in advance

aDoN

3 Answers

4

If you want to remove the extra lines:

To do this, check two conditions for each line: keep the line if it is not followed by an empty line, or if it is preceded by a line that matches the regex ^\d{2}:\d{2},\d{3}\s$.

To look at the next line on each iteration, you can create a second iterator over the file, named temp, with itertools.tee and call next on it once so that it runs one line ahead. Then use re.match to test the previous line against the regex.

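from itertools import tee
import re

with open('ex.txt') as f, open('new.txt', 'w') as out:
    temp, f = tee(f)
    next(temp)  # temp now runs one line ahead of f
    pre = ''    # previous line; empty before the first line
    try:
        for line in f:
            # keep the line if the next line is not blank,
            # or if the previous line was a timestamp
            if next(temp) != '\n' or re.match(r'^\d{2}:\d{2},\d{3}\s$', pre):
                out.write(line)
            pre = line
    except StopIteration:
        # temp is exhausted at the last line, which is therefore not written
        pass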

result :

1
17:02,111
Problem report related to

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
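
The look-ahead trick with itertools.tee can also be seen in isolation. The following is only a minimal sketch: the helper name with_next is hypothetical (it is not part of itertools), and unlike the loop above it pairs the last item with None instead of stopping early.

```python
from itertools import tee

def with_next(iterable):
    """Yield (current, following) pairs; the last item is paired with None."""
    current, ahead = tee(iterable)
    next(ahead, None)  # advance the second iterator by one item
    for item in current:
        yield item, next(ahead, None)

lines = ["a\n", "b\n", "\n", "c\n"]
pairs = list(with_next(lines))
```

Each pair gives a line together with the line that follows it, which is the information the condition on next(temp) relies on.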

If you want to concatenate the remaining lines onto the third line:

If you want to join every line after the third line of a block onto that third line, you can use the following regex to find all blocks that are followed by \n\n or by the end of the file ($):

r"(.*?)(?=\n\n|$)"

Then split each block at the line that is in time format and write the parts to your output file, replacing the newlines within the third part with spaces:

ex.txt:

1
17:02,111
Problem report related to
router
another line


2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data
line 5
line 6
line 7

Demo :

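import re

def splitter(s):
    # Yield each non-empty block; blocks are separated by blank lines
    for x in re.finditer(r"(.*?)(?=\n\n|$)", s, re.DOTALL):
        g = x.group(0).strip('\n')  # drop the separator newlines
        if g:
            yield g

with open('ex.txt') as f, open('new.txt', 'w') as out:
    for block in splitter(f.read()):
        # The capturing group keeps the timestamp line in the split result
        first, second, third = re.split(r'(\d{2}:\d{2},\d{3}\n)', block)
        out.write(first + second + third.replace('\n', ' ') + '\n')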

result :

1
17:02,111
Problem report related to router another line
2
17:05,223
Restarting the systems
3
18:02,444
Must erase hard disk now due to compromised data line 5 line 6 line 7

Note :

In this answer splitter is a generator, so the blocks are produced one at a time instead of being collected in a list first. Note that the file itself is still read into memory in one go by f.read().
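
For comparison, a variant that keeps the blank line between blocks, as in the output requested in the question, might look like the sketch below. It is not the code from this answer: the function name join_text_lines and the TIMESTAMP constant are made up for illustration, and it assumes every block has the index on its first line and the time on its second.

```python
import re

# Assumption: the time line looks like 17:02,111
TIMESTAMP = re.compile(r"^\d{2}:\d{2},\d{3}$")

def join_text_lines(text):
    """Collapse the text lines of each block into a single line."""
    result = []
    # Blocks are separated by one or more blank lines
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.split("\n")
        # Only touch blocks shaped like: index, time, text...
        if len(lines) > 2 and TIMESTAMP.match(lines[1]):
            lines = lines[:2] + [" ".join(lines[2:])]
        result.append("\n".join(lines))
    return "\n\n".join(result) + "\n"

sample = (
    "1\n17:02,111\nProblem report related to\nrouter\n\n"
    "2\n17:05,223\nRestarting the systems\n\n"
    "3\n18:02,444\nMust erase hard disk\nnow due to compromised data\n"
)
joined = join_text_lines(sample)
```

Splitting on runs of newlines avoids relying on how empty regex matches are handled, which differs between Python versions.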

Mazdak
  • @aDoN I updated the answer with an approach that reads the file directly, so in that case you don't need to use `cat` and a pipe. – Mazdak Jun 19 '15 at 10:06
  • Correct me if I am wrong, but isn't your output wrong? The third lines are removed, but he wants them to be appended to the second, right? – The6thSense Jun 19 '15 at 11:08
  • @VigneshKalai Yeah, thanks for the reminder. It seems I missed that, or the OP has edited the question! – Mazdak Jun 19 '15 at 12:01
  • Nice answer though :p – The6thSense Jun 19 '15 at 12:02
  • That solution works wonders, but I have a question: what does `(.*?)` do? I mean here, in `"(.*?)(?=\n\n|$)"`, because I guess the x.group(0) values are the ones that match `\n\n|$`. Thanks – aDoN Jun 19 '15 at 16:55
  • @aDoN `(?=)` is a positive look-ahead, so `(.*?)(?=\n\n|$)` will match anything (`(.*?)`) that is followed by two newline characters or the end of your string (`(?=\n\n|$)`)! – Mazdak Jun 19 '15 at 16:58
  • I am not sure why not `(*)(?=\n\n|$)`, for example (I know it doesn't work, but I don't know why). I don't understand the `?` at the end nor the `.` at the beginning. – aDoN Jun 19 '15 at 17:03
  • Because `*` matches 0 or more of the preceding token, you need to specify a token for it. In this case we used the dot `.`, which matches any character! – Mazdak Jun 19 '15 at 17:07
  • So what about `?` and `*` ? Thank you. Totally based on your solution I got to this: `import sys import re text = sys.stdin.read() text_splitted = re.split(r'\n\n|$', text) for block in text_splitted: first,second,third = re.split(r'(\d{2}:\d{2}:\d{2},\d{3}\n)',block) print first,second,third.replace('\n',' ')+'\n'` – aDoN Jun 19 '15 at 17:20
  • @aDoN It's non-greedy; read more here: http://www.rexegg.com/regex-quantifiers.html#lazy_solution – Mazdak Jun 19 '15 at 19:27
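
To make the difference between `.*` and `.*?` from the comments above concrete, here is a small sketch. The sample string is made up for illustration; both patterns use the same look-ahead as the answer.

```python
import re

text = "first block\n\nsecond block\n\nthird block"

# Greedy: .* first grabs everything, so the look-ahead is satisfied
# at the end of the string and the whole text is matched.
greedy = re.match(r"(.*)(?=\n\n|$)", text, re.DOTALL).group(1)

# Lazy: .*? grows one character at a time and stops at the first
# position where the look-ahead succeeds (the first blank line).
lazy = re.match(r"(.*?)(?=\n\n|$)", text, re.DOTALL).group(1)
```

Here greedy is the entire string, while lazy is only "first block", which is why the lazy form is the one that splits the text into blocks.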
2

This works only if the file follows the format of your sample.

Note:

There may be a faster and simpler way using a regex, but I wanted to do it with plain logic.

Code:

# Read the whole input and split it into lines
with open("output.txt", "r") as inp:
    lines = inp.read().split("\n")

tempString = ""  # text lines collected for the current block
output = []

for s in lines:
    if s:
        # Heuristic: a line containing a letter is a text line
        if any(c.isalpha() for c in s):
            tempString = tempString + " " + s
        else:
            # Index or time line: flush any collected text first
            if tempString:
                output.append(tempString.strip())
                tempString = ""
            output.append(s)
    else:
        # A blank line ends the block
        if tempString:
            output.append(tempString.strip())
            tempString = ""
        output.append("")
if tempString:
    output.append(tempString.strip())

print("\n".join(output))
with open("newoutput.txt", "w") as out:
    out.write("\n".join(output))

Input:

1
17:02,111
Problem report related to
2 router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data

4
17:02,111
Problem report related to
router

output:

1
17:02,111
Problem report related to 2 router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk now due to compromised data

4
17:02,111
Problem report related to router
The6thSense
1
import re

x="""1
17:02,111
Problem report related to
router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data
or something"""

def repl(matchobj):
    # Keep the index and time lines, join the remaining lines with spaces
    ll=matchobj.group().split("\n")
    return "\n".join(ll[:2])+"\n"+" ".join(ll[2:])

print(re.sub(r"\b\d+\n\d+:\d+,\d+\b[\s\S]*?(?=\n{2}|$)",repl,x))

You can use re.sub with your own custom replacement function.
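
As a smaller illustration of how re.sub calls a replacement function, consider the sketch below. The function name double and the sample sentence are made up; the point is only that the function receives each match object and returns the text that replaces it.

```python
import re

def double(match):
    # match.group() is the matched text; the return value replaces it
    return str(int(match.group()) * 2)

result = re.sub(r"\d+", double, "3 apples and 10 pears")
```

After this runs, result is "6 apples and 20 pears".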

vks