-2

I have a very large text file that looks like this:

A T T A G C A
A AT A G C A
T TT AG G A
G T T A G C A

Every character should be separated by \t, but some characters are joined together. I want to insert \t between the characters in these runs. The result I need looks like this:

A T T A G C A
A A T A G C A
T T T A G G A
G T T A G C A

How can I do this in Python? I would also like to make full use of my computer's memory to speed up the process.

Deduplicator

2 Answers

1

Assuming the input is stored in in.txt, an elegant solution would be:

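import re

# Stream the file line by line so the whole file never has to fit in memory.
with open('in.txt') as fin, open('out.txt', 'w') as fout:
    for line in fin:
        # re.findall(r'\w', line) yields each word character on its own,
        # dropping the existing tabs and the newline; rejoin with single tabs.
        fout.write('\t'.join(re.findall(r'\w', line)) + '\n')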

The output is stored in the file out.txt.
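To see what the regex does before running it on the full file, the same expression can be tried on a couple of rows from the question. This is only a minimal sketch: the helper name `split_bases` and the in-memory `sample` list are made up for illustration, and the rows assume the columns in the question are tab-separated. Note that `\w` matches letters, digits and the underscore, so any other symbol in the data would be dropped.

```python
import re

def split_bases(line):
    """Return the line with a single tab between every word character."""
    # r'\w' matches one letter, digit, or underscore at a time, so the
    # existing tabs and the trailing newline are discarded and fused runs
    # such as 'AT' come back as separate items.
    return '\t'.join(re.findall(r'\w', line))

# Two rows taken from the question, with fused characters in each.
sample = ["A\tAT\tA\tG\tC\tA\n", "T\tTT\tAG\tG\tA\n"]
fixed = [split_bases(row) for row in sample]
for row in fixed:
    print(repr(row))
```

Because the helper works on one line at a time, it can be dropped into the file loop above unchanged.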

Ébe Isaac
  • Found 2 issues with this answer. **1.** Open out.txt for writing: `... open('out.txt', 'w') as fout:`. **2.** Use fout instead of out in the write stmt: `fout.write(...)` – clp2 Apr 30 '18 at 20:45
  • @clp2 Thanks for the suggestion! I've made the appropriate corrections. As a member of the site you may suggest such corrections as an edit to the post. – Ébe Isaac May 01 '18 at 06:17
0

I would probably write a copy of the original file like so.

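with open('in.txt') as fin, open('out.txt', 'w') as fout:
    prev_char = None
    while True:
        c = fin.read(1)  # read one character at a time
        if not c:        # an empty string means end of file
            break
        # Insert a tab only between two adjacent ordinary characters;
        # tabs and newlines already act as separators.
        if prev_char and prev_char not in '\t\n' and c not in '\t\n':
            fout.write('\t')
        fout.write(c)
        prev_char = c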
pscuderi
  • Thanks very much, but this code has a problem: the first line is fine, but every following line gets a \t added at its start. – Zaichao Sheng Nov 14 '16 at 08:28
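The leading tab mentioned in the comment above appears when the newline is not treated as a separator: the first character of a line then follows a `\n`, which is not a tab, so a tab gets inserted. A minimal sketch of the character-by-character loop that treats `\n` as a separator too is shown here on in-memory streams so it can be checked without touching files. The helper name `insert_tabs` and the sample text are hypothetical.

```python
import io

def insert_tabs(src, dst):
    """Copy src to dst, adding a tab between adjacent ordinary characters."""
    separators = ('\t', '\n')
    prev_char = None
    while True:
        c = src.read(1)
        if not c:  # empty string signals end of input
            break
        # Only insert a tab when both neighbours are ordinary characters.
        # A newline counts as a separator, so the next line does not
        # start with an extra tab.
        if prev_char is not None and prev_char not in separators and c not in separators:
            dst.write('\t')
        dst.write(c)
        prev_char = c

# Hypothetical sample: three short lines, two of them with fused characters.
src = io.StringIO("A\tT\tT\nA\tAT\nT\tTT\tAG\n")
dst = io.StringIO()
insert_tabs(src, dst)
result = dst.getvalue()
print(repr(result))
```

The same function should work with real file objects opened in text mode, although reading one character at a time is likely to be slower than a line-based approach on a very large file.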