Merge two lines generated from contigs.fa

Question

I have a file generated by an assembler. It looks like the following:

>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAAC
CAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTAT
ACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGA
ACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTAT
TCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTG
TCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGT
CTTCC

I want to merge the sequence lines using Python or the Linux sed command, so that the result looks like this:

>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

That is, each sequence should be on a single line, with the node name on its own line above it.

Welcome to Stack Overflow! It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. — Martijn Pieters, Dec 21 '12 at 10:43
@MartijnPieters the question mark was a typo, I think `>` are part of the file, looks like [FASTA](http://en.wikipedia.org/wiki/FASTA_format) to me — Chris Seymour, Dec 21 '12 at 10:51
Yes, show us some love by adding the code you have come up with so far. — hochl, Dec 21 '12 at 10:51
@sudo_O: Ah, yes, I thought they looked familiar (from other questions on SO, not a geneticist myself). — Martijn Pieters, Dec 21 '12 at 10:52
the code i used is following... f=open('contigsss.fa','r') lines=f.readlines() g=open('contigser.fa','wb') y=str(''.join(lines)) finish=() for i in range(0,len(y)): if (y[i] is '\n') and (y[i-1] is 'A'): finish.append('') if (y[i] is '\n') and (y[i-1] is 'T'): finish.append('') if (y[i] is '\n') and (y[i-1] is 'G'): finish.append('') if (y[i] is '\n') and (y[i-1] is 'C'): finish.append('') else: finish.append(y[i]) for i in finish: g.write(str(i)) — user1921307, Dec 21 '12 at 11:34
You probably shouldn't do that. Newlines in sequences are allowed by FASTA format. Use a real existing FASTA parser to read the files (or develop one for exercise). Also, please add your code with proper formatting directly to your question using the [edit] link. — Lev Levitsky, Dec 21 '12 at 11:40
please add the example code into the text of your question, not as a comment. — hochl, Dec 21 '12 at 11:43
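Following the suggestion in the comments to read the file as FASTA rather than stripping newlines by character, here is a minimal sketch of a reader written for exercise. The function name `read_fasta` and the shortened sample records are illustrative assumptions, not part of the original thread, and an established FASTA parser is likely the safer choice for real data.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if header is not None:
                # A new header closes the previous record.
                yield header, ''.join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
        # Lines before the first header are ignored in this sketch.
    if header is not None:
        yield header, ''.join(chunks)


if __name__ == '__main__':
    # Shortened, made-up records in the same shape as the question's input.
    sample = ['>node_a\n', 'TACTCC\n', 'CTTCC\n', '>node_b\n', 'GGA\n']
    for header, sequence in read_fasta(sample):
        print(header)
        print(sequence)
```

Because an open file is itself an iterable of lines, `read_fasta(open('contigser.fa'))` should work the same way on the real input.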

Chris Seymour · Answer 1 · 2012-12-21T12:06:57.693

2

A small pipe of tr and sed would do this:

$ tr -d '\n' < contigser.fa | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > newfile.fa

In python:

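# Read everything first, then rewrite the same file in place.
with open('contigser.fa', 'r+') as fasta:
    lines = fasta.read().splitlines()

    fasta.seek(0)
    fasta.truncate()

    in_sequence = False
    for line in lines:
        if line.startswith('>'):
            # End the previous sequence before starting a new header.
            if in_sequence:
                fasta.write('\n')
            fasta.write(line + '\n')
            in_sequence = False
        else:
            fasta.write(line)
            in_sequence = True
    # Terminate the last sequence; the with block then flushes and closes the file.
    if in_sequence:
        fasta.write('\n')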

Note: the python solution stores the changes back to contigser.fa.

edited Dec 21 '12 at 12:06

answered Dec 21 '12 at 10:57


sure this wouldn't give `>NODE_2_length_85_cov_19.094118TACTCCTGAGCACTTTGTGCTCTTAGTTCT ...` in the output on one line? – hochl Dec 21 '12 at 10:59
but generally the idea is good ;-) I think you need to include some awkishness. – hochl Dec 21 '12 at 11:01
@hochl fixed, just needed to make `sed` grab the whole line. – Chris Seymour Dec 21 '12 at 11:06
hochl is right. In most cases the sequence ends up merged onto the node line. I used cat: `cat contigser.fa | tr -d '\n' > contigsers.fa`, and it converts everything to a single line. I think it might help if I put a \n after the numbers in cov_... – user1921307 Dec 21 '12 at 11:36
Why did you run `cat contigser.fa | tr -d '\n' > contigsers.fa`? You left out the `sed` command. Run `cat contigser.fa | tr -d '\n' | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > contigsers.fa` or use the Python solution I added. – Chris Seymour Dec 21 '12 at 11:42
The Python code you gave just erases everything, and I'm getting an empty file. – user1921307 Dec 21 '12 at 11:49
Let me guess, you're getting an empty file called `file`? You want `open('YOURFILENAME','r+')` so `file = open('contigser.fa','r+')` I have added your filename in the code. Be aware this edits the file not creates a new one. – Chris Seymour Dec 21 '12 at 11:54
I opened the file and it is still empty; the code just erases everything. The code given by hochl works fine, but can anyone tell me how to write what it prints to a new file? – user1921307 Dec 21 '12 at 11:56
Both of my solutions are tested and work. First check that `contigser.fa` is **not** blank, then copy and paste this: `cat contigser.fa | tr -d '\n' | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > newfile.fa`. This will create the file you want in `newfile.fa`. – Chris Seymour Dec 21 '12 at 12:00

hochl · Accepted Answer · 2012-12-21T14:52:33.820

1

You can use awk to do the job:

awk < input_file '/^>/ {print ""; print; next} {printf "%s", $0} END {print ""}'

This starts only one process (awk). The only drawback is that it adds an empty first line. You can avoid that by adding a state variable (the code belongs on one line; it is split here only to make it more readable):

awk < input_file '/^>/ { if (flag) print ""; print; flag=0; next }
    { printf "%s", $0; flag=1 } END { if (flag) print "" }'

To store the result in a new file, redirect the output:

awk < input_file > output_file '/^>/ { .... }'
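For readers who prefer Python, the same state-variable idea can be sketched as a streaming copy from one file object to another. The function name `unwrap_fasta` and the tiny in-memory sample are assumptions made for illustration. Unlike the awk version, this sketch skips empty lines instead of treating them as sequence data.

```python
import io
from typing import TextIO


def unwrap_fasta(src: TextIO, dst: TextIO) -> None:
    """Copy FASTA text from src to dst with each sequence on one line."""
    in_sequence = False  # plays the role of `flag` in the awk script
    for raw in src:
        line = raw.rstrip('\r\n')
        if line.startswith('>'):
            if in_sequence:
                dst.write('\n')  # finish the previous sequence line
            dst.write(line + '\n')
            in_sequence = False
        elif line:
            dst.write(line)  # append without a newline
            in_sequence = True
    if in_sequence:
        dst.write('\n')  # finish the last sequence line


if __name__ == '__main__':
    demo_in = io.StringIO('>seq1\nACGT\nAC\n>seq2\nTT\n')
    demo_out = io.StringIO()
    unwrap_fasta(demo_in, demo_out)
    print(demo_out.getvalue(), end='')
```

With real files, the call would look like `with open('input_file') as src, open('output_file', 'w') as dst: unwrap_fasta(src, dst)`, which leaves the input untouched and never holds the whole file in memory.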

edited Dec 21 '12 at 14:52

answered Dec 21 '12 at 11:34


The synopsis for printf is `printf format,data`. Using it as `printf data` means the `data` is taken as your format string with no data and so will mess up badly if your input contains %s or any other printf formatting characters. You need `printf "%s",$0` rather than `printf $0`. – Ed Morton Dec 21 '12 at 14:32

score 0 · Answer 3 · answered Dec 21 '12 at 14:40

$ awk '/^>/{printf "%s%s\n",(NR>1?ORS:""),$0; next} {printf "%s",$0} END{print ""}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

score 0 · Answer 4 · answered Dec 21 '12 at 15:57

$ awk 'NR==1;ORS="";{sub(/>.*$/,"\n&\n");print (NR>1)?$0:""}END{print"\n"}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

potong · Answer 5 · 2012-12-22T09:02:19.627

0

This might work for you (GNU sed):

sed '/^>/n;:a;$!N;s/\n\([^>]\)/\1/;ta;P;D' file

Following a line beginning with >, delete any newline that precedes a character other than >.
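The same rule can be approximated in Python with a regular expression, shown here as a hedged sketch rather than an exact translation of the sed script. The helper name `join_wrapped` is an assumption, and the whole input is expected to fit in memory as one string.

```python
import re

# A non-header line and its newline, but only when the next line exists
# and is not a header (and is not empty).
_WRAPPED = re.compile(r'^([^>\n].*)\n(?=[^>\n])', re.MULTILINE)


def join_wrapped(text: str) -> str:
    """Remove the newlines that split a sequence across several lines."""
    return _WRAPPED.sub(r'\1', text)


if __name__ == '__main__':
    wrapped = '>h1\nAAA\nCCC\n>h2\nGG\nTT\n'
    print(join_wrapped(wrapped), end='')
```

Header lines never match because the pattern requires the line to start with something other than `>`, so the newline after each header is kept.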

edited Dec 22 '12 at 09:02

answered Dec 22 '12 at 08:47

