I have a multiple sequence alignment file in which the lines from the different sequences are interleaved, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this:

TGFb3_human_used_for_docking        ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY


TGFb3_human_used_for_docking        LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN              LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF              LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA              LRSADTTHST-

Each line begins with a sequence identifier, followed by a sequence of characters (in this case describing the amino acid sequence of a protein). Each sequence is split across several lines, so the first sequence (with ID TGFb3_human_used_for_docking) occupies two lines. I want to convert this to a format in which each sequence has a single line, like this:

TGFb3_human_used_for_docking        ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA              ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-

(In this particular example the sequences are almost identical, but in general they aren't!)

How can I convert from multi-line multiple sequence alignment format to single-line?

seaotternerd
a06e
  • If you don't want to mess with scripting, you can open the file in an alignment editor (I recommend aliview) and save it in phylip format (relaxed, non-interleaved). – heathobrien May 11 '15 at 09:18
  • Just a word of caution. I've written code to convert aln files to single lines and found that a lot of software that reads aln files will crash or freeze if a line is too long. Some software appears to use 2000bp buffers. – dkatzel May 11 '15 at 19:10
  • Non-scripted editor-only solution: Open it in [vim](http://www.vim.org). Go to start of the second set of reads. Hit `Ctrl V` (i.e. both keys at the same time for visual-block mode). Move cursor to highlight sequence block. Hit `x` (i.e. cut block). Go to the end of the first sequence block. Hit `p` (i.e. paste block). Repeat for the remaining blocks. Once you've gotten all the sequences in one line, go to the line just after your last sequence. Hit `d` then `G` to delete to the end of the file. Hit `:` then `w` then `q` then `Enter` to write (save) and quit. – Christopher Bottoms May 12 '15 at 12:20

2 Answers

0

It looks like you need to write a script of some sort to achieve this. Here's a quick example I wrote in Python. It won't line the whitespace up prettily like in your example (if you care about that, you'll have to mess around with formatting), but it gets the rest of the job done.

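# Create a dictionary to accumulate full sequences.
# Plain dicts keep insertion order in Python 3.7+, so the output
# preserves the order in which identifiers first appear.
full_sequences = {}

# Loop through the original file (replace test.txt with your file name)
# and append each sequence chunk to the appropriate dictionary entry.
with open("test.txt") as infile:
    for line in infile:
        # Skip blank lines, Clustal conservation lines (which start with
        # whitespace) and the "CLUSTAL ..." header line, if present.
        if not line.strip() or line[0].isspace() or line.startswith("CLUSTAL"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        full_sequences[fields[0]] = full_sequences.get(fields[0], "") + fields[1]

# Now loop through the dictionary and write each entry as a single line.
# A separate output file is used so the input is not overwritten.
with open("test_single_line.txt", "w") as outfile:
    for seq_id, sequence in full_sequences.items():
        outfile.write(seq_id + "\t\t" + sequence + "\n")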
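
If you do want the sequences to start in the same column, as in the question, one possible approach is to pad each identifier to the length of the longest one. This is only a sketch: the helper name `format_aligned`, its `gap` parameter and the sample dictionary are made up for illustration, and it assumes a dictionary shaped like `full_sequences` above (identifier mapped to full sequence).

```python
def format_aligned(full_sequences, gap=4):
    """Return one line per sequence, padded so all sequences start in the same column."""
    if not full_sequences:
        return []
    # Column width: longest identifier plus a few spaces of separation.
    width = max(len(seq_id) for seq_id in full_sequences) + gap
    return [seq_id.ljust(width) + sequence
            for seq_id, sequence in full_sequences.items()]


# Hypothetical sample data standing in for the dictionary built from the file.
example = {
    "TGFb3_human_used_for_docking": "ALDTNYCFRNLRSADTTHST-",
    "tr|B3KVH9|B3KVH9_HUMAN": "ALDTNYCFRNLRSADTTHST-",
}

for line in format_aligned(example):
    print(line)
```

To use it with the script above, write each returned line followed by `"\n"` instead of the tab-separated string.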
seaotternerd
0

I recommend using trimAl to convert the alignment to the phylip_paml format.

In your case, you would run the following command:

trimal -in inputfile -out outputfile -phylip_paml
juan trinidad