R function for pairwise comparison of protein sequences

Question

I have 5,000 protein sequence .fasta files. Each file contains 2 different transcripts of one gene, with the sequences split by protein domain.

The file for one gene looks like this: [image]

I would like to run a protein BLAST (BLASTp) on the sequences that belong to the same domain, and get an output for all the domains, for all the files. I can do this by hand on the NCBI BLAST website by copying each sequence in: https://blast.ncbi.nlm.nih.gov/Blast.cgi

The output looks like this: [image]

The closest I have come in code is reading the sequences into R and using the msa package. However, msa() treats every domain sequence as an individual sequence and aligns all 4 sequences together, instead of aligning the 2 sequences of each domain separately. The different domains obviously do not align with each other, so the result is incorrect.

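library(msa)  # also attaches Biostrings, which provides readAAStringSet()

# Reads all 4 sequences in the file (2 transcripts x 2 domains)
mySequences <- readAAStringSet("ESRRB_1.fasta")

# msa() aligns all 4 sequences together, regardless of domain
myFirstAlignment <- msa(mySequences)

myFirstAlignment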

The results were:

myFirstAlignment
CLUSTAL 2.1

Call: msa(mySequences)

MsaAAMultipleAlignment with 4 rows and 180 columns
    aln                                                                                   names
[1] IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQ...DYELSQRHEEPWRTGKLLLTLPLLRQTAAKAVQHFYSVKLQ Hormone_recep_1_E...
[2] IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQ...DYELSQRHEEPWRTGKLLLTLPLLRQTAAKAVQHFYSVKLQ Hormone_recep_1_E...
[3] -----RLC--------LVCG--DIASGYH---YGVASCEAC...----------------------------------------- zf-C4_1_ENST00000...
[4] -----RLC--------LVCG--DIASGYH---YGVASCEAC...----------------------------------------- zf-C4_1_ENST00000...
Con ??????LC???????????G??????G??????G???????...????????????????????????????????????????? Consensus
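One possible way around the mixed alignment shown above is to group the sequences by domain first and then align only the two sequences within each group. The sketch below is not a tested solution for the real files. It assumes every FASTA header has the form `<domain>_ENST...` (as the truncated names in the output suggest), that each domain occurs exactly twice per file, and that a Smith-Waterman local alignment from `pairwiseAlignment()` is an acceptable stand-in for BLASTp (it is not the BLAST heuristic, and it reports no E-values). The function name `compare_domains` is made up for illustration.

```r
# Assumption: the Bioconductor package Biostrings is installed.
library(Biostrings)

# From Bioconductor 3.19 on, pairwiseAlignment() and pid() live in the
# pwalign package; on older releases they are part of Biostrings itself.
if (requireNamespace("pwalign", quietly = TRUE)) library(pwalign)

# Hypothetical helper: align the two sequences of each domain in one file.
compare_domains <- function(fasta_path) {
  seqs <- readAAStringSet(fasta_path)

  # Assumption: headers look like "<domain>_ENST...", so removing everything
  # from "_ENST" onwards leaves the domain label.
  domain <- sub("_ENST.*$", "", names(seqs))
  idx_by_domain <- split(seq_along(seqs), domain)

  rows <- lapply(names(idx_by_domain), function(d) {
    idx <- idx_by_domain[[d]]
    if (length(idx) != 2) return(NULL)  # skip domains without exactly one pair

    # Local (Smith-Waterman) alignment with BLOSUM62 and default gap penalties
    aln <- pairwiseAlignment(seqs[[idx[1]]], seqs[[idx[2]]],
                             substitutionMatrix = "BLOSUM62",
                             type = "local")

    data.frame(file = basename(fasta_path),
               domain = d,
               seq1 = names(seqs)[idx[1]],
               seq2 = names(seqs)[idx[2]],
               score = score(aln),
               percent_identity = pid(aln),
               stringsAsFactors = FALSE)
  })

  # One row per domain; NULL entries (skipped domains) are dropped by rbind
  do.call(rbind, rows)
}
```

Calling `compare_domains("ESRRB_1.fasta")` would then be expected to return one row per domain rather than a single four-row alignment. If the headers follow a different pattern, the `sub()` call is the line to adapt.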

I personally find your post very difficult to follow, possibly due to a lack of domain expertise. I would suspect others feel the same. I'd recommend that you provide a) reproducible data (use `dput(head(your_data))`) and b) your desired output. If you use expressions such as `BLAST_p`, it may be helpful to clarify what they stand for, what you have tried, and where you are stuck. - coffeinjunky, Aug 05 '21 at 12:43
It would also be helpful to show the code you've tried so far and to explain why it doesn't solve the problem. - Limey, Aug 05 '21 at 12:48
Even though it is an old thread, you may find useful info here: https://stackoverflow.com/questions/4497747/how-to-perform-basic-multiple-sequence-alignments-in-r (if you have not already found it ^^) - Paul, Aug 05 '21 at 13:28
@coffeinjunky, I have tried to correct it. I am new to all this, so it may not be ideal. - MaheJaan, Aug 05 '21 at 13:38
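
Since the question asks for output across all 5,000 files, a per-file comparison can be applied in a loop and the results stacked into one table. The following is a minimal base R sketch under stated assumptions: `run_on_all_files` and `compare_fun` are hypothetical names, `compare_fun` is taken to be any function that accepts a file path and returns a data frame with the same columns for every file, and all files are assumed to sit in one directory with a `.fasta` extension.

```r
# Apply a per-file comparison function to every FASTA file in a directory
# and stack the per-file data frames into one table.
run_on_all_files <- function(fasta_dir, compare_fun) {
  files <- list.files(fasta_dir, pattern = "\\.fasta$", full.names = TRUE)

  results <- lapply(files, function(f) {
    # A file that cannot be processed is skipped with a warning,
    # so one bad file does not stop the whole run.
    tryCatch(compare_fun(f), error = function(e) {
      warning("Skipping ", basename(f), ": ", conditionMessage(e))
      NULL
    })
  })

  # NULL entries (skipped files) are dropped by rbind
  do.call(rbind, results)
}
```

The combined table could then be saved with `write.csv()`. Collecting the results in a list and binding once at the end avoids growing a data frame inside the loop, which gets slow with thousands of files.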
