I have 5,000 protein sequence .fasta files, each containing 2 different transcripts that are split by protein domain.
The file for one gene looks like this: [image]
I would like to run BLASTp on the sequences of the same domain, and get an output for all the domains, for all the files. I can do this manually on the NCBI BLAST website, by copying each sequence in: https://blast.ncbi.nlm.nih.gov/Blast.cgi
The output looks like this: [image]
The closest I have been able to get is reading the sequences into R and using the msa package. However, msa() treats each domain as an individual sequence and aligns all 4 sequences together, instead of aligning the 2 sequences of each domain separately. Different domains obviously do not align with each other, so the result is incorrect.
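library(msa)  # also attaches Biostrings, which provides readAAStringSet()
mySequences <- readAAStringSet("ESRRB_1.fasta")  # 4 sequences: 2 domains x 2 transcripts
myFirstAlignment <- msa(mySequences)  # aligns all 4 sequences together
myFirstAlignment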
The results were:
CLUSTAL 2.1

Call:
   msa(mySequences)

MsaAAMultipleAlignment with 4 rows and 180 columns
    aln                                                                                    names
[1] IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQ...DYELSQRHEEPWRTGKLLLTLPLLRQTAAKAVQHFYSVKLQ Hormone_recep_1_E...
[2] IKALTTLCDLADRELVVIIGWAKHIPGFSSLSLGDQMSLLQ...DYELSQRHEEPWRTGKLLLTLPLLRQTAAKAVQHFYSVKLQ Hormone_recep_1_E...
[3] -----RLC--------LVCG--DIASGYH---YGVASCEAC...----------------------------------------- zf-C4_1_ENST00000...
[4] -----RLC--------LVCG--DIASGYH---YGVASCEAC...----------------------------------------- zf-C4_1_ENST00000...
Con ??????LC???????????G??????G??????G???????...????????????????????????????????????????? Consensus
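A possible direction, as a rough sketch only: group the sequences of each file by domain first, then call msa() once per group, so that only the 2 transcripts of the same domain are aligned against each other. This assumes the FASTA headers have the form `<domain>_<n>_ENST...` (the names in the output above are truncated, so this is a guess). The helper names `domain_key` and `align_by_domain`, the example headers, and the directory name are all hypothetical. Note that this gives an msa alignment per domain, not actual BLASTp output.

```r
# Attach msa (and with it Biostrings) if it is installed.
if (requireNamespace("msa", quietly = TRUE)) {
  suppressPackageStartupMessages(library(msa))
}

# ASSUMPTION: headers look like "<domain>_<n>_ENST...", so the domain key is
# everything before "_ENST". Adjust the pattern to the real header format.
domain_key <- function(headers) sub("_ENST.*$", "", headers)

# Hypothetical headers, for illustration only (not taken from the real file).
example_headers <- c("Hormone_recep_1_ENST_A", "Hormone_recep_1_ENST_B",
                     "zf-C4_1_ENST_A", "zf-C4_1_ENST_B")

# Indices of the sequences that belong to each domain.
groups <- split(seq_along(example_headers), domain_key(example_headers))
print(groups)

# Align each domain of one file separately; returns a named list of alignments.
align_by_domain <- function(fasta_file) {
  seqs <- readAAStringSet(fasta_file)
  by_domain <- split(seq_along(seqs), domain_key(names(seqs)))
  # An alignment needs at least 2 sequences, so skip domains with only one.
  by_domain <- by_domain[lengths(by_domain) >= 2]
  lapply(by_domain, function(idx) msa(seqs[idx]))
}

# Usage over all files (hypothetical directory name):
# files <- list.files("fasta_dir", pattern = "\\.fasta$", full.names = TRUE)
# results <- lapply(setNames(files, basename(files)), align_by_domain)
```

With this layout, `results[["ESRRB_1.fasta"]]` would hold one alignment per domain found in that file, provided the header assumption holds.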