Suggestions on Analyzing Protein Sequence Similarity

Question

I want to write code to analyze short protein sequences and determine their similarity. I have no reference sequence; instead, I want to write some sort of for loop that compares every sequence to every other one, to see how many duplicate sequences I have and to find the regions where they are similar.

I currently have all of the sequences in a CSV file.

I have taken a bioinformatics course and have done something similar with Illumina sequencing data, but there I started from an SRA table and had FASTA files.

Also, I am trying to use CD-HIT, but I am running into problems with the makefile and the compatibility of my compiler. I installed Homebrew to get around the issue, but I am still running into the problem, and the make CXX=g++-9 CC=gcc-9 command won't work.

I was wondering whether there is a more up-to-date method than CD-HIT, because I have noticed that hardly anyone seems to have used CD-HIT since 2020.

Also, the only coding languages I know are R and Shell, but I am currently learning Python.

The biopython package may help? For example: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec99 — slothrop, Feb 07 '23 at 18:58
Please provide enough code so others can better understand or reproduce the problem. — Community, Feb 07 '23 at 21:11
When you say that you 'have no reference sequence' are you saying that you don't know if the sequences are homologous or not? Are these reads? — Jamie, Feb 14 '23 at 00:49

score 0 · Answer 1 · answered Feb 08 '23 at 17:33

YASS (https://bioinfo.lifl.fr/yass/index.php) may help. I have used it for SARS-CoV-2 and found similarity to many viruses.
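
For the duplicate counting and all-against-all comparison described in the question, a minimal Python sketch using only the standard library might look like the following. It assumes the CSV has a header row with a column named "sequence" (a hypothetical name; adjust it to the real file), and it uses difflib.SequenceMatcher as a generic string comparison, not a biological alignment: there is no substitution matrix or gap model, so the scores are only a rough first look. The function names and the example data are made up for illustration.

```python
import csv
import io
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def read_sequences(handle, column="sequence"):
    """Read sequences from an open CSV handle.

    The column name "sequence" is an assumption about the file layout.
    """
    reader = csv.DictReader(handle)
    cleaned = ((row.get(column) or "").strip().upper() for row in reader)
    return [seq for seq in cleaned if seq]


def count_duplicates(sequences):
    """Return {sequence: count} for sequences that occur more than once."""
    counts = Counter(sequences)
    return {seq: n for seq, n in counts.items() if n > 1}


def pairwise_similarity(sequences):
    """Compare every unique sequence with every other one.

    Returns (seq_a, seq_b, ratio, longest_shared_region) tuples, where
    ratio is difflib's 2*matches/total_length score between 0 and 1.
    """
    unique = sorted(set(sequences))
    results = []
    for a, b in combinations(unique, 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        block = matcher.find_longest_match(0, len(a), 0, len(b))
        shared = a[block.a:block.a + block.size]
        results.append((a, b, matcher.ratio(), shared))
    return results


# Made-up example data standing in for the real CSV file.
example_csv = io.StringIO(
    "id,sequence\n"
    "p1,MKTAYIAKQR\n"
    "p2,MKTAYIAKQR\n"
    "p3,MKTAYLAKQR\n"
    "p4,GGSWLPHHHH\n"
)
seqs = read_sequences(example_csv)
duplicates = count_duplicates(seqs)
pairs = pairwise_similarity(seqs)

for a, b, ratio, shared in pairs:
    print(f"{a} vs {b}: ratio={ratio:.2f}, longest shared region={shared!r}")
```

With a real file, the io.StringIO object would be replaced by open("your_file.csv", newline=""). The all-against-all loop grows quadratically with the number of unique sequences, which is fine for a few thousand short peptides but is the reason clustering tools exist for larger sets.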

player777
