
I want to write code that analyzes short protein sequences and determines how similar they are. I have no reference sequence; instead, I want to write some kind of for loop that compares every sequence to every other one, so I can see how many duplicate sequences I have and which regions they have in common.

I currently have all of the sequences in a CSV file.
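A minimal sketch of the all-against-all comparison described above, using only the Python standard library. The column names `id` and `sequence`, the helper names, and the example sequences are assumptions for illustration, and IDs are assumed to be unique. `difflib.SequenceMatcher` is a generic string matcher, not a biological alignment method, so its ratio is only a rough similarity score.

```python
import csv
import io
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def read_sequences(handle, id_col="id", seq_col="sequence"):
    """Read a CSV with (assumed) 'id' and 'sequence' columns into a dict."""
    reader = csv.DictReader(handle)
    return {row[id_col]: row[seq_col].strip().upper() for row in reader}


def count_duplicates(seqs):
    """Return each sequence that occurs more than once, with its count."""
    counts = Counter(seqs.values())
    return {seq: n for seq, n in counts.items() if n > 1}


def pairwise_similarity(seqs):
    """Compare every pair once; report a ratio and the longest shared region."""
    results = []
    for (id_a, a), (id_b, b) in combinations(seqs.items(), 2):
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        shared = a[m.a:m.a + m.size]
        results.append((id_a, id_b, matcher.ratio(), shared))
    return results


# Made-up example data standing in for the real CSV file.
example_csv = io.StringIO(
    "id,sequence\n"
    "p1,MKTAYIAKQR\n"
    "p2,MKTAYIAKQR\n"
    "p3,GGTAYIAKWW\n"
)
seqs = read_sequences(example_csv)
dups = count_duplicates(seqs)
pairs = pairwise_similarity(seqs)

for id_a, id_b, ratio, shared in pairs:
    print(f"{id_a} vs {id_b}: ratio={ratio:.2f}, longest shared region={shared}")
```

For a real file, `open("sequences.csv", newline="")` (a hypothetical file name) would replace the `io.StringIO` object. The number of comparisons grows quadratically with the number of sequences, so this approach suits small to moderate data sets.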

I have taken a bioinformatics course and did something similar with Illumina sequencing data, but there I started from an SRA table and had FASTA files.
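Because tools such as CD-HIT take FASTA input, one possible first step is converting the CSV to FASTA. The sketch below assumes columns named `id` and `sequence`; the function name `csv_to_fasta`, the line width of 60, and the example data are illustrative choices, not anything from the actual data set.

```python
import csv
import io


def csv_to_fasta(in_handle, out_handle, id_col="id", seq_col="sequence", width=60):
    """Write each CSV row as a FASTA record; return the number of records."""
    n_records = 0
    for row in csv.DictReader(in_handle):
        seq = row[seq_col].strip().upper()
        out_handle.write(f">{row[id_col]}\n")
        # Wrap long sequences at a fixed width (an arbitrary, common choice).
        for i in range(0, len(seq), width):
            out_handle.write(seq[i:i + width] + "\n")
        n_records += 1
    return n_records


# Made-up example data; in-memory handles stand in for real files.
src = io.StringIO("id,sequence\np1,MKTAYIAKQR\np2,GGTAYIAKWW\n")
dst = io.StringIO()
written = csv_to_fasta(src, dst)
fasta_text = dst.getvalue()
print(fasta_text)
```

With real files, the two `io.StringIO` objects would be replaced by handles from `open(...)`, for example a hypothetical `sequences.csv` for reading and `sequences.fasta` for writing.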

I am also trying to use CD-HIT, but I am running into problems with the makefile and the compatibility of my compiler. I installed Homebrew to get around the issue, but the problem persists and the make CXX=g++-9 CC=gcc-9 command won't work.

I was also wondering whether there is a more up-to-date method than CD-HIT, because I have noticed that hardly anyone seems to have used CD-HIT since 2020.

Also, the only languages I know are R and shell, but I am currently learning Python.

cosmomush
  • The biopython package may help? For example: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec99 – slothrop Feb 07 '23 at 18:58
  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 07 '23 at 21:11
  • When you say that you 'have no reference sequence', do you mean that you don't know whether the sequences are homologous? Are these reads? – Jamie Feb 14 '23 at 00:49
  • Also, exactly how short are the sequences that you have? – Jamie Feb 14 '23 at 00:50

1 Answer


You could try YASS: https://bioinfo.lifl.fr/yass/index.php. I have used it for SARS-CoV-2 and found similarity to many viruses.

player777