I want to write code to analyze short protein sequences and determine their similarity. I have no reference sequence. Instead, I want to write some sort of for loop that compares every sequence to every other sequence, so I can see how many duplicate sequences I have and which regions they share.
All of the sequences are currently in a single CSV file.
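To make the goal concrete, here is a minimal Python sketch of the all-against-all comparison I have in mind. It is an illustration, not a working pipeline. The column names "id" and "sequence", the inline sample data, and all function names are made up for this example. The similarity score comes from the standard library's difflib, which does plain string matching, not a real protein alignment with a substitution matrix. The last helper writes FASTA text, since that is the input format tools like CD-HIT expect.

```python
import csv
import io
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical CSV content; the column names "id" and "sequence" are assumptions.
CSV_TEXT = """id,sequence
p1,MKTAYIAKQR
p2,MKTAYIAKQR
p3,MKTAYLAKQR
p4,GGSGGSGGSG
"""


def read_sequences(handle, id_col="id", seq_col="sequence"):
    """Return a list of (id, sequence) tuples, sequences stripped and upper-cased."""
    reader = csv.DictReader(handle)
    return [(row[id_col], row[seq_col].strip().upper()) for row in reader]


def count_duplicates(records):
    """Map each sequence that occurs more than once to its number of occurrences."""
    counts = Counter(seq for _, seq in records)
    return {seq: n for seq, n in counts.items() if n > 1}


def pairwise_similarity(records):
    """Yield (id_a, id_b, ratio, longest_shared_region) for each pair of distinct sequences.

    The ratio is difflib's text similarity (0.0 to 1.0), not an alignment score.
    """
    unique = {}
    for rec_id, seq in records:
        unique.setdefault(seq, rec_id)  # keep the first id seen for each distinct sequence
    for (seq_a, id_a), (seq_b, id_b) in combinations(unique.items(), 2):
        matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
        block = matcher.find_longest_match(0, len(seq_a), 0, len(seq_b))
        shared = seq_a[block.a:block.a + block.size]
        yield id_a, id_b, matcher.ratio(), shared


def to_fasta(records):
    """Format (id, sequence) records as FASTA text."""
    return "".join(f">{rec_id}\n{seq}\n" for rec_id, seq in records)


records = read_sequences(io.StringIO(CSV_TEXT))
print(count_duplicates(records))
for id_a, id_b, ratio, shared in pairwise_similarity(records):
    print(id_a, id_b, round(ratio, 2), shared)
```

With a real file, the io.StringIO(CSV_TEXT) call would be replaced by open("my_sequences.csv", newline=""), where the file name is again a placeholder. The pairwise loop grows quadratically with the number of distinct sequences, so it is only practical for fairly small sets.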
I have taken a bioinformatics course and have done something similar with Illumina sequencing data, but there I started from an SRA table and had FASTA files.
I am also trying to use CD-HIT, but I am running into problems with the makefile and the compatibility of my compiler. I installed Homebrew to get around the issue, but I still hit the same problem, and the make CXX=g++-9 CC=gcc-9 command won't work.
I was wondering if there is a more up-to-date method than CD-HIT, because I have noticed that no one has really used CD-HIT since 2020.
Also, the only programming languages I know are R and shell, but I am currently learning Python.