-2

I have a text file containing several lines of codons. Each line holds a sequence of exactly three nucleotides, each of which can be A, T, G, or C (e.g. ATC). I want to write a while loop that reads these lines, counts them, and outputs each codon with the number of times it occurred in the file, ordered from highest to lowest.

You can't use awk in this loop, only grep and uniq. Thanks.

  • 2
    Why no awk? Is this some kind of a homework? Also, `sort` would be convenient. – choroba Nov 03 '19 at 21:08
  • 2
    `I want to write` Then do it. You can find much help online on how to [read a file line by line](https://mywiki.wooledge.org/BashFAQ/001) or on [counting unique lines](https://stackoverflow.com/questions/15984414/bash-script-count-unique-lines-in-file). If you want others to do the job for you, try freelancing sites, where you offer money for others' work. – KamilCuk Nov 03 '19 at 21:11
  • Why use only grep and uniq? Why do you even need grep? – Timur Shtatland Nov 03 '19 at 21:39
  • That's how I was asked to do it. So only grep and uniq. – Dharmanand Ravirajan Nov 03 '19 at 22:40
  • From your reply, plus the comments below the dash-o answer, your question now seems more complex. Could you please (a) show a simple example of the input (codons, other text) and the output you need, and (b) give some more details on why exactly someone would use only grep and uniq when other simpler and equally common tools exist? Any solution with grep + uniq would probably be less efficient and harder for maintainers of your code than sort + uniq (which are very common). Or do you simply need to filter with `grep -P '^[ACGT]{3}$'` before `sort | uniq -c`? – Timur Shtatland Nov 04 '19 at 02:57
  • OK. The input file is something like: aaa ttt ata atc cta ccc ccc ccc – Dharmanand Ravirajan Nov 04 '19 at 03:33
  • The output I need is a list of the number of times each of these ('ccc', 'ttt', 'aaa', 'atc', and so on) is repeated. – Dharmanand Ravirajan Nov 04 '19 at 03:34

1 Answer

2

You can combine grep (to filter lines that contain only A, T, G, and C), sort and uniq (to count the distinct lines), then an extra sort to order the result from highest to lowest:

grep '^[ATGC]\+$' | sort | uniq -c | sort -k1nr

This will work for reasonably sized files (certainly for fewer than 1M lines). For larger files, consider an awk/Perl/Python solution to avoid the overhead of sorting the complete file.
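
For the larger-file case, a single-pass count might look like the sketch below. It is only an illustration, not a tested benchmark: the file name `codons.txt`, the sample data, and the function name `count_codons` are assumptions made for the example. awk tallies each codon in an associative array, so the final sort only has to order the distinct codons (at most 64 for three-letter sequences) rather than every line of the file.

```shell
# Hypothetical sample input; the file name and contents are assumptions.
printf '%s\n' aaa ttt ata atc cta ccc ccc ccc 'not a codon' > codons.txt

# count_codons FILE (hypothetical helper)
# Tally lines that consist of exactly three nucleotides (either case),
# then sort the distinct codons by count, highest first.
count_codons() {
  awk '/^[ACGTacgt][ACGTacgt][ACGTacgt]$/ { n[$0]++ }
       END { for (c in n) print n[c], c }' "$1" |
    sort -k1,1nr -k2,2
}

count_codons codons.txt
```

The three repeated bracket expressions are used instead of an interval such as `{3}` because some older awk implementations do not support intervals.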

dash-o
  • Thanks for the reply. I know I can use sort and uniq. I don't know how to use grep to search. Usually, if it's a word or pattern, I can use grep -c 'xx'. In my case each character could be an A, T, G or C, and there can be only three of them per line. – Dharmanand Ravirajan Nov 03 '19 at 21:22
  • Do you mean that there are other lines in the file that need to be filtered out before the sort? – dash-o Nov 03 '19 at 21:28
  • Yes. It's a text file with several lines. I need to parse these and rank the words based on the number of times they are repeated. – Dharmanand Ravirajan Nov 03 '19 at 22:34
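
Regarding the follow-up about matching exactly three nucleotides per line, one possible sketch is shown below. The file name `sample.txt` and the function name `rank_codons` are assumptions, and the sample data is based on the input given in the comments plus two made-up lines that should be rejected. Note that it still relies on sort, because uniq only collapses adjacent duplicate lines; the while loop is used only to print the codon before its count.

```shell
# Hypothetical sample input, based on the example given in the comments.
printf '%s\n' aaa ttt ata atc cta ccc ccc ccc 'some other text' atcg > sample.txt

# rank_codons FILE (hypothetical helper)
#   grep -i : ignore case, so both "ATC" and "atc" match
#   grep -x : the whole line must match the pattern
#   \{3\}   : exactly three characters from the set
rank_codons() {
  grep -ix '[acgt]\{3\}' "$1" | sort | uniq -c | sort -k1,1nr |
    while read -r count codon; do
      # uniq -c prints "count codon"; swap to "codon<TAB>count".
      printf '%s\t%s\n' "$codon" "$count"
    done
}

rank_codons sample.txt
```

Lines such as `atcg` or `some other text` are dropped by the filter because `-x` requires the entire line to be three nucleotides.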