
I am taking a fourth-year bioinformatics course. In the current assignment, the prof has given us a GFF file in which all the miRNA genes in the human genome are annotated as gene-MIR. We are supposed to use grep, along with a regular expression and other command-line tools, to generate a list of unique miRNA names in the human genome. It seems fairly straightforward and I understand how to do most of it, but I am having trouble sorting the output and removing the repeated lines. We are supposed to do all of this in a single command line.

This is the grep command I used to generate a list of gene-MIR names:

grep -Eo "(\gene-MIR)\d*\w*" file.gff

But this only generates a huge list with multiple repeats. So I tried:

grep -Eo "(\gene-MIR)\d*\w*" file.gff > file2 | sort < file2 | uniq -c > file3

But this did not work either. I have tried many variations of the above, but I am unsure of what to do next.

Can anyone offer any help/advice?

Cyrus
  • Add a few sample lines (say 10-20 lines with some duplicates) and expected output for that sample. Also, 1) `grep -E` doesn't support `\d` 2) Use `grep '..' | sort -u > op_file` or `grep '..' | sort | uniq -c > op_file` (don't create files in the middle) – Sundeep Sep 25 '21 at 06:29
  • Please add sample input (no descriptions, no images, no links) and your desired output for that sample input to your question (no comment). – Cyrus Sep 25 '21 at 06:52
  • Try `grep -Po 'gene-MIR\w*' file.gff | sort -u > file3` – Wiktor Stribiżew Sep 25 '21 at 09:43

1 Answer


You can use

grep -o 'gene-MIR[[:alnum:]_]*' file.gff | sort -u > file3

Details:

  • -o - prints only the matched text, one match per line
  • gene-MIR[[:alnum:]_]* - a regex that matches gene-MIR followed by zero or more "word" characters, i.e. letters, digits or underscores (the bracket expression is used because \w is not supported universally)
  • sort -u - sorts the matches and keeps only the unique entries.
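
To see why the pipeline form matters, note what happens in the command from the question: `grep ... > file2 | sort < file2` redirects grep's output into file2, so nothing travels through the pipe, and because all stages of a pipeline start at the same time, sort may read file2 before grep has finished writing it. Piping directly avoids the intermediate file. The following is a minimal sketch on made-up data; the sample records and the names `sample.gff`, `unique_names` and `name_counts` are assumptions for illustration only, and `LC_ALL=C` is added just to make the sort order predictable.

```shell
# Hypothetical sample: a few GFF-like lines with repeated miRNA IDs
# (invented records, not real annotation data).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.gff"
{
  printf 'chr17\tsrc\tgene\t100\t171\t.\t+\t.\tID=gene-MIR21;Name=MIR21\n'
  printf 'chr17\tsrc\texon\t100\t171\t.\t+\t.\tParent=gene-MIR21\n'
  printf 'chr21\tsrc\tgene\t500\t564\t.\t+\t.\tID=gene-MIR155;Name=MIR155\n'
  printf 'chr21\tsrc\texon\t500\t564\t.\t+\t.\tParent=gene-MIR155\n'
  printf 'chr1\tsrc\tgene\t900\t990\t.\t-\t.\tID=gene-ABC1;Name=ABC1\n'
} > "$sample"

# Extract every gene-MIR... match, then sort and de-duplicate in one pipeline.
unique_names=$(grep -o 'gene-MIR[[:alnum:]_]*' "$sample" | LC_ALL=C sort -u)
printf '%s\n' "$unique_names"

# Variant: keep the duplicates through sort and let uniq -c count them.
name_counts=$(grep -o 'gene-MIR[[:alnum:]_]*' "$sample" | LC_ALL=C sort | uniq -c)
printf '%s\n' "$name_counts"

# Clean up the temporary sample.
rm -rf "$tmpdir"
```

On this sample the first pipeline should leave two names, gene-MIR155 and gene-MIR21. One thing to watch for: the bracket expression does not include `-`, so a name that contains a hyphen would be cut off at the hyphen; adding `-` at the end of the bracket expression would be one way to keep such names whole.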
Wiktor Stribiżew