How can I use grep with pipe to sort uniq lines from a gff file

Question

I am taking a fourth-year bioinformatics course. For the current assignment, the prof has given us a GFF file in which all the miRNA genes in the human genome are annotated as gene-MIR. We are supposed to use grep, along with a regular expression and other command-line tools, to generate a list of unique miRNA names in the human genome. It seems fairly straightforward, and I understand how to do most of it, but I am having trouble sorting the output and removing the repeated lines. We are supposed to do all of this in a single command line.

This is the grep command I used to generate a list of gene-MIR names:

grep -Eo "(\gene-MIR)\d*\w*" file.gff

But this only generates a huge list with multiple repeats. So I tried:

grep -Eo "(\gene-MIR)\d*\w*" file.gff > file2 | sort < file2 | uniq -c > file3

But this did not work either. I have tried many variations of the above, but I am unsure of what to do next.

Can anyone offer any help/advice?

Add a few sample lines (say 10-20 lines with some duplicates) and expected output for that sample. Also, 1) `grep -E` doesn't support `\d` 2) Use `grep '..' | sort -u > op_file` or `grep '..' | sort | uniq -c > op_file` (don't create files in the middle) — Sundeep, Sep 25 '21 at 06:29
Please add sample input (no descriptions, no images, no links) and your desired output for that sample input to your question (no comment). — Cyrus, Sep 25 '21 at 06:52

score 0 · Accepted Answer · answered Sep 25 '21 at 21:01

You can use

grep -o 'gene-MIR[[:alnum:]_]*' file.gff | sort -u > file3

Details:

-o - outputs only the matched text, one match per line
gene-MIR[[:alnum:]_]* - regex matching gene-MIR followed by zero or more "word" characters (letters, digits, or underscores); the bracket expression is used because \w is not universally supported
sort -u - sorts the matches and keeps only unique entries
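
To see the pipeline at work, here is a minimal sketch on a made-up sample. The five GFF-like lines, their coordinates, and the variable names are assumptions for illustration only and are not taken from the asker's file. It also shows the `sort | uniq -c` variant suggested in the comments, which reports how often each name occurs.

```shell
# Hypothetical sample standing in for file.gff (tab-separated, invented coordinates).
sample=$(mktemp)
{
  printf 'chr17\tBestRefSeq\tgene\t100\t171\t.\t+\t.\tID=gene-MIR21;Name=MIR21\n'
  printf 'chr17\tBestRefSeq\tprimary_transcript\t100\t171\t.\t+\t.\tID=rna-MIR21;Parent=gene-MIR21\n'
  printf 'chr21\tBestRefSeq\tgene\t200\t264\t.\t+\t.\tID=gene-MIR155;Name=MIR155\n'
  printf 'chr21\tBestRefSeq\tprimary_transcript\t200\t264\t.\t+\t.\tID=rna-MIR155;Parent=gene-MIR155\n'
  printf 'chr9\tBestRefSeq\tgene\t300\t379\t.\t+\t.\tID=gene-MIRLET7A1;Name=MIRLET7A1\n'
} > "$sample"

# Unique names: grep emits one match per line, sort -u sorts and de-duplicates.
unique_names=$(grep -o 'gene-MIR[[:alnum:]_]*' "$sample" | sort -u)
printf '%s\n' "$unique_names"
# gene-MIR155
# gene-MIR21
# gene-MIRLET7A1

# Counts per name: uniq -c needs sorted input, so sort comes first.
name_counts=$(grep -o 'gene-MIR[[:alnum:]_]*' "$sample" | sort | uniq -c)
printf '%s\n' "$name_counts"

rm -f "$sample"
```

In the question's attempt, `> file2` redirects grep's output into a file, so nothing flows through the pipe that follows, and `sort < file2` may start reading before grep has finished writing. Piping the commands directly, without intermediate files, avoids both problems.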
