-1

I want to sort input by number of appearances. However, I don't want to delete either the unique or the non-unique lines. For instance, if I were given the following input:

Not unique
This line is unique
Not unique
Also not unique
Also unique
Also not unique
Not unique

I'd be looking for a set of pipelined commands that would output the following:

This line is unique
Also unique
Also not unique
Also not unique
Not unique
Not unique
Not unique

Thank you for any help you can provide. I've been trying different combinations of uniq and sort but can't figure it out; the solution would preferably be a one-liner.

UPDATE: Thank you to all who responded, especially @batMan, whose answer was exactly what I was looking for and used commands I was familiar with.

I'm still trying to learn how to pipeline and use multiple commands for seemingly simple tasks, so is it possible to adapt his answer to work with two columns? For instance, if the original input had been:

Notunique dog 
Thislineisunique cat 
Notunique parrot 
Alsonotunique monkey 
Alsounique zebra 
Alsonotunique beaver 
Notunique dragon

And I wanted the output to be sorted by the number of appearances of the first column, like so:

Thislineisunique cat 
Alsounique zebra 
Alsonotunique monkey 
Alsonotunique beaver 
Notunique dog 
Notunique parrot 
Notunique dragon

Thank you all in advance for being so helpful!

  • can you show what you have tried so far? I'd use a short Python script, which can be pretty short using `collections.Counter`, but this would not work for a pure `shell` solution. – norok2 Oct 07 '17 at 16:48

3 Answers

1

awk alone would be best for your updated question.

$ awk '{file[$0]++; count[$1]++; max_count= count[$1]>max_count?count[$1]:max_count;} END{ k=1; for(n=1; n<=max_count; n++){ for(i in count) if(count[i]==n) ordered[k++]=i} for(j in ordered) for( line in file) if (line~ordered[j]) print line; }' file

Alsounique zebra
Thislineisunique cat
Alsonotunique beaver
Alsonotunique monkey
Notunique parrot
Notunique dog
Notunique dragon

Explanation:

Part-1:

{file[$0]++; count[$1]++; max_count= count[$1]>max_count?count[$1]:max_count;}:

We store each line of your input file in the file array. The count array keeps track of how many times each unique first field appears, which is the value you want the file sorted by. max_count keeps track of the largest count seen.

Part-2: Once awk finishes reading the file, the content of count would be as follows (keys, values):

Alsounique 1
Notunique 3
Thislineisunique 1
Alsonotunique 2

Now our aim is to sort these keys by their values, as shown below. This is the key step: for each key (first field) in the output below, we iterate over the file array and print the lines that contain that key, which gives us the final desired output.

Alsounique 
Thislineisunique 
Alsonotunique 
Notunique 

The loop below stores the keys of the count array in another array called ordered, sorted by their values. The content of ordered will be the same as the output shown above.

for(n=1; n<=max_count; n++)
    { 
        for(i in count) 
            if(count[i]==n) 
            ordered[k++]=i
    } 

The final step is to iterate over the file array and print the lines in the order of the fields stored in the ordered array.

for (field in ordered)
    for (line in file)
        if (line ~ ordered[field])
            print line
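
One caveat, and the likely reason the ordering can differ between awk versions (the later comments run into this): for (j in ordered) visits array elements in an unspecified, implementation-defined order, so the final loop is not guaranteed to follow ascending frequency. A minimal variant that walks ordered by numeric index instead is sketched below; it assumes the same input file named file as above and, like the original, matches each key anywhere in the line with ~:

$ awk '{ file[$0]++; count[$1]++                              # store line; count first field
         max_count = count[$1] > max_count ? count[$1] : max_count }
       END {
           k = 1
           for (n = 1; n <= max_count; n++)                   # ascending frequency
               for (i in count)
                   if (count[i] == n)
                       ordered[k++] = i                       # ordered[1..k-1] = keys, least frequent first
           for (j = 1; j < k; j++)                            # numeric index loop instead of "for (j in ordered)"
               for (line in file)
                   if (line ~ ordered[j])
                       print line
       }' file

Ties (keys with the same count) and lines within the same key still come out in arbitrary order, just as in the original.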

Solution-2:
Another possible solution uses sort, uniq and awk/cut. I wouldn't recommend it if your input file is very large, though, as multiple pipes invoke multiple processes, which slows down the whole operation.

$ cut -d ' ' -f1 file | sort | uniq -c | sort -n | awk 'FNR==NR{ordered[i++]=$2; next} {file[$0]++;} END{for(j in ordered) for( line in file) if (line~ordered[j]) print line;} ' - file
Alsounique zebra
Thislineisunique cat
Alsonotunique beaver
Alsonotunique monkey
Notunique parrot
Notunique dog
Notunique dragon

Previous solution (before the OP edited the question)

This can be done using sort, uniq and awk like this:

$ uniq -c <(sort f1) | sort -n | awk '{ for (i=1; i<$1; i++){print}}1'
      1 Also unique
      1 This line is unique
      2 Also not unique
      2 Also not unique
      3 Not unique
      3 Not unique
      3 Not unique
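
Note that the output above still carries the leading count column from uniq -c. If you want the question's exact output (no counts), one option is to strip the count at the end, for example with sed; this is just a sketch using the same input file f1:

$ uniq -c <(sort f1) | sort -n | awk '{ for (i=1; i<$1; i++){print}}1' | sed 's/^ *[0-9]* //'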
Rahul Verma
  • Thank you so much for this, it accomplishes exactly what I was looking for! I'm still trying to learn how to pipeline and use multiple commands for seemingly simple tasks so is it possible for me to adapt this to work with 2 columns? For instance if the original input had been Notunique 1 Thislineisunique 2 Notunique 3 Alsonotunique 4 Alsounique 5 Alsonotunique 6 Not unique 7 And I wanted the output to be sorted by first column like so Thislineisunique 2 Alsounique 5 Alsonotunique 4 Alsonotunique 6 Notunique 1 Notunique 3 Notunique 7 Where numbers just represent any text – trysofter Oct 08 '17 at 03:04
  • I edited the original post so the previous comment is in a better format, thank you again! – trysofter Oct 08 '17 at 03:18
  • the 1st column is redundant in your approach – RomanPerekhrest Oct 08 '17 at 05:17
  • @trysofter: Check the updated solution for your updated question. Let me know if you have any questions. – Rahul Verma Oct 08 '17 at 15:10
  • Thank you for helping me learn how to use these commands and for your in-depth explanation! However, I attempted to use your awk-only solution and it groups the keys together properly, but the output isn't based on ascending frequency of the first column as it was with the non-updated question. The output on my machine for the given input was: Notunique dog Notunique dragon Notunique parrot Alsounique zebra Thislineisunique cat Alsonotunique beaver Alsonotunique monkey. Solution 2 works, but I am now wondering how to fix the awk command to also work. Thanks again for your help! – trysofter Oct 08 '17 at 15:51
  • Not sure why it's so. I'm using gawk 4.1 and it's giving me expected output on your input. What's your awk version, try `awk --version` ? Which option have you tried ? – Rahul Verma Oct 08 '17 at 16:07
  • My version is GNU 4.0.2. I've tried both Option 1 and Option 2 and Option 2 works as intended. Option 1 might not be working because I'm trying to pipe input from another command rather than saving to a file and executing that way like in Option 2. However I did save the input to a file and try and run your command (Option 1) verbatim and still got the same output I had gotten when I just piped in the input from another command. – trysofter Oct 08 '17 at 16:41
  • It makes sense that it doesn't work when you're piping from another command for option 1, as slight changes would be needed to handle that. But if you're reading from a file then it should. If you have some unwanted/non-printable chars you can check using `cat -A file` – Rahul Verma Oct 08 '17 at 17:03
0

uniq + sort + grep solution:

Extended inputfile contents:

Not unique
This line is unique
Not unique
Also not unique
Also unique
Also not unique
Not unique
Also not unique
Also not unique

Sorting the initial file beforehand:

sort inputfile > /tmp/sorted

uniq -u /tmp/sorted; uniq -dc /tmp/sorted | sort -n | cut -d' ' -f8- \
   | while read -r l; do grep -x "$l" /tmp/sorted; done

The output:

Also unique
This line is unique
Not unique
Not unique
Not unique
Also not unique
Also not unique
Also not unique
Also not unique

----------

You may also enclose the whole job in a bash script:

#!/bin/bash

sort "$1" > /tmp/sorted   # $1 - the 1st argument (filename)
uniq -u /tmp/sorted

while read -r l; do
    grep -x "$l" /tmp/sorted
done < <(uniq -dc /tmp/sorted | sort -n | cut -d' ' -f8-)
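
For example, if the script were saved as count_sort.sh (a placeholder name), you could run it like this:

chmod +x count_sort.sh          # make the script executable
./count_sort.sh inputfile       # prints the frequency-sorted lines to stdout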
RomanPerekhrest
  • This doesn't sort by number of appearances, it just puts unique lines (sorted) first, the non-unique lines sorted alphabetically, not by frequency. If the input had a few more lines `Also not unique` in it, it should show up at the end of the output, but it won't for this solution. – Benjamin W. Oct 07 '17 at 19:26
  • Just use this as your input file, each character on a separate line: `A A B B B B C C C`. Clearly, sorted by frequency, this would either have to become `A A C C C B B B B` or, sorted decreasingly, `B B B B C C C A A`, but it'll be the unmodified input. `uniq` doesn't rearrange its input, just filters it. – Benjamin W. Oct 07 '17 at 20:17
0

I would use awk to count the number of times each line occurs, print the lines out prepended by their frequency, and then sort numerically using sort -n:

awk 'FNR==NR{freq[$0]++; next} {print freq[$0],$0}' data.txt data.txt | sort -n

Sample Output

1 Also unique
1 This line is unique
2 Also not unique
2 Also not unique
3 Not unique
3 Not unique
3 Not unique

It's a Schwartzian transform really. If you want to discard the leading frequency column, just add | cut -d ' ' -f 2- to the end of the command.
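
For instance, the complete pipeline with the frequency column discarded would look like this (same data.txt as above):

awk 'FNR==NR{freq[$0]++; next} {print freq[$0],$0}' data.txt data.txt | sort -n | cut -d ' ' -f 2-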

Mark Setchell