Updated / Summary
Here is a summary of a few tested approaches to this sort-and-count task:
Using bash
(In my case v4.3.46)
sortedfile="$(sort temp)" ; countedfile="$(uniq -c <<< "$sortedfile")" ; uniquefile="$(sort -rn <<< "$countedfile")" ; totalunique="$(wc -l <<< "$uniquefile")" ; echo -e "$uniquefile\nTotal: $totalunique"
Using sh/ash/busybox
(These shells are not identical, but they all behaved the same in these tests)
time (sort temp > /tmp/sortedfile ; uniq -c /tmp/sortedfile > /tmp/countedfile ; sort -rn /tmp/countedfile > /tmp/uniquefile ; totalunique="$(cat /tmp/uniquefile | wc -l)" ; cat /tmp/uniquefile ; echo "Total: $totalunique")
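The sh version above writes to fixed names under /tmp, which could clash if two runs overlap. A minimal sketch of the same steps in a private scratch directory follows. It assumes mktemp is available (it is common, but not required by POSIX), and count_unique is a hypothetical name used only for illustration.

```shell
# Sketch only: same sort / uniq -c / sort -rn steps as the sh version,
# but with intermediate files kept in a private scratch directory.
# Assumes mktemp exists on the system; count_unique is a made-up name.
count_unique() {
    workdir="$(mktemp -d)" || return 1
    sort "$1" > "$workdir/sorted"                    # group identical lines
    uniq -c "$workdir/sorted" > "$workdir/counted"   # count each group
    sort -rn "$workdir/counted" > "$workdir/unique"  # highest count first
    cat "$workdir/unique"
    # tr strips the padding that some wc implementations add
    echo "Total: $(wc -l < "$workdir/unique" | tr -d ' ')"
    rm -rf "$workdir"
}
```

It would be called as count_unique temp and prints the counted lines followed by the number of unique addresses.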
Using perl
(see this answer https://stackoverflow.com/a/40145395/3544399)
perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "Total: ", $c}' temp
What was tested
A file temp was created using a random generator:
- The domain part (@domain.com) differed between the unique addresses
- Duplicated addresses were scattered
- File had 55304 total addresses
- File had 17012 duplicate addresses
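The generator itself is not shown here. As a rough sketch of how a comparable file could be produced, the snippet below draws addresses at random from a fixed pool so that duplicates end up scattered. Everything in it is an assumption for illustration: the gen_addresses name, the fixed seed, and the 0 to 32767 number range (guessed from the sample below). The number of distinct addresses in the output can be lower than the pool size.

```shell
# Sketch only, not the generator used for the tests above.
# gen_addresses TOTAL POOL: print TOTAL addresses drawn (with replacement)
# from a pool of POOL randomly built addresses.
gen_addresses() {
    awk -v total="$1" -v pool="$2" 'BEGIN {
        srand(42)                        # fixed seed so runs are repeatable
        for (i = 1; i <= pool; i++)      # build the pool of candidates
            addr[i] = int(rand() * 32768) "@" int(rand() * 32768) ".com"
        for (i = 1; i <= total; i++)     # draw with replacement from the pool
            print addr[int(rand() * pool) + 1]
    }'
}
```

For example, gen_addresses 1000 300 > temp would write 1000 lines containing at most 300 distinct addresses.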
A small sample of the file looks like this:
24187@9674.com
29397@13000.com
18398@27118.com
23889@7053.com
24501@7413.com
9102@4788.com
16218@20729.com
991@21800.com
4718@19033.com
22504@28021.com
Performance:
For completeness, here are the timings for each approach on the test file:
perl:            sh:              bash:
Total: 17012     Total: 17012     Total: 17012
real 0m0.119s    real 0m0.838s    real 0m0.973s
user 0m0.061s    user 0m0.772s    user 0m0.894s
sys  0m0.027s    sys  0m0.025s    sys  0m0.056s
Original Answer (Counted total addresses and not unique addresses):
tcount="$(cat temp | wc -l)" ; sort temp | uniq -c -i | sort -rn ; echo "Total: $tcount"
- tcount="$(cat temp | wc -l)" : Make a variable holding the line count
- sort temp : Group email addresses ready for uniq
- uniq -c -i : Count occurrences, allowing for case variation
- sort -rn : Sort numerically by occurrences and reverse the order (highest on top)
- echo "Total: $tcount" : Show the total addresses at the bottom
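One thing to watch with these steps: uniq only compares adjacent lines, so uniq -i merges John@domain.com and john@domain.com only if sort has placed them next to each other. In a byte-order locale a plain sort may not do that. The sketch below folds case in sort as well (sort -f) and prints both the total and the unique count. The count_ci name is hypothetical, the LC_ALL=C setting is an assumption made to keep the behaviour predictable, and the helper assumes a non-empty input file.

```shell
# Sketch only, not part of the original answer; count_ci is a made-up name.
# Prints per-address counts (ignoring case), then both totals.
count_ci() {
    # sort -f folds case so that uniq -i sees case variants on adjacent lines;
    # LC_ALL=C pins the collation so the result does not depend on the locale.
    counted="$(LC_ALL=C sort -f "$1" | uniq -c -i | sort -rn)"
    printf '%s\n' "$counted"
    # wc -l < file avoids the extra cat; tr strips padding some wc versions add
    echo "Total: $(wc -l < "$1" | tr -d ' ')"
    echo "Unique: $(printf '%s\n' "$counted" | wc -l | tr -d ' ')"
}
```

Called as count_ci temp, it lists the counts first, then the number of lines in the file and the number of distinct addresses.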
Sample temp file:
john@domain.com
john@domain.com
donald@domain.com
john@domain.com
sam@domain.com
sam@domain.com
bill@domain.com
john@domain.com
larry@domain.com
sam@domain.com
larry@domain.com
larry@domain.com
john@domain.com
Sample Output:
5 john@domain.com
3 sam@domain.com
3 larry@domain.com
1 donald@domain.com
1 bill@domain.com
Total: 13
Edit: See comments below regarding use of sort