
Hi, so basically I have a text file, 'temp', containing a long list of email addresses (some repeated). What I'm trying to output is the email addresses ordered by highest frequency, followed by the total number of unique email addresses at the end.

awk '{printf "%s %s\n", $2, $1} END {print "total "NR}' temp | sort -n | uniq -c -i

So far I get the output I want, except that it's not ordered by highest frequency; instead it's in alphabetical order.

I've been stuck on this for a few hours now and have no idea why. I know I probably did something wrong, but I'm not sure what. Please let me know if you need more information or if the code I provided is not the problem. Thank you in advance.

edit: I've also tried `sort -nk1` (the output has the frequency in the first column) and even `-nk2`

edit2: Here is a sample of my 'temp' file

aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
charter.net
yahoo.com

edit 3:

expected output:

33 aol.com
24 netscape.net
18 yahoo.com
5 adelphia.net
4 twcny.rr.com
3 charter.net
total 6

(no repeat emails, 6 total unique email addresses)

    Show us a short but representative sample from the file `temp`. – John1024 Oct 20 '16 at 01:37
  • aol.com netscape.net yahoo.com yahoo.com adelphia.net twcny.rr.com charter.net yahoo.com here is a short sample edit nvm I will update my original post – Daniel Oct 20 '16 at 03:00
    @Daniel it is better to add expected output as well for clarity – Sundeep Oct 20 '16 at 03:02
  • ok I added that too! – Daniel Oct 20 '16 at 03:05
  • If your input file contains only _1_ column, why are you trying to print the _2nd_ column in your `awk` command? – mklement0 Oct 20 '16 at 03:06
    Your expected output neither matches your sample input, nor the stated intent of the solution you're seeking (highest frequency first). – mklement0 Oct 20 '16 at 03:08
  • I think OP just added expected output for entire input file rather than just temp given here.. and expected output seems to be following same order in which the emails first occur, not alphabetic, not numeric, etc – Sundeep Oct 20 '16 at 03:10
    @Sundeep That's a good _guess_, but I want to guide the OP toward providing an [MCVE (Minimal, Complete, and Verifiable Example)](http://stackoverflow.com/help/mcve). Also note that output doesn't even match the "output is the email addresses _in order of highest frequency_" requirement. – mklement0 Oct 20 '16 at 03:14
  • Sorry, I definitely misunderstood how my line of code was supposed to work. When I played around with uniq I saw that it output two columns, so I thought that if I piped it into the awk command I'd have to work with that — wrong thinking, I now see, since you pointed that out. I also don't think I understood correctly how uniq -c -i works. I looked through the manual but I'm definitely understanding it wrong: I thought it would get rid of the repeats (like the multiple yahoo's), but it didn't when I played around with it in the console. – Daniel Oct 20 '16 at 03:14
    @Daniel: It's generally better to start your question _explaining the problem you're trying to solve_, followed by your (unsuccessful) solution attempt - which may or may not be on the right track _fundamentally_. Otherwise, you'll fall victim to the [XY problem](http://meta.stackexchange.com/a/66378/248777). – mklement0 Oct 20 '16 at 03:16
  • @Sundeep sorry I updated the expected output. Had a brain lapse there. – Daniel Oct 20 '16 at 03:19
  • Daniel: It's still not _internally_ consistent with the sample input, but it's clearer now - it sounds like @hmedia1's answer has the right solution. – mklement0 Oct 20 '16 at 03:21

3 Answers


Sample input, modified so that a second address (netscape.net) also appears more than once:

$ cat ip.txt 
aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
netscape.net
charter.net
yahoo.com

Using perl

$ perl -lne '
$c++ if !$h{$_}++;
END
{
    @k = sort { $h{$b} <=> $h{$a} } keys %h;
    print "$h{$_} $_" foreach (@k);
    print "total ", $c;
}' ip.txt
3 yahoo.com
2 netscape.net
1 adelphia.net
1 charter.net
1 aol.com
1 twcny.rr.com
total 6
  • `$c++ if !$h{$_}++` increments the unique-line counter `$c` only the first time a line is seen, and increments the hash entry keyed by the input line. Both default to an initial value of 0.
  • After processing all input lines:
    • `@k = sort { $h{$b} <=> $h{$a} } keys %h` gets the keys sorted by descending numeric hash value
    • `print "$h{$_} $_" foreach (@k)` prints each count and key, following the sorted order in `@k`
    • `print "total ", $c` prints the total number of unique lines


Can be written as a single line if preferred:

perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "total ", $c}' ip.txt


Reference: How to sort perl hash on values and order the keys correspondingly


In GNU awk, using @Sundeep's data:

$ cat program.awk
{ a[$0]++ }                                # count occurrences of each address
END {
    PROCINFO["sorted_in"]="@val_num_desc"  # make for(i in a) iterate in descending value order
    for(i in a) {
        print a[i], i
        j++                                # count unique addresses
    }
    print "total", j
}

Run it:

$ awk -f program.awk ip.txt
3 yahoo.com
2 netscape.net
1 twcny.rr.com
1 aol.com
1 adelphia.net
1 charter.net
total 6

Updated / Summary

Summarising a few tested approaches to this sorting task:

Using bash (In my case v4.3.46)

sortedfile="$(sort temp)" ; countedfile="$(uniq -c <<< "$sortedfile")" ; uniquefile="$(sort -rn <<< "$countedfile")" ; totalunique="$(wc -l <<< "$uniquefile")" ; echo -e "$uniquefile\nTotal: $totalunique"

Using sh/ash/busybox (Though they aren't all the same, they all worked the same for these tests)

time (sort temp > /tmp/sortedfile ; uniq -c /tmp/sortedfile > /tmp/countedfile ; sort -rn /tmp/countedfile > /tmp/uniquefile ; totalunique="$(cat /tmp/uniquefile | wc -l)" ; cat /tmp/uniquefile ; echo "Total: $totalunique")

Using perl (see this answer https://stackoverflow.com/a/40145395/3544399)

perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "Total: ", $c}' temp

What was tested

A file temp was created using a random generator:

  • The @domain.com part differed between the unique addresses
  • Duplicated addresses were scattered throughout the file
  • The file had 55304 total addresses
  • The file had 17012 duplicate addresses

A small sample of the file looks like this:

24187@9674.com
29397@13000.com
18398@27118.com
23889@7053.com
24501@7413.com
9102@4788.com
16218@20729.com
991@21800.com
4718@19033.com
22504@28021.com

Performance:

For the sake of completeness, it's worth mentioning the performance:

perl:               sh:                 bash:

Total: 17012        Total:    17012     Total:    17012

real    0m0.119s    real    0m0.838s    real    0m0.973s
user    0m0.061s    user    0m0.772s    user    0m0.894s
sys     0m0.027s    sys     0m0.025s    sys     0m0.056s

Original Answer (Counted total addresses and not unique addresses):

tcount="$(cat temp | wc -l)" ; sort temp | uniq -c -i | sort -rn ; echo "Total: $tcount"
  • `tcount="$(cat temp | wc -l)"`: store the total input line count in a variable
  • `sort temp`: group identical addresses together, ready for `uniq`
  • `uniq -c -i`: count occurrences, allowing for case variation
  • `sort -rn`: sort numerically by the count and reverse the order (highest on top)
  • `echo "Total: $tcount"`: show the total address count at the bottom

Sample temp file:

john@domain.com
john@domain.com
donald@domain.com
john@domain.com
sam@domain.com
sam@domain.com
bill@domain.com
john@domain.com
larry@domain.com
sam@domain.com
larry@domain.com
larry@domain.com
john@domain.com

Sample Output:

   5 john@domain.com
   3 sam@domain.com
   3 larry@domain.com
   1 donald@domain.com
   1 bill@domain.com
Total:       13

Edit: See comments below regarding use of sort

  • I suggest not sorting entire lines unless you really mean to; to sort by a specific field, use `-k,` - note that even if only a single field is being sorted, both a start and a stop field must be specified (otherwise everything _starting_ with that field is sorted). – mklement0 Oct 20 '16 at 02:52
  • Now that we're clearer on the requirements: You're almost there, but the desired total count (of _unique_ email addresses) is not the count of _input_ lines, but the count of the lines output by `uniq -c`. – mklement0 Oct 20 '16 at 03:51
  • @mklement0 hmedia1 Hi, I worked out these 2 lines of code and got my desired output: tcount="$(uniq -c -i temp | wc -l)" ; sort temp | uniq -c -i | sort -rn; echo "Total : $tcount" However I'm still not completely sure why we sort twice. My assumption is that we sort the temp file to use the command uniq on it then sort it numerically. Am I on the right track? Also thank you guys for the explanations and help. – Daniel Oct 20 '16 at 04:02
  • @Daniel: `uniq -c` requires sorted input to work meaningfully, and it preserves the input sort order in its output, prefixed with the frequency count. Since you want the output sorted by descending frequency count instead, you need to sort again, this time by the 1st field, in reverse order. – mklement0 Oct 20 '16 at 04:07
    Ah I see. That makes a lot more sense. Thank you for the explanation and I'll definitely make it more easier on you guys next time. A bit new around here but the quality and amount of help given was awesome. Thanks again. – Daniel Oct 20 '16 at 04:41
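The pipeline Daniel arrived at in the comments can be sketched as a small self-contained demo. This is a minimal sketch assuming GNU coreutils; the sample addresses and the `/tmp/demo_temp` path are made up for illustration. It also applies mklement0's `-k1,1` advice, limiting the numeric sort key to the count field alone:

```shell
# Hypothetical sample data standing in for the real 'temp' file
printf '%s\n' yahoo.com aol.com yahoo.com aol.com yahoo.com netscape.net > /tmp/demo_temp

# 1st sort groups identical lines so uniq -c can count each run;
# uniq -c -i prefixes each unique line with its count, ignoring case;
# 2nd sort reorders by that count, highest first. -k1,1 restricts the
# numeric key to the count field, rather than everything from field 1 on.
sort /tmp/demo_temp | uniq -c -i | sort -rn -k1,1

# The unique-address total is simply the number of lines uniq -c emits
echo "total $(sort /tmp/demo_temp | uniq -c -i | wc -l)"
```

This prints yahoo.com (3), aol.com (2), and netscape.net (1) in descending order, then `total 3` — the two sorts serve different purposes, exactly as described above: the first enables counting, the second orders the counts.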