I have a tab-delimited file that looks like the following:

cluster.1   Adult.1
cluster.2   Comp.1
cluster.3   Adult.2
cluster.3   Pre.3
cluster.4   Pre.1
cluster.4   Juv.2
cluster.4   Comp.4
cluster.4   Adult.3
cluster.5   Adult.2
cluster.6   Pre.5

I would like to count how many times each entry occurs in column one and print that count in a new third column, so that the output looks like this:

cluster.1   Adult.1 1
cluster.2   Comp.1  1
cluster.3   Adult.2 2
cluster.3   Pre.3   2
cluster.4   Pre.1   4
cluster.4   Juv.2   4
cluster.4   Comp.4  4
cluster.4   Adult.3 4
cluster.5   Adult.2 1
cluster.6   Pre.5   1

Ultimately I plan to delete the rows where column 3 equals 1, but I figured that would probably be a two-step process. Thanks.

acalcino

4 Answers

Using join:

cut -f1 input | sort | uniq -c | sed 's/^ *\([0-9]*\) */\1\t/' | \
      join -t $'\t'  -1 2 -2 1 -o '2.1 2.2 1.1' - input

Output:

cluster.1   Adult.1 1
cluster.2   Comp.1  1
cluster.3   Adult.2 2
cluster.3   Pre.3   2
cluster.4   Pre.1   4
cluster.4   Juv.2   4
cluster.4   Comp.4  4
cluster.4   Adult.3 4
cluster.5   Adult.2 1
cluster.6   Pre.5   1
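
One caveat: join expects both inputs to be sorted on the join field, and the sample file happens to be sorted already. For unsorted input, a minimal sketch that sorts first might look like the following. The function name count_with_join is made up for illustration, awk replaces the sed step to avoid relying on \t in a sed replacement, and keys are assumed to contain no whitespace.

```shell
# count_with_join FILE: hypothetical wrapper around the pipeline above.
# It sorts FILE on column one first, because join needs ordered input.
# The body runs in a subshell ( ... ) so the variables stay local.
count_with_join() (
  tab=$(printf '\t')
  sorted=$(mktemp)
  # Sort and join under the same locale so both agree on the ordering.
  LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$sorted"
  # uniq -c prints "count key"; awk rewrites that as "count<TAB>key".
  cut -f1 "$sorted" | uniq -c | awk -v OFS='\t' '{ print $1, $2 }' |
    LC_ALL=C join -t "$tab" -1 2 -2 1 -o '2.1 2.2 1.1' - "$sorted"
  rm -f "$sorted"
)

# Usage (assuming the data is in a file named input):
#   count_with_join input
```

Note that the rows come out in sorted order, so the original row order is not preserved.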
perreal

With awk you can read the file twice as follows:

$ awk 'NR==FNR {a[$1]++; next} {print $0, a[$1]}' file file
cluster.1   Adult.1 1
cluster.2   Comp.1 1
cluster.3   Adult.2 2
cluster.3   Pre.3 2
cluster.4   Pre.1 4
cluster.4   Juv.2 4
cluster.4   Comp.4 4
cluster.4   Adult.3 4
cluster.5   Adult.2 1
cluster.6   Pre.5 1

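On the first pass NR==FNR is true (the overall record number equals the per-file record number only while the first file is being read), so the first block counts each column-one value in a[] and next skips the second block. On the second pass the second {} block prints each line followed by its count. Note that print $0, a[$1] joins the two with OFS, a single space by default; pass -v OFS='\t' to keep the output tab-delimited.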
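
Since the eventual goal in the question is to drop the clusters that occur only once, the same two-pass idea can count and filter in a single command. This is a minimal sketch, assuming tab-delimited input; the function name keep_repeated and the sample data file are made up for illustration.

```shell
# keep_repeated FILE: hypothetical helper. Prints only the rows whose first
# column occurs more than once, with the count appended as a third column.
keep_repeated() {
  awk 'BEGIN { FS = OFS = "\t" }              # read and write tab-delimited fields
       NR == FNR { count[$1]++; next }        # first pass: count column-one values
       count[$1] > 1 { print $0, count[$1] }  # second pass: keep repeated keys only
  ' "$1" "$1"
}

# Sample data (a subset of the input from the question) in a temporary file.
sample=$(mktemp)
printf 'cluster.1\tAdult.1\ncluster.3\tAdult.2\ncluster.3\tPre.3\ncluster.5\tAdult.2\n' > "$sample"
keep_repeated "$sample"
rm -f "$sample"
```

With this sample only the two cluster.3 rows are printed, each with the count 2 in the third column.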

fedorqui

Perl solution:

#!/usr/bin/perl
use warnings;
use strict;


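# Print every buffered line followed by a tab and the buffer size,
# i.e. the number of rows in the cluster that has just ended.
sub output {
    my $buffer_ref = shift;
    print "$_\t", 0 + @$buffer_ref, "\n" for @$buffer_ref;
}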


my $previous_cluster = q();
my @buffer;

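# Assumes rows of the same cluster are adjacent, as in the sample input.
while (<>) {
    chomp;
    my ($cluster) = split /\t/;
    if ($cluster ne $previous_cluster) {
        # A new cluster starts: flush the rows collected for the old one.
        output(\@buffer);
        @buffer = ();
        $previous_cluster = $cluster;
    }
    push @buffer, $_;
}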
# Do not forget to output the last cluster.
output(\@buffer);
choroba

A Bash solution using an associative array (requires Bash 4 or later; $infile is assumed to hold the path to the input file):

declare -A array

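# First pass: count how often each column-one value occurs.
while IFS=$'\t' read -r col1 col2 ; do
  ((array[$col1]++))
done < "$infile"

# Second pass: print each row followed by the count for its cluster.
while IFS=$'\t' read -r col1 col2 ; do
  printf '%s\t%s\t%s\n' "$col1" "$col2" "${array[$col1]}"
done < "$infile"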

The output:

cluster.1       Adult.1 1
cluster.2       Comp.1  1
cluster.3       Adult.2 2
cluster.3       Pre.3   2
cluster.4       Pre.1   4
cluster.4       Juv.2   4
cluster.4       Comp.4  4
cluster.4       Adult.3 4
cluster.5       Adult.2 1
cluster.6       Pre.5   1
Fritz G. Mehner