-3

I have a csv of data with about 20 columns and each column will have more than one distinct value. Each row after the top one which is the header, is an individual data sample. I want to narrow the list down programatically so that I have the smallest number of data samples but each permutation of the column data is still represented.

Example data

SERIAL,ACTIVE,COLOR,CLASS,SEASON,SEATS
.0xb468d47cc9749fb862990426ff79aafb,T,GREEN,BETA,SUMMER,3
.0x847129b35bad62f5837eec30dc07a8a4,T,VIOLET,DELTA,SUMMER,1
.0x14b8df88fd6d6547e387f4caa99e52fd,F,ORANGE,ALPHA,SUMMER,4
.0x0a07fb97224caf79ea73d3fdd5495b8f,T,YELLOW,DELTA,WINTER,1
.0x7d747e689bb27b60198283d7b86db409,F,READ,DELTA,SPRING,2
.0x8247524df49bd19c4c316ee070a2dd4a,T,BLUE,GAMA,WINTER,2
.0x4103ed42af6e8e463708a6c629907fb5,T,YELLOW,ALPHA,SPRING,5
.0xc38deea7f02fbfbcdde1d3718d6decb4,T,YELLOW,DELTA,FALL,5
.0xa3d562edcf64e151d7de08ff8f8e0a94,F,VIOLET,DELTA,SUMMER,3
.0x9da58b3b05603325c24629f700c25c97,T,YELLOW,OMEGA,SPRING,4
.0xef0c0e75083229d654c9b111e3af8726,T,BLUE,GAMA,FALL,1
.0xa9022c8713f0aba2a8e1d20475a3104a,T,YELLOW,BETA,SUMMER,2
.0x5bb5f73e6030730610866cee80cfc2fb,F,ORANGE,BETA,FALL,5
.0xc202e5b43dd65525754fdc52b89e7375,T,BLUE,OMEGA,SUMMER,3
.0xfac9145af33a74aedae7cc0442426432,F,READ,BETA,SPRING,1
.0x457949648053f710b4f2d55cb237a91d,T,GREEN,BETA,SPRING,3
.0xed94d4df300f10f5c4dc5d3ac76cf9e5,F,VIOLET,ALPHA,WINTER,15
.0x870130135beed4cbbe06478e368b40b3,F,YELLOW,ALPHA,SPRING,3
.0x3b6f17841edb9651e732e3ffbacbe14a,T,GREEN,OMEGA,SUMMER,3
.0xfb30e054466b9e4cf944c8e48ff74c93,F,VIOLET,DELTA,SUMMER,8
.0xf741ddc71b4a667585acaa35b67dc6c9,F,BLUE,BETA,FALL,4
.0x60257ad6c299e466086cc6e5bb0a9a33,F,VIOLET,OMEGA,SPRING,1
.0xa5d208bfee5a27a7619ba07dcbdaeea0,T,GREEN,OMEGA,FALL,1
.0x53bc78fa8863e53e8c9fb11c5f6d2320,F,GREEN,GAMA,SPRING,2
.0x5a01253ce5cb0a6aa5213f34f0b35416,T,READ,BETA,WINTER,3
.0xaed9a979ba9f6fbf39895b610dde80f4,T,ORANGE,DELTA,WINTER,1
.0xe7769918e36671af77b5d3d59ea15cfe,T,ORANGE,OMEGA,FALL,4
.0x9e5327a1583332e4c56d29c356dbc5d2,T,INDEGO,ALPHA,WINTER,5
.0x79c5c70732ff04b4d00e81ac3a07c3b7,T,READ,OMEGA,FALL,5
.0x55f54d3c9cd2552e286364894aeef62a,F,READ,GAMA,SPRING,15
jbird468
  • 11
  • 5

1 Answers1

0

Use a hash to determine whether you have seen a particular column combination before, and then use that to determine whether to print a particular line.

Here is a rather basic example to demonstrate the idea:

filter.pl

#!/usr/bin/env perl

use warnings;
use strict;

die "usage: $0 file col1,col2,col3, ... coln\n" unless @ARGV;
my ($file, $columns) = @ARGV;
-f $file or die "$file does not exist!";
defined $columns or die "need to pass in columns!";

my @columns;
for my $col ( split /,/, $columns ) {
    die "Invalid column id $col" unless $col >= 1; # 1-based
    push @columns, $col - 1; # 0-based
}
scalar @columns or die "No columns!";

open my $fh, "<", $file or die "Unable to open $file : $!";
my %uniq;
while (<$fh>) {
    chomp();
    next if $. == 1; # Skip Header
    my (@data) = split /,/, $_; # Use Text::CSV for any non-trivial csv file
    my $key = join '|', @data[ @columns ]; # key will look like 'foo|bar|baz'
    if (not defined $uniq{ $key } ) {
        print $_ . "\n"; # Print the whole line with the first unique set of columns
        $uniq{ $key } = 1; # Now we have seen this combo
    }
}

data.csv

SERIAL,TRUTH,IN,PARALLEL
123,TRUE,YES,5
124,TRUE,YES,5
125,TRUE,YES,3
126,TRUE,NO,5
127,FALSE,YES,1
128,FALSE,YES,3
129,FALSE,NO,7

Output

perl filter.pl data.csv 2,3
123,TRUE,YES,5
126,TRUE,NO,5
127,FALSE,YES,1
129,FALSE,NO,7
xxfelixxx
  • 6,512
  • 3
  • 31
  • 38
  • That's it! I knew I had to keep track of what I had seen already but I did not know how to apply that to the data. Thanks! – jbird468 Mar 30 '17 at 15:47