-1

I have two tab-separated with header text file.

File1 has 11 columns

Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9    Col10   Col1

1   NH1 NH1 Unknown 149578  B2                  202410200023_R02C02

2   NH2 NH2 Unknown 149578  A4                  202410200023_R04C01

10  NH10    NH10    Unknown 149578  A1                  202410200023_R01C01

11  BU51    BU51    Unknown 149578  B3                  202410200023_R03C02

12  BU52    BU52    Unknown 149578  A6                  202410200023_R06C01

file2 has 4, 5 or more columns.

Col1    Col2    Col3    Col4    Col5

BU51    149578  BU51    Unknown 1

BU52    149578  BU52    Unknown 1

NH1 149578  NH1 Unknown 1

NH2 149578  NH2 Unknown 1

I try to make a output with column 2 to 4 from file 1 and column 4 and 5 from file2. But the col2 or col3 from file1 need to match with col1 or col2 from file2. The columns of output may have new locations. Such as column2 from file1 may be located at column1 in output

I try to find the answer from How to compare and merge multiple files? I am not sure why the line foreach my $key (keys %ref){ push( @{$ref{$key} }, $current{key}} can not work well.

Victor.H
  • 157
  • 1
  • 8
  • 1
    Please show the code that you need us to help you with: we can't do anything without seeing it. Are there really blank lines in the data file? And are there really column headers `Col1`, `Col2` etc? – Borodin Sep 16 '18 at 09:49
  • 2
    One way: Use [DBD::CSV](https://metacpan.org/pod/DBD::CSV) to treat the files as tables in a database, and join them with the appropriate SQL statement. – Shawn Sep 16 '18 at 09:51
  • 1
    @Shawn: So write up an answer using that module. – Borodin Sep 16 '18 at 12:04
  • @Borodin: yes, there are blank lines in the data and all datasets have header. – Victor.H Sep 17 '18 at 00:07

1 Answers1

1

Your join critera is kind of confusing file2.col2 has nothing in common with file1.col2 or file1.col3... did you mean file2.col3? And you don't have any sample output so we know exactly what you're looking for.

But I think this'll work. It relies on the aforementioned DBD::CSV module for reading your tsv files and joining them, and DBIx::TSV for formatting the output. Both are available through your favorite CPAN client.

#!/usr/bin/perl
use warnings;
use strict;
use DBI;
use DBIx::TSV;

my $dbh = DBI->connect("dbi:CSV:", undef, undef,
                       {csv_tables => {
                                       "file1" => { f_file => "file1.tsv" },
                                       "file2" => { f_file => "file2.tsv" }
                                      },
                        csv_sep_char => "\t"
                       }) or die $DBI::errstr;

print $dbh->selectall_tsv(<<EOQ);
SELECT f1.col2 AS Col1, f1.col3 AS Col2, f1.col4 AS Col3, f2.col4 AS Col4, f2.col5 AS Col5
FROM file1 AS f1
   , file2 AS f2
WHERE f1.col2 = f2.col1 OR f1.col2 = f2.col3 OR f1.col3 = f2.col1 OR f1.col3 = f2.col3
EOQ
$dbh->disconnect;

which prints out

Col1    Col2    Col3    Col4    Col5
NH1 NH1 Unknown Unknown 1
NH2 NH2 Unknown Unknown 1
BU51    BU51    Unknown Unknown 1
BU52    BU52    Unknown Unknown 1

(SO formatting does not play well with tabs)

There are other ways of doing this, I'm sure, but when you start talking about joining lines of different files based on common fields, and then only displaying certain fields... that just screams relational data and SQL is my usual way of handling it even without an actual database.

Shawn
  • 47,241
  • 3
  • 26
  • 60