How to use AWK to solve this 2 Table Data Join?

Question

I have 2 data tables as shown (they are 2 x tab-delimited files). I'm trying to populate the Table-2 Country column with the corresponding country from Table-1. Need to "join" from info in Table-2's Firstname field.

What's the best approach here, given the complexity of the data in the Table-2, Firstname column? Would other Mac tools be better to use than AWK e.g. Excel formulae, Perl, Filemaker etc?

TABLE1 (Input):

city_ascii  country iso2
Mavinga Angola  AO
Menongue    Angola  AO
Mucusso Angola  AO
Guines  Cuba    CU
Havana  Cuba    CU
Holguin Cuba    CU
Las Tunas   Cuba    CU
Manzanillo  Cuba    CU
Matanzas    Cuba    CU
Moron   Cuba    CU
Santa Clara Cuba    CU
Varadero    Cuba    CU

TABLE2 (Input):

Firstname
Fred, Havana
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)

TABLE2 (Result):

Firstname   Country
Fred, Havana  Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba) Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba   Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola

============ Here is debugging info in answer to Ed's Qs below:

awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table3.txt | cat -v

    1<city_ascii  country iso2><><>
    1<Mavinga Angola  AO><><>
    1<Menongue    Angola  AO><><>
    1<Mucusso Angola  AO><><>
    1<Guines  Cuba    CU><><>
    1<Havana  Cuba    CU><><>
    1<Holguin Cuba    CU><><>
    1<Las Tunas   Cuba    CU><><>
    1<Manzanillo  Cuba    CU><><>
    1<Matanzas    Cuba    CU><><>
    1<Moron   Cuba    CU><><>
    1<Santa Clara Cuba    CU><><>
    1<Varadero    Cuba    CU><><>

    ==============
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table4.txt | cat -v

    1<Firstname><><>
    1<Fred, Havana><><>
    1<James, (Varadero, Cuba)><><>
    1<Jack (Cuba)><><>
    1<Harry Varadero, Cuba><><>
    1<Josh Cuba><><>
    1<Gary, Mavinga & Other, Angola><><>
    1<Jamie, (Angola)><><>

    ===============
    cat -v tst.awk

    BEGIN { FS=OFS="\t" }
    NR==FNR {
        map[$1] = $2
        map[$2] = $2
        next
    }
    FNR==1 {
        print
        FS=" "
        next
    }
    {
        orig = $0
        country = ""
        gsub(/[^[:alpha:]]/," ")
        for (i=NF; i>0; i--) {
            if ($i in map) {
                country = map[$i]
                break
            }
        }
        print orig, country
    }

    ===============
    awk -f tst.awk Table3.txt Table4.txt >output.txt

    Firstname
    Fred, Havana    
    James, (Varadero, Cuba) 
    Jack (Cuba) 
    Harry Varadero, Cuba    
    Josh Cuba   
    Gary, Mavinga & Other, Angola   
    Jamie, (Angola) 

    ================
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' output.txt | cat -v

    1<Firstname><><>
    2<Fred, Havana><><>
    2<James, (Varadero, Cuba)><><>
    2<Jack (Cuba)><><>
    2<Harry Varadero, Cuba><><>
    2<Josh Cuba><><>
    2<Gary, Mavinga & Other, Angola><><>
    2<Jamie, (Angola)><><>

Please add your desired output for that sample input to your question. — Cyrus, Jul 10 '18 at 17:54
It is a decent question but I am not going to retype your example to answer it. Please provide your example as TEXT that one can easily copy. — dawg, Jul 10 '18 at 17:55
@Cyrus -- Table2 is the desired output. It shows the resulting "Country" column with data added. Thanks. — s2016, Jul 10 '18 at 18:07
Why do you need Table1, Firstname column already includes the country as the last word? — karakfa, Jul 10 '18 at 19:22
@karakfa -- apologies for the confusion. Table2/Firstname column could also have content e.g. of David, Havana. No country included there. Needs to be obtained from Table1. — s2016, Jul 10 '18 at 20:43
@Ed Morton -- Table2 above is my required output. Table2 but with a blank "Country" column is the starting point, i.e. I have 2 Tables as the starting point. "Harry Varadero" e.g. is the data I'm starting with in a "vague" format. — s2016, Jul 10 '18 at 20:46
@ed -- I will need to manually adjust the data after-the-fact for any issues e.g. Cuba, Kansas, USA. Can't be avoided in the input files unfortunately. Not expecting many issues. — s2016, Jul 10 '18 at 21:05

daxim · Answer 1 · 2018-07-10T18:53:33.363

6

use DBI qw();
require DBD::CSV;
use List::Util 1.45 qw(uniq);

chdir '/tmp'; # location of csv files
my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
    f_ext => '.csv',
    csv_sep_char => "\t",
    RaiseError => 1,
}) or die "Cannot connect: $DBI::errstr";

for my $country (
    uniq map { $_->[0] }
    # sql distinct not implemented
    $dbh->selectall_array('select country from table1')
) {
    $dbh->do(
        'update table2 set Country = ? where Firstname like ' .
            $dbh->quote("%$country%"),
        {},
        $country
    );
}

edited Jul 10 '18 at 18:53

answered Jul 10 '18 at 18:13

daxim

39,270
4
65
132

Thanks. I'm a perl novice and am getting an error with your script. I added it to a perljoin.txt file and ran from its folder in Mac Terminal > perl perljoin.txt -----> Can't locate List/AllUtils.pm in @INC (you may need to install the List::AllUtils module) – s2016 Jul 10 '18 at 18:29
2

I have replaced it with List::Util, which you hopefully already have installed. – daxim Jul 10 '18 at 18:54
I've installed both List/AllUtils and also DBD::CSV using CPAN. I reran version 1 of the code and get this error message ----> Can't locate object method "selectall_array" via package "DBI::db" at perljoin.txt line 12. I get exactly the same error when I run Version 2. – s2016 Jul 10 '18 at 20:27
.....and when I try to install this: install DBI::db Reading '/Users/fred/.cpan/Metadata' Database was generated on Tue, 10 Jul 2018 15:29:02 GMT Warning: Cannot install DBI::db, don't know what it is. – s2016 Jul 10 '18 at 20:38
1

Tip: One usually wants `PrintError => 0` with `RaiseError => 1` – ikegami Jul 11 '18 at 01:20
1

Tip: Passing `f_dir => "/tmp"` to `connect` is better than `chdir '/tmp';` – ikegami Jul 11 '18 at 01:21
2

@s2016, DBI::db is the namespace used internally by DBI for database handles (`$dbh`). Try upgrading DBI (`cpan DBI`). Maybe your version doesn't have `$dbh->selectall_array`? – ikegami Jul 11 '18 at 01:25

dawg · Answer 2 · 2018-07-11T19:05:16.733

1

If I understand what you are doing, it is taking the first (city) and second column (country) of this \t delimited file:

city_ascii  country iso2
Mavinga Angola  AO
Menongue    Angola  AO
Mucusso Angola  AO
Guines  Cuba    CU
Havana  Cuba    CU
Holguin Cuba    CU
Las Tunas   Cuba    CU
Manzanillo  Cuba    CU
Matanzas    Cuba    CU
Moron   Cuba    CU
Santa   Clara   Cuba    CU
Varadero    Cuba    CU

And matching the string in this filetogether with this single column file:

Firstname
Fred, Havana, Cuba
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)

To produce the two column file in your example.

The awk does that:

awk -F '\t' 'FNR==NR{city[$1]=$2; ct[$2]; next}
             # ^^ FNR==NR means it is the first file; set city and country      
     FNR==1 {printf "%s\t%s\n", $0,"Country"; next}
     # ^^   second file, first line - print the header   
     {split($0, arr, /[^[:alpha:]]/)
      # ^ split word like things from paren, punctuation, etc
      for (e in arr) {s=arr[e]   # loop over those words
                      if (s in city) { printf "%s\t%s\n", $0,city[s]; next }
                      # ^ a city? print that
                      if (s in ct) { printf "%s\t%s\n", $0,s; next }}
                      # ^ a country? print that
                      }' file1 file2
Firstname   Country
Fred, Havana    Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba) Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba   Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola

The next statements tell awk to go to the next line of the file.

edited Jul 11 '18 at 19:05

answered Jul 10 '18 at 20:50

dawg

98,345
23
131
206

Thanks. Will this gawk work for a Firstname field = Fred, Havana ? I've updated the OP for clarity. i.e. There is not always a country embedded in the Firstname field -- that's why I need to "Join" with my Table1. – s2016 Jul 10 '18 at 21:12
Correct about the input files. When I run the new GAWK I get no countries on the output ---------> Firstname Fred, Havana James, (Varadero, Cuba) Jack (Cuba) Harry Varadero, Cuba Josh Cuba Gary, Mavinga & Other, Angola Jamie, (Angola) Country – s2016 Jul 10 '18 at 22:30
1

@EdMorton: Habits die hard in late 50's – dawg Jul 10 '18 at 23:04
@dawg -- when I copy/pasted both input text from OP and rerun I get this error output ----> Firstname Country Fred, Havana James, (Varadero, Cuba) Jack (Cuba) Harry Varadero, Cuba Gary, Mavinga & Other, Angola Jamie, (Angola) – s2016 Jul 10 '18 at 23:15
It is likely that you file is not tab delimited then – dawg Jul 11 '18 at 00:09
I can try with | pipe-delimited files for input and output to test it. How should your script be changed? Thanks. – s2016 Jul 11 '18 at 00:21
If you mean using `|` to delimit fields, just change `\t` to `|` wherever you see it. – dawg Jul 11 '18 at 00:35
Thanks - your code works. After I clean-up the input files with both a | pipe-delimited (or tabs). Could you help me to understand your Awk code a little? – s2016 Jul 11 '18 at 16:50
@s2016: Great. I added comments. If this works for you, please consider accepting the answer. – dawg Jul 11 '18 at 18:15
I’d like to exec the awk via a text file from Mac bash command line and am having problems formatting it. Getting ‘ and {} positions wrong. – s2016 Jul 13 '18 at 17:51

score 1 · Accepted Answer · answered Jul 10 '18 at 21:16

1

It sounds like this might be what you're looking for:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    map[$1] = $2
    map[$2] = $2
    next
}
FNR==1 {
    print
    FS=" "
    next
}
{
    orig = $0
    country = ""
    gsub(/[^[:alpha:]]/," ")
    for (i=NF; i>0; i--) {
        if ($i in map) {
            country = map[$i]
            break
        }
    }
    print orig, country
}

$ awk -f tst.awk file1 file2
Firstname       Country
Fred, Havana    Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba)     Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba       Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola

answered Jul 10 '18 at 21:16

Ed Morton

188,023
17
78
185

When I run this I get output of ----------> Firstname,Country "Fred, Havana", "James, (Varadero, Cuba)", Jack (Cuba), "Harry Varadero, Cuba", Josh Cuba, "Gary, Mavinga & Other, Angola", "Jamie, (Angola)", --------> i.e. missing the Country name. Thanks. – s2016 Jul 10 '18 at 21:52
I now get wrong output as follows -----> Firstname Fred, Havana James, (Varadero, Cuba) Jack (Cuba) Harry Varadero, Cuba Josh Cuba Gary, Mavinga & Other, Angola Jamie, (Angola) – s2016 Jul 10 '18 at 22:38
If any easier both input files can be | pipe-delimited as can the output file. – s2016 Jul 10 '18 at 22:41
Ed -- I just replicated the same error by doing the following: I copy/pasted from my OP above the "TABLE1 (Input)" into Table3.txt. Ditto for TABLE2 (Input) into Table4.txt. In that order. I ran: awk -f tst.awk Table3.txt Table4.txt >output.txt – s2016 Jul 10 '18 at 22:51
I'm on Mac High Sierra running GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.1, GNU MP 6.1.2). – s2016 Jul 10 '18 at 23:09
Done -- please see the OP. – s2016 Jul 11 '18 at 00:35
Thanks - you are correct and the script now works. Trying to understand the code (I am an Awk newbie). Can you explain this part? ---> { orig = $0 country = "" gsub(/[^[:alpha:]]/," ") for (i=NF; i>0; i--) { if ($i in map) { country = map[$i] break } } print orig, country } – s2016 Jul 11 '18 at 16:18
that saves the original input, then converts all non-alphanumeric characters in the line to blanks so all that's left is a series of alphanumeric strings/words, then loops backwards from the end stopping at the first one that appears to be the name of a country as stored in the map[] array. Then it prints the original input followed by the derived country. – Ed Morton Jul 12 '18 at 00:41
— rather than matching words from right to left can this be reversed? First words are more accurate matches. – s2016 Jul 13 '18 at 17:53
1

Of course. Just look at the loop `for (i=NF; i>0; i--)`, the syntax to flip it should be obvious `for (i=1; i<=NF; i++)`, but in the example you provided in your question, e.g. `James, (Varadero, Cuba)`, right to left is clearly the way to go to determine the country as otherwise you'll be guessing at countries based on peoples names when the country is right there in the input. – Ed Morton Jul 13 '18 at 18:03

How to use AWK to solve this 2 Table Data Join?

3 Answers3

Linked