
I'm new to bash scripting. I have 2 files: a text file containing a list of IP addresses, and a .csv file with 2 columns, the 2nd of which contains IP addresses. I want to compare each line of the text file (each IP) with all elements of the 2nd column of the .csv file. If there are multiple entries with the same IP in the .csv file, I want to merge their first fields into one row. For example:

      column1       column2
row1: example.com   1.1.1.1
row2: example2.com  1.1.1.1

I want to convert it to this:

      column1       column2
row1: example.com   1.1.1.1
      example2.com  

I have written the values to the .csv and .txt files, but I cannot figure out how to compare and merge the similar ones. I have found this command, but cannot understand how to apply it:

comm -- select or reject lines common to two files
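For context, `comm` requires both inputs to be sorted and only reports whole lines common to the two files, so by itself it can tell you which IPs appear in both files but cannot merge the first column. A minimal sketch (file contents and names are illustrative):

```shell
# comm needs sorted input; with -12 it suppresses the "only in file 1" and
# "only in file 2" columns and prints just the lines common to both.
printf '1.1.1.1\n2.2.2.2\n' > ips.txt                                  # plain IP list
printf 'example.com,1.1.1.1\nexample3.com,3.4.5.6\n' > subdomainIP.csv # 2-column csv
cut -d ',' -f 2 subdomainIP.csv | sort -u > csv_ips.txt                # extract column 2
sort -u ips.txt | comm -12 - csv_ips.txt                               # prints: 1.1.1.1
```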

helen
  • .csv? What's your field separator? – Cyrus Feb 02 '20 at 09:48
  • @cyrus The fields are separated by "," – helen Feb 02 '20 at 09:50
  • On a side note: the handling of the dig output seems to me unnecessarily complicated, I'm curious where does `while read line; do echo $line,$(dig +short $line); done < URLs` fail you ? – Sorin Feb 02 '20 at 10:14
  • please show the raw input file contents and the actual desired output (ie, remove the `columnN` and `rowN` labels as this is just confusing); as for the example output ... do you really want the 2x addresses on separate lines? if not, then update the question to show the actual output you're looking for – markp-fuso Feb 02 '20 at 20:31
  • @markp yes I want to have a line with 2 columns. In the first column I want "example.com\n example2.com" and in the second column, I want the IP address. – helen Feb 02 '20 at 21:13
  • @Cyrus None of them solved my problem. – helen Feb 02 '20 at 21:14
  • what would the output look like if you have 2 lines with the same ip, but they're not contiguous, eg: `me.com 1.1.1.1 \n you.com 2.2.2.2 \n them.com 1.1.1.1`? do you need the input sorted (by IP), or do you want to keep the line ordering as is? – markp-fuso Feb 02 '20 at 21:18
  • I need to write the URLs that have the same IP in one line, separated by \n. – helen Feb 02 '20 at 21:26

4 Answers


Assumptions:

  • the csv file (with 2 columns: domain name + ip address) uses the comma (,) as a delimiter (this isn't shown in the sample data but the OP mentioned this in a comment)
  • no mention is made of any requirements to sort the final output in any particular order so I'll print the output in the same order as:
    • the ips occur in the first file
    • the domain addresses occur in the csv file
  • no sample was provided for the first file so I'm going to assume a single ip address per line
  • I'm not going to worry about the possibility of an ip address showing up more than once in the first file (ie, we'll just repeatedly print the same matching domain names each time the ip address shows up in the first file)
  • any entries in either file without a 'match' in the other file will not show up in the final output

Sample data:

$ cat domain.dat
example.com,1.1.1.1
example3.com,3.4.5.6
example5.com,11.12.13.14
exampleX.com,99.99.99.99    # no matches in ip.dat
example2.com,1.1.1.1
example4.com,11.12.13.14

$ cat ip.dat
1.1.1.1
2.2.2.2                     # no matches in domain.dat
3.4.5.6
7.8.9.10                    # no matches in domain.dat
11.12.13.14
1.1.1.1                     # repeat of an ip address

This awk solution starts by processing domain.dat to populate an array (domains[<ipaddress>]=<domainaddress>[,<domainaddress>]*); it then processes ip.dat to determine which domain addresses to print to stdout:

awk -F "," '

# first file: keep track of the longest domain address; to be used by printf

NR==FNR                      { if (length($1) > maxlen) { maxlen=length($1) } }

# first file: if the ip address is already an index in our array then append the current domain address to the array element; skip to next line of input

(NR==FNR) && ($2 in domains) { domains[$2]=domains[$2]","$1 ; next }

# first file: first time we have seen this ip address so create a new array element, using the ip address as the array index; skip to next line of input

NR==FNR                      { domains[$2]=$1             ; next}

# second file: if the ip address is an index in our array ...
# split the domain address(es), delimited by comma, into a new array named "arr" ...

( $1 in domains )            { split(domains[$1],arr,",")

                               # set the output line suffix to the ip address

                               sfx=$1

                               # loop through our domain addresses, appending the ip address to the end of the first line; after we print the first domain
                               # address + ip address, reset suffix to the empty string so successive printfs only display the domain address;
                               # the "*" in the format string says to read the numeric format from the input parameters - "maxlen" in this case

                               for (i in arr) { printf "%-*s   %s\n",maxlen,arr[i],sfx ; sfx="" }
                             }
' domain.dat ip.dat

NOTE: The embedded comments can be removed to reduce the clutter.

Results of running the above:

example.com    1.1.1.1
example2.com
example3.com   3.4.5.6
example5.com   11.12.13.14   # example5.com comes before example4.com in domain.dat
example4.com
example.com    1.1.1.1       # repeated because 1.1.1.1 was repeated in ip.dat
example2.com
markp-fuso

You could list all IPs, then loop over the IPs to get the domains which correspond to that IP.

#! /bin/bash
set -euo pipefail

FILENAME="$1"

readarray -t ip_addresses<<<"$(cut -d ',' -f 2 "$FILENAME" | sort -u)"

for ip in "${ip_addresses[@]}" ; do
    readarray -t domains_for_ip<<<"$(grep ",$ip$" "$FILENAME" | cut -d ',' -f 1)"
    echo "${domains_for_ip[*]},$ip"
done

With an input file of

example.com,1.1.1.1
example2.com,1.1.1.1
example3.com,1.1.1.2

you would get

example.com example2.com,1.1.1.1
example3.com,1.1.1.2

This script currently does not check whether the first argument ($1) is present, and it cannot tell when two spellings refer to the same address (it will consider 10.0.0.1 and 010.000.000.001 to be two distinct IPs). It also assumes there is no oddly placed whitespace in the file.
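The missing argument check mentioned above could look like the sketch below (wrapped in a function so it is easy to reuse; the usage text is illustrative, not part of the original answer):

```shell
# Guard clause: refuse to proceed without a CSV file argument.
# Returns 1 rather than exiting so callers decide how to fail.
require_csv_arg() {
    if [ "$#" -lt 1 ]; then
        echo "usage: <script> <csv-file>" >&2
        return 1
    fi
}
```

In the script above it would be invoked right after `set -euo pipefail`, e.g. `require_csv_arg "$@" || exit 1`, before assigning `FILENAME="$1"`.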

Simon Doppler
  • What is the $FILENAME here? Which file name should I write here? – helen Feb 02 '20 at 11:19
  • `$FILENAME` contains the name of the file with both the hostname and IP address (in your original script, it is `subdomainIP.csv`). My bad, I should have changed it. – Simon Doppler Feb 02 '20 at 13:34

Something like:

    while read -r IP; do
       grep "$IP" subdomainIP.csv | \
           cut -f1 -d',' | \
           tr '\n' ' ' | \
           sed 's/ $//'
       echo ",$IP"
    done < ipfile.txt
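With sample files matching the other answers, the loop merges the domains for each IP onto one line; a quick demonstration (file names taken from the snippet, `$IP` quoted, output as produced by GNU tools):

```shell
# Recreate sample data under the file names used in the snippet above.
printf 'example.com,1.1.1.1\nexample2.com,1.1.1.1\nexample3.com,1.1.1.2\n' > subdomainIP.csv
printf '1.1.1.1\n1.1.1.2\n' > ipfile.txt

while read -r IP; do
    # tr joins the matching domains with spaces, sed drops the trailing space
    # (leaving no final newline), so echo lands ",<ip>" on the same line.
    grep "$IP" subdomainIP.csv | cut -f1 -d',' | tr '\n' ' ' | sed 's/ $//'
    echo ",$IP"
done < ipfile.txt
# prints:
# example.com example2.com,1.1.1.1
# example3.com,1.1.1.2
```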
Sorin

Using Miller (https://github.com/johnkerl/miller) starting from

example.com,1.1.1.1
example2.com,1.1.1.1
example3.com,1.1.1.2

and running

mlr --csv -N nest --implode --values --across-records -f 1 ipfile.txt >output.txt

you will have

example.com;example2.com,1.1.1.1
example3.com,1.1.1.2

If you want the URLs separated by \n, the command is

mlr --csv -N nest --implode --values --across-records --nested-fs "\n" -f 1 ipfile.txt >output.txt
aborruso