Assumptions:
- the csv file (with 2 columns: domain name + ip address) uses the comma (,) as a delimiter (this isn't shown in the OP's sample data but the OP mentioned it in a comment)
- no mention is made of any requirement to sort the final output in any particular order, so I'll print the output in the same order as:
  - the ips occur in the first file (ip.dat in the sample data below)
  - the domain addresses occur in the csv file (domain.dat in the sample data below)
- no sample was provided for the first file so I'm going to assume a single ip address per line
- I'm not going to worry about the possibility of an ip address showing up more than once in the first file (ie, we'll just repeatedly print the same matching domain names each time the ip address shows up in the first file)
- any entries in either file without a 'match' in the other file will not show up in the final output
Sample data:
$ cat domain.dat
example.com,1.1.1.1
example3.com,3.4.5.6
example5.com,11.12.13.14
exampleX.com,99.99.99.99 # no matches in ip.dat
example2.com,1.1.1.1
example4.com,11.12.13.14
$ cat ip.dat
1.1.1.1
2.2.2.2 # no matches in domain.dat
3.4.5.6
7.8.9.10 # no matches in domain.dat
11.12.13.14
1.1.1.1 # repeat of an ip address
This awk solution starts by processing domain.dat to populate an array (domains[<ipaddress>]=<domainaddress>[,<domainaddress>]*); it then processes ip.dat to determine which domain addresses to print to stdout:
awk -F "," '
# first file: keep track of the longest domain address; to be used by printf
NR==FNR { if (length($1) > maxlen) { maxlen=length($1) } }
# first file: if the ip address is already an index in our array then append the current domain address to the array element; skip to next line of input
(NR==FNR) && ($2 in domains) { domains[$2]=domains[$2]","$1 ; next }
# first file: first time we have seen this ip address so create a new array element, using the ip address as the array index; skip to next line of input
NR==FNR { domains[$2]=$1 ; next}
# second file: if the ip address is an index in our array ...
# split the domain address(es), delimited by comma, into a new array named "arr" ...
( $1 in domains ) { n=split(domains[$1],arr,",")
# set the output line suffix to the ip address
sfx=$1
# loop through our domain addresses by numeric index ("for (i in arr)" does not guarantee any particular order), appending the ip address to the end of the first line;
# after we print the first domain address + ip address, reset suffix to the empty string so successive printfs only display the domain address;
# the "*" in the format string says to read the field width from the input parameters - "maxlen" in this case
for (i=1; i<=n; i++) { printf "%-*s %s\n",maxlen,arr[i],sfx ; sfx="" }
}
' domain.dat ip.dat
NOTE: The embedded comments can be removed to reduce the clutter.
Results of running the above:
example.com  1.1.1.1
example2.com
example3.com 3.4.5.6
example5.com 11.12.13.14 # example5.com comes before example4.com in domain.dat
example4.com
example.com  1.1.1.1 # repeated because 1.1.1.1 was repeated in ip.dat
example2.com