Extracting various types of IP addresses from multiple formatted sources

Question

I have an IP blocking script that pulls both IP addresses and address ranges from 3 sources, merges them into a single file, then imports them with ipset and iptables using some code "borrowed" from OpenWrt's BanIP. It uses Awk to filter the lines through a regex to get clean output, then adds some text before each IP address to create an iptables-compatible hash.


# download sources
####################################################################################################

# manually added ips
##################################################

cat <<IPS > /tmp/ips.txt
#
# ip address 1
198.54.xxx.xxx
#
IPS

# sources from links (main)
##################################################

cat <<IPS2 | sed 's/#.*$//g' | tr "\n" " " | xargs curl --retry 5 -s >> /tmp/ips.txt && echo "Downloaded IP lists 1/2"
#
# DoH providers used to circumvent hosts blocking
https://raw.githubusercontent.com/dibdot/DoH-IP-blocklists/master/doh-ipv{4,6}.txt
# Full bogons
https://www.team-cymru.org/Services/Bogons/fullbogons-ipv{4,6}.txt
# Firehol level 1 and 2
https://iplists.firehol.org/files/firehol_level{1,2}.netset
# country ip allocations
https://stat.ripe.net/data/country-resource-list/data.json?resource=ng
# bots/spammers recent
https://iplists.firehol.org/files/firehol_abusers_1d.netset
#
IPS2

# sources from whois
##################################################

{
#
# social login providers
# facebook ip class
whois -h whois.radb.net -- '-i origin AS32934'
# twitter ip class
whois -h whois.radb.net -- '-i origin AS13414'
# apple ip class
whois -h whois.radb.net -- '-i origin AS714'
#
} | grep route >> /tmp/ips.txt && echo "Downloaded IP lists 2/2"

# create blocklists, awk arguments are taken from OpenWrt's BanIP
####################################################################################################

awk 'NR==1{print "create ipmaster hash:net family inet hashsize 750000 maxelem 1000000"}''/^(([0-9]{1,3}\.){3}[0-9]{1,3}(\/[0-9]{1,2})?)([[:space:]]|$)/{print "add ipmaster "$1}' /tmp/ips.txt | ipset restore -!
awk 'NR==1{print "create ipmaster6 hash:net family inet6 hashsize 750000 maxelem 1000000"}''/^([0-9a-fA-F]{0,4}:){1,7}[0-9a-fA-F]{0,4}(:\/[0-9]{1,2})?([[:space:]]|$)/{print "add ipmaster6 "$1}' /tmp/ips.txt | ipset restore -!

# import blocklists
iptables -v -I INPUT -m set --match-set ipmaster src -j DROP
ip6tables -v -I INPUT -m set --match-set ipmaster6 src -j DROP
echo "$(ipset list ipmaster | wc -l) blocked IPv4 IPs"
echo "$(ipset list ipmaster6 | wc -l) blocked IPv6 IPs"

# clean up
rm /tmp/ips.txt

The problem, however, is that the awk regex filter is not "catching" some IP addresses, and in some cases it drops the trailing /xx part that marks the address space.

For instance, it omits entries like this altogether:

"2001:4270::/32",

The same goes for anything else inside quotes, etc.

Do you suggest I correct the Awk with a better regex, and if so, what should I use? I have exactly zero knowledge of regex rules.

I do not need the IP addresses to be validated. iptables can take errors with no problem, and I think it validates them anyway, so what I actually need is a regex that is as simple as possible.

Or should I use an external Python library or Perl? The script will have access to both.
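
One possible direction for the "as simple as possible" requirement, offered as a sketch rather than a tested fix for this script: drop the `^` anchor and let `grep -oE` print only the matching part of each line, so surrounding quotes, trailing commas, or a leading "route:" no longer matter. The function names `extract_ipv4` and `extract_ipv6` are hypothetical, the sketch assumes a grep that supports `-o` and `-E`, and the patterns do no validation, so they can also pick up look-alikes such as dotted version numbers or `00:00:00` timestamps.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the original script): read text on stdin
# and print every IPv4/IPv6-looking token, one per line, wherever it sits.

extract_ipv4() {
    # Four dot-separated groups of 1-3 digits, optional /prefix. No validation.
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}(/[0-9]{1,2})?'
}

extract_ipv6() {
    # Two to seven colon-terminated hex groups, an optional last group,
    # optional /prefix. Requiring two groups keeps "route:" from matching.
    grep -oE '([0-9a-fA-F]{0,4}:){2,7}[0-9a-fA-F]{0,4}(/[0-9]{1,3})?'
}

# Sample lines shaped like the sources in the question.
printf '%s\n' '"2001:4270::/32",' 'route:      17.84.198.0/24 ' | extract_ipv6
# prints: 2001:4270::/32
printf '%s\n' '"2001:4270::/32",' 'route:      17.84.198.0/24 ' | extract_ipv4
# prints: 17.84.198.0/24
```

The output could then be fed into the existing pipeline, for example `extract_ipv4 < /tmp/ips.txt | sed 's/^/add ipmaster /'`, with the `create` line printed first as before.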

I came up with a dirty temporary workaround: for the whois sources, which output "route: 17.84.198.0/24 ", I delete the first 8 characters and the spaces with " | sed 's/^........//' | tr -d ' '". This yields the bare IP, "17.84.198.0/24", which the awk regex can then identify as an IP address. However, it's way too unreliable for daily use :) — Hitman47, Apr 03 '21 at 01:23
And for ipv6 addresses, see [Regular expression that matches valid IPv6 addresses](https://stackoverflow.com/questions/53497/regular-expression-that-matches-valid-ipv6-addresses) — Luuk, Apr 03 '21 at 15:49
@Luuk How do I use that regex with awk to extract addresses? For instance the (( )) in the first answer there. Grep just extracts one address, then stops. — Hitman47, Apr 03 '21 at 17:25
I am not an expert in regex's.. I was just finding that post on stackoverflow and was, a bit, surprised that such a long regex was needed to validate an IPv6 address. [regex101](https://regex101.com/r/7PdhNv/1) has a nice debugger for regex's, but I do not know which IPv6 address should match the first part of this regex. — Luuk, Apr 03 '21 at 18:50
@Luuk Yeah indeed, it's a long regex to even discover a v6 address, let alone validate it. For me IPv6 is too redundant; if one had simply added 2 more columns to the IPv4 format, it would be enough for billions of devices, but yeah, it's what the industry decided :) Thank you anyway, I'll play around in regex101 or just make do with more dirty hacks :) — Hitman47, Apr 03 '21 at 19:00
The regex is working: `echo "2001:4270::/32" | awk '/^([0-9a-fA-F]{0,4}:){1,7}[0-9a-fA-F]{0,4}(:\/[0-9]{1,2})?([[:space:]]|$)/ { print "Matched:", $0 }'` but note that the double quotes are not part of the data in my example. You could just add them if you need to. — Ludovic Kuty, Apr 04 '21 at 06:36
The `/xx`, where xx is a number, is not part of the IP-address. It is used to identify a block of addresses. — Luuk, Apr 04 '21 at 07:53
@Luuk Is the /xx needed when doing IP blocking? Does it indicate other IP addresses, or just particular subnets within that address, to avoid excessive blocking of e.g. a VPN IP with a legitimate user on it? — Hitman47, Apr 05 '21 at 17:49
If you only want to block 1 address, you do not need the `/xx`. If you want to block something like Facebook, it's better to block their complete range with 1 line that expresses the complete range (if it were only 1 line for Facebook). — Luuk, Apr 05 '21 at 17:54
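
To make the `/xx` point from the comments concrete: for IPv4 the number after the slash is the prefix length, so a `/xx` block covers 2^(32 - xx) addresses. A minimal sketch in shell arithmetic, where `cidr_size` is a hypothetical helper name and 64-bit shell arithmetic is assumed for very short prefixes:

```shell
#!/bin/sh
# Hypothetical helper: number of IPv4 addresses covered by a prefix length.
cidr_size() {
    echo $(( 1 << (32 - $1) ))
}

cidr_size 32   # prints 1   (a single address, same as omitting /xx)
cidr_size 24   # prints 256 (e.g. 17.84.198.0/24 spans 17.84.198.0 to 17.84.198.255)
```

So dropping the `/24` from a route entry would block one address instead of the whole 256-address range.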
