I'm trying to find blank user agents and traces of spoofed user agents in my Apache access logs.
Here's a typical line from my access log (IP and domain redacted):
x.x.x.x - - [10/Nov/2012:16:48:38 -0500] "GET /YLHicons/reverbnation50.png HTTP/1.1" 304 - "http://www.example.com/newaddtwitter.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/534.7 ZemanaAID/FFFF0077"
For blank user agents I'm trying to do this:
awk -F\" '($6 ~ /^-?$/)' /www/logs/www.example.com-access.log | awk '{print $1}' | sort | uniq
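For comparison, here is a minimal sketch of the same blank-UA check done in a single awk pass, which also counts hits per IP so the noisiest addresses come first. It assumes the combined log format shown in the sample line, where splitting on double quotes leaves the user agent in field 6. The function name blank_ua_ips is made up for illustration.

```shell
# Hypothetical helper: list IPs that sent a blank or "-" user agent,
# with a hit count per IP, busiest first.
# Assumes the combined log format, so that splitting on double quotes
# puts the user agent in field 6 and the client IP at the start of field 1.
blank_ua_ips() {
    awk -F'"' '
        $6 ~ /^-?$/ {
            split($1, parts, " ")   # parts[1] is the client IP
            hits[parts[1]]++
        }
        END {
            for (ip in hits) print hits[ip], ip
        }
    ' "$1" | sort -k1,1nr -k2,2
}

# Example invocation (path taken from the commands in this post):
# blank_ua_ips /www/logs/www.example.com-access.log
```

One caveat with this approach: a request line that itself contains a double quote shifts the field positions, so such lines can be misclassified.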
To get info about UAs I'm running this (it gives me the number of hits for each unique UA):
awk -F\" '{print $6}' /www/logs/www.example.com-access.log | sort | uniq -c | sort -fr
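A variant of the count, sketched with a numeric sort on the count column: sort -rn compares the counts as numbers, so the ordering does not depend on how uniq -c pads them. As above, it assumes the user agent is field 6 when splitting on double quotes. The function name ua_counts is made up for illustration.

```shell
# Hypothetical helper: hits per unique user agent, most frequent first.
# Assumes the combined log format (user agent is field 6 when the line
# is split on double quotes).
ua_counts() {
    awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn
}

# Example invocation (path taken from the commands in this post):
# ua_counts /www/logs/www.example.com-access.log | head -n 20
```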
What can I do differently to make these commands more robust and better thought out, while giving me the most useful information for combating bots and other scum of the Internet?