I'm a total newbie trying to use Python to analyze my company's log files. They use a custom format, so online log analyzers don't work well on them.
The format is as follows:
localtime time-taken x-cs-dns c-ip sc-status s-action sc-bytes
cs-bytes cs-method cs-uri-scheme cs-host cs-uri-port cs-uri-path
cs-uri-query cs-username cs-auth-group s-hierarchy s-supplier-name
rs(Content-Type) cs(Referer) cs(User-Agent) sc-filter-result
cs-categories x-virus-id s-ip
Example:
"[27/Feb/2012:06:00:01 +0900]" 65 10.184.17.23 10.184.17.23 200
TCP_NC_MISS 99964 255 GET http://thumbnail.image.example.com 80
/mall/shop/cabinets/duelmaster/image01.jpg - - -
DIRECT thumbnail.image.example.com image/jpeg - "Wget/1.12
(linux-gnu)" OBSERVED "RC_White_list;KC_White_list;Shopping" -
10.0.201.17
The main thing I want to do right now is to grab all the cs-host and cs-uri-path fields, concatenate them together (to get http://thumbnail.image.example.com/mall/shop/cabinets/duelmaster/image01.jpg
in the above example), count the unique instances, and rank them by number of accesses to see the top URLs. Is there a way to make Python treat the whitespace-separated fields as columns and grab, say, the 11th one?
Another complication: our daily log files are huge (~15 GB), and ideally I want this to run in under 20 minutes if possible.
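For what it's worth, here is a minimal sketch of one way this could look, assuming plain `str.split()` is good enough for this format. Note the bracketed timestamp contains a space, so it splits into two tokens, pushing cs-host to index 10 and cs-uri-path to index 12 (the function name and file path are placeholders, not part of the original post):

```python
from collections import Counter

def top_urls(lines, k=10):
    """Count cs-host + cs-uri-path occurrences and return the k most common."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # The bracketed timestamp splits into two tokens, so cs-host is
        # the 11th token (index 10) and cs-uri-path the 13th (index 12).
        if len(fields) > 12:
            counts[fields[10] + fields[12]] += 1
    return counts.most_common(k)

# Iterating over the file object reads one line at a time, so memory
# stays flat even for a ~15 GB log:
# with open("access.log") as f:   # hypothetical path
#     for url, count in top_urls(f):
#         print(count, url)
```

This only works because both fields of interest come before the quoted User-Agent; fields after it would need quote-aware parsing (e.g. the csv module with a space delimiter).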
Niklas B.'s code is working nicely and I can print the top IPs, users etc.
Unfortunately, I cannot get the program to print or write this to an external file or email it. Currently my code looks like this, and only the last line gets written to the file. What might be the problem?
for ip, count in heapq.nlargest(k, sourceip.iteritems(), key=itemgetter(1)):
    top = "%d %s" % (count, ip)
    v = open("C:/Users/guest/Desktop/Log analysis/urls.txt", "w")
    print >>v, top
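In case it helps anyone reading: a likely culprit is that the file is opened in "w" mode inside the loop, which truncates it on every iteration, leaving only the last line. A sketch of the fix (rewritten for Python 3, with placeholder stand-ins for sourceip and k, which in the original come from the rest of the program):

```python
import heapq
from operator import itemgetter

# Placeholder data standing in for the real per-IP counts and cutoff.
sourceip = {"10.184.17.23": 42, "10.0.201.17": 7}
k = 2

# Open the file ONCE, before the loop: "w" mode truncates the file to
# empty each time it is opened, which is why only the last line survived.
with open("urls.txt", "w") as v:
    for ip, count in heapq.nlargest(k, sourceip.items(), key=itemgetter(1)):
        v.write("%d %s\n" % (count, ip))
```

The with statement also closes (and flushes) the file when the loop finishes, so nothing is lost in an unflushed buffer.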