I'm a total newbie trying to use Python to analyze my company's log files. They use a custom format, so online log analyzers don't work well on them.
The format is as follows:
localtime time-taken x-cs-dns c-ip sc-status s-action sc-bytes
cs-bytes cs-method cs-uri-scheme cs-host cs-uri-port cs-uri-path
cs-uri-query cs-username cs-auth-group s-hierarchy s-supplier-name
rs(Content-Type) cs(Referer) cs(User-Agent) sc-filter-result
cs-categories x-virus-id s-ip
Example:
"[27/Feb/2012:06:00:01 +0900]" 65 10.184.17.23 10.184.17.23 200
TCP_NC_MISS 99964 255 GET http://thumbnail.image.example.com 80
/mall/shop/cabinets/duelmaster/image01.jpg - - -
DIRECT thumbnail.image.example.com image/jpeg - "Wget/1.12
(linux-gnu)" OBSERVED "RC_White_list;KC_White_list;Shopping" -
10.0.201.17
The main thing I want to do right now is to grab all the cs-host and cs-uri-path fields, concatenate them together (to get http://thumbnail.image.example.com/mall/shop/cabinets/duelmaster/image01.jpg
in the above example), count the unique instances, and rank them by number of accesses to see the top URLs. Is there a way to make Python treat the whitespace-separated fields as columns and grab, say, the 11th one?
Another complication: our daily log files are huge (~15 GB), and ideally I want this to run in under 20 minutes if possible.
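For what it's worth, here is a minimal sketch of one way this could look, assuming plain `str.split()` is good enough for this format. Note the bracketed timestamp contains a space, so it splits into two tokens, pushing cs-host to index 10 and cs-uri-path to index 12 (the function name and file path are placeholders, not part of the original post):

```python
from collections import Counter

def top_urls(lines, k=10):
    """Count cs-host + cs-uri-path occurrences and return the k most common."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # The bracketed timestamp splits into two tokens, so cs-host is
        # the 11th token (index 10) and cs-uri-path the 13th (index 12).
        if len(fields) > 12:
            counts[fields[10] + fields[12]] += 1
    return counts.most_common(k)

# Iterating over the file object reads one line at a time, so memory
# stays flat even for a ~15 GB log:
# with open("access.log") as f:   # hypothetical path
#     for url, count in top_urls(f):
#         print(count, url)
```

This only works because both fields of interest come before the quoted User-Agent; fields after it would need quote-aware parsing (e.g. the csv module with a space delimiter).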
Niklas B.'s code is working nicely and I can print the top IPs, users etc.
Unfortunately, I cannot get the program to print or write this to an external file or email it. Currently my code looks like this, and only the last line gets written to the file. What might be the problem?
for ip, count in heapq.nlargest(k, sourceip.iteritems(), key=itemgetter(1)):
    top = "%d %s" % (count, ip)
    v = open("C:/Users/guest/Desktop/Log analysis/urls.txt", "w")
    print >>v, top
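In case it helps anyone reading: a likely culprit is that the file is opened in "w" mode inside the loop, which truncates it on every iteration, leaving only the last line. A sketch of the fix (rewritten for Python 3, with placeholder stand-ins for sourceip and k, which in the original come from the rest of the program):

```python
import heapq
from operator import itemgetter

# Placeholder data standing in for the real per-IP counts and cutoff.
sourceip = {"10.184.17.23": 42, "10.0.201.17": 7}
k = 2

# Open the file ONCE, before the loop: "w" mode truncates the file to
# empty each time it is opened, which is why only the last line survived.
with open("urls.txt", "w") as v:
    for ip, count in heapq.nlargest(k, sourceip.items(), key=itemgetter(1)):
        v.write("%d %s\n" % (count, ip))
```

The with statement also closes (and flushes) the file when the loop finishes, so nothing is lost in an unflushed buffer.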