I am working with a CSV that has the following structure:
"2012-09-01 20:03:15","http://example.com"
The data is a cleaned-up dump of my browsing history. I am interested in counting the first five unique domains visited on any given day. Here is what I have so far:
from urlparse import urlparse
import csv
from collections import Counter

domains = Counter()
with open("history.csv") as f:
    for row in csv.reader(f):
        d = row[0]
        dt = d[11:19]
        dt = dt.replace(":","")
        dd = d[0:10]
        if (dt < "090000") and (dt > "060000"):
            url = row[1]
            p = urlparse(url)
            ph = p.hostname
            print dd + "," + dt + "," + ph
            domains += Counter([ph])
t = str(domains.most_common(20))
With d, dt, and dd, I am separating the date and time. With the above example row, dt = 200315 (the time 20:03:15 with the colons removed), and dd = 2012-09-01. The condition if (dt < "090000") and (dt > "060000") is just to say that I am only interested in counting websites visited between 6am and 9am. How would I say "count only the first five websites that were visited before 6am, each day"? There are hundreds of rows for any given day, and the rows are in chronological order.
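To make the goal concrete, here is a rough sketch of the behaviour I am after, not a tested solution. It assumes the rows are in chronological order (as they are in my file) and keeps, for each day, a list of the unique domains already counted, ignoring further domains once that list reaches the limit. The function name count_first_domains and the parameters limit, start, and end are hypothetical names made up for this sketch; the default window simply copies the 6am to 9am filter from my code above.

```python
from collections import Counter, defaultdict

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as in the code above


def count_first_domains(rows, limit=5, start="060000", end="090000"):
    """Count the first `limit` unique domains seen on each day.

    `rows` is an iterable of (timestamp, url) pairs in chronological order.
    Only visits with start < time < end are considered; the defaults are
    an assumption copied from the 6am-9am filter in the question.
    """
    seen = defaultdict(list)  # date string -> unique domains counted that day
    domains = Counter()       # domain -> number of days it made the cut
    for stamp, url in rows:
        day = stamp[0:10]
        time = stamp[11:19].replace(":", "")
        if not (start < time < end):
            continue
        host = urlparse(url).hostname
        # Skip rows without a hostname and domains already counted today.
        if host is None or host in seen[day]:
            continue
        # Only the first `limit` unique domains of the day are counted.
        if len(seen[day]) < limit:
            seen[day].append(host)
            domains[host] += 1
    return domains, dict(seen)
```

With my file this would be called as count_first_domains(csv.reader(f)) inside the with block. Passing start="" and end="060000" would keep only visits before 6am instead, since the empty string compares lower than any time string.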