
I am working with a CSV that has the following structure:

"2012-09-01 20:03:15","http://example.com"

The data is a cleaned up dump of my browsing history. I am interested in counting the first five unique domains per a given day. Here is what I have so far:

from urlparse import urlparse
import csv
from collections import Counter

domains = Counter()

with open("history.csv") as f:
    for row in csv.reader(f):
        d = row[0]
        dt = d[11:19]
        dt = dt.replace(":","")
        dd = d[0:10]
        if (dt < "090000") and (dt > "060000"):
            url = row[1]
            p = urlparse(url)
            ph = p.hostname
            print dd + "," + dt + "," + ph
            domains += Counter([ph])
t = str(domains.most_common(20))

With d, dt, and dd, I am separating the date and time. With the above example row, dd = 2012-09-01 and dt = 200315 (the time 20:03:15 with the colons removed, so it can be compared as a string). The condition "if (dt < "090000") and (dt > "060000")" just says that I am only interested in counting websites visited between 6am and 9am. How would I say "count only the first five websites that were visited before 6am, each day"? There are hundreds of rows for any given day, and the rows are in chronological order.

Jonas
dongle

2 Answers


I am interested in counting the first five unique domains per a given day.

import csv
from collections import defaultdict
from datetime import datetime
from urlparse import urlsplit

domains = defaultdict(lambda: defaultdict(int))
with open("history.csv", "rb") as f:
    for timestr, url in csv.reader(f):
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if 6 <= dt.hour < 9:  # between 6am and 9am
            today_domains = domains[dt.date()]  # per given day
            domain = urlsplit(url).hostname
            if len(today_domains) < 5 or domain in today_domains:
                today_domains[domain] += 1  # count the first 5 unique domains

print(domains)
jfs
  • `datetime` is definitely the way to go -- I was working on my own when I had the dreaded "new answer posted" message -- but I think there are a few minor tweaks. I think `strptime` takes its arguments in the other order; as written, this breaks after the first day; and `most_common` returns a list of key, value pairs, not just keys. But these are all trivial. – DSM Sep 02 '12 at 01:09
  • @DSM: I've fixed argument order. It is unclear whether 'history.csv' contains more than one day. I've changed it to accept more than one day. I don't understand the output format in the question. – jfs Sep 02 '12 at 01:10
  • +1 now. As for the output, I just meant that '\n'.join(etc) wouldn't work 'cause it wasn't adding strings.. – DSM Sep 02 '12 at 01:17
  • Improvement on my messy code for sure, but this does not address the question of counting the first five unique days. Perhaps I was not clear… there are over 90 days included in the data, each day with hundreds of entries. I only want to count the first five rows of each unique day. What you wrote returns the top 5 most common URLs, but otherwise the same results. – dongle Sep 02 '12 at 01:21
  • @J.F.Sebastian Thanks – This is close, but I'd like to still count the occurrences of a given hostname (only sampling the first five rows of each unique day). I tried reintroducing "Counter[urlsplit(url).hostname] += 1" into the above, but it returned "TypeError: 'type' object is not subscriptable". Sorry for being unclear! – dongle Sep 02 '12 at 01:45
  • @dongle: I've introduced the occurrences count. – jfs Sep 02 '12 at 01:59
  • @J.F.Sebastian edited to add the type of counting I intended. – dongle Sep 03 '12 at 22:57
  • @dongle: you could post it as a separate answer and accept it if it does what you need – jfs Sep 04 '12 at 04:53
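The reading discussed in these comments (sample only the first five rows of each day inside the time window, then tally hostnames across the whole sample) could be sketched roughly as below. This is a Python 3 sketch, not code from the thread: it uses urllib.parse where the thread's Python 2 code uses urlparse, and the function name count_first_rows, its parameters, and the inline sample data are all made up for illustration. It also assumes the rows are in chronological order, as stated in the question.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime
from urllib.parse import urlsplit


def count_first_rows(rows, per_day=5, start_hour=6, end_hour=9):
    """Tally hostnames over the first `per_day` in-window rows of each day.

    Hypothetical helper; assumes `rows` yields (timestamp, url) pairs in
    chronological order.
    """
    sampled = defaultdict(int)  # date -> rows already sampled on that day
    hosts = Counter()           # hostname -> occurrences within the sample
    for timestr, url in rows:
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if not (start_hour <= dt.hour < end_hour):
            continue            # outside the time window
        day = dt.date()
        if sampled[day] >= per_day:
            continue            # this day's sample is already full
        sampled[day] += 1
        hosts[urlsplit(url).hostname] += 1
    return hosts


# Made-up rows standing in for history.csv.
sample = io.StringIO(
    '"2012-09-01 05:59:00","http://b.example/"\n'
    '"2012-09-01 06:01:00","http://a.example/x"\n'
    '"2012-09-01 06:05:00","http://a.example/y"\n'
    '"2012-09-01 06:10:00","http://c.example/"\n'
    '"2012-09-02 07:00:00","http://c.example/"\n'
    '"2012-09-02 08:59:00","http://a.example/"\n'
    '"2012-09-02 09:00:00","http://b.example/"\n'
)

# per_day=2 keeps the example small; the question asks for 5.
first_rows = count_first_rows(csv.reader(sample), per_day=2)
print(first_rows.most_common())
```

Unlike the answer above, this caps the number of rows per day rather than the number of distinct domains, so a domain visited twice among the first rows is counted twice.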
1
import csv
from collections import defaultdict, Counter
from datetime import datetime
from urlparse import urlsplit

indiv = Counter()

domains = defaultdict(lambda: defaultdict(int))
with open("history.csv", "rb") as f:
    for timestr, url in csv.reader(f):
        dt = datetime.strptime(timestr, "%Y-%m-%d %H:%M:%S")
        if 6 <= dt.hour < 11: # between 6am and 11am
            today_domains = domains[dt.date()]
            domain = urlsplit(url).hostname
            if len(today_domains) < 5 and domain not in today_domains:
                today_domains[domain] += 1
                indiv += Counter([domain])
for domain in indiv:
    print '%s,%d' % (domain, indiv[domain])
dongle
  • what you have written is: how many *days* a domain is among the first 5 unique domains visited between 6am and 11am. btw, `indiv += Counter([domain])` is a perverse way of writing `indiv[domain] += 1`. You don't need `indiv`, you could use: `Counter(host for perday_domains in domains.viewvalues() for host in perday_domains)` instead or using `itertools.chain`: `Counter(chain.from_iterable(domains.viewvalues()))` – jfs Sep 04 '12 at 14:36
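The distinction drawn in this comment can be illustrated with a small Python 3 sketch (viewvalues() is Python 2 only, so values() is used here instead). The per-day data is invented, with plain date strings as keys for brevity; the second tally, summing the per-day counts, is an added illustration and not something proposed in the comment.

```python
from collections import Counter
from itertools import chain

# Invented per-day data in the same shape as `domains` above:
# day -> {hostname: hits on that day}.
domains = {
    "2012-09-01": {"a.example": 3, "b.example": 1},
    "2012-09-02": {"a.example": 2, "c.example": 4},
}

# Iterating a dict yields its keys, so chaining the per-day dicts yields
# each hostname once per day: this counts days, not visits.
days_per_host = Counter(chain.from_iterable(domains.values()))

# Summing the per-day counts instead gives total visits per hostname.
hits_per_host = sum((Counter(day) for day in domains.values()), Counter())

print(days_per_host)
print(hits_per_host)
```

Here a.example appears on two days but has five hits in total, which is the gap between "how many days" and "how many occurrences".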