
I'm new to Python, and here is my question.

Write a program to read through mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.

Link to the file:

http://www.pythonlearn.com/code/mbox-short.txt

This is my code:

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)

counts = dict()
for line in handle:
    if not line.startswith ("From "):continue
    #words = line.split()

    col = line.find(':')
    coll = col - 2
    print coll


    #zero = line.find('0')
    #one = line.find('1')
    #b = line[ zero or one : col ]
    #print b
    #hour = words[5:6]
    #print hour

    #for line in hour:
     #   hr = line.split(':')
      #  x = hr[1]

    for x in coll:
        counts[x] = counts.get(x,0) + 1

        for key, value in sorted(counts.items()):
            print key, value

My first try used list splitting (the commented-out lines), and it didn't work because it treated the 0 and the 1 as the first and second letters rather than as numbers. My second try used line.find(':'), which partially worked, but with the minutes rather than the hours as required.

First question

Why, when I write line.find(':'), does it automatically take the two numbers after it?

Second question

Why, when I run the program now, does it give the error TypeError: 'int' object is not iterable on line 26?

Third question

Why did it treat 0 and 1 as the first and second letters of the line rather than as the numbers 0 and 1?

Finally, if possible, please solve this problem for me with a little explanation (using the same code, to keep my learning sequence).

Thank you...

  • [`string.find`](https://docs.python.org/2/library/string.html#string.find) returns a number, e.g. `"From 10:20:30".find(":")` returns 7 because the first occurrence of the character `:` in the string is at index 7, while `"From 10:20:30".find(":", 8)` returns 10 – Aprillion Jul 06 '17 at 17:20

2 Answers


That's because str.find() returns the index of the found substring, not the substring itself. Consequently, when you subtract 2 from it and then try to loop over the result, Python complains that you're trying to loop over an integer and raises a TypeError.

You can grab the whole time string as:

time_start = line.find(":")
if time_start == -1:  # not found
    continue
time_string = line[time_start-2:time_start+6]  # slice out the whole time string

You can then split time_string on : to get the hours, minutes and seconds, e.g. hours, minutes, seconds = time_string.split(":", 2) (just keep in mind that those will be strings, not integers), or, if you just want the hour:

hour = int(line[time_start-2:time_start])

You can take it from there - just increase your dict value and when you're done with parsing the file sort everything out.
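Putting those steps together, a minimal sketch could look like the one below. The function name count_hours and the sample lines are made up for illustration; with the real file you would pass the open file handle instead of the list. print is written with parentheses and a single argument so that it behaves the same on Python 2 and Python 3.

```python
def count_hours(lines):
    """Count 'From ' lines per hour, using the position of the first colon."""
    counts = {}
    for line in lines:
        if not line.startswith("From "):
            continue
        time_start = line.find(":")
        if time_start == -1:  # no colon on this line
            continue
        # the two characters right before the first colon are the hour
        hour = int(line[time_start - 2:time_start])
        counts[hour] = counts.get(hour, 0) + 1
    return counts


# Made-up sample lines in the same shape as the ones in mbox-short.txt
sample = [
    "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n",
    "From: stephen.marquard@uct.ac.za\n",
    "From someone@example.com Fri Jan 4 18:10:48 2008\n",
    "From another@example.com Fri Jan 4 09:05:31 2008\n",
]

hour_counts = count_hours(sample)
for hour in sorted(hour_counts):
    print("%d %d" % (hour, hour_counts[hour]))
```

The "From:" header line is skipped because the check is for "From " with a trailing space.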

zwer

First question: Why, when I write line.find(':'), does it automatically take the two numbers after it?

str.find() returns the index of the first occurrence of the character that you want to find. If your string is "From 00:00:00", it returns 7, as the first ':' is at index 7.

Second question: Why, when I run the program now, does it give the error TypeError: 'int' object is not iterable on line 26?

As said above, it returns an int, which you cannot iterate over.
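A small sketch that shows both points on the sample line from the question. The variable names hour_text and error are only for illustration; the try/except is there so the script keeps running after the same mistake as `for x in coll`.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

col = line.find(":")             # an int: the position of the first ':'
hour_text = line[col - 2:col]    # the two characters before that position

try:
    for x in col - 2:            # same mistake as `for x in coll`
        pass
    error = None
except TypeError as exc:
    error = str(exc)

print(type(col).__name__)        # the type of what find() returned
print(hour_text)
print(error)
```

So find() only gives you a position; to get text out of the line you still have to slice the line with that position.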

Third question

Why did it treat 0 and 1 as the first and second letters of the line rather than as the numbers 0 and 1?

I don't really understand what you mean here. As far as I understand, you try to find the first index at which '0' or '1' occurs and assume that it is the first character of the hour? What about the hours from 8pm to 11pm, which start with 2?
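To make that concrete, here is a sketch of what the commented-out find('0') / find('1') attempt would do on a made-up line whose hour is 21. This is only one possible reading of that attempt, and the names start and guess are introduced here for illustration.

```python
# Made-up line in the same shape as the file, with an hour that starts with 2
line = "From someone@example.com Sat Jan 5 21:14:16 2008"

zero = line.find("0")   # position of the first '0' anywhere in the line
one = line.find("1")    # position of the first '1' anywhere in the line
col = line.find(":")    # position of the first ':'

# `zero or one` evaluates to `zero` whenever zero is not 0,
# so the slice starts at the first '0' in the line, wherever that is
start = zero or one
guess = line[start:col]

# counting back from the colon does not depend on which digits the hour has
hour = line[col - 2:col]

print(repr(guess))
print(hour)
```

Here the first '0' is in the year, after the colon, so the slice comes back empty, while counting back two characters from the colon still gives the hour.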

Finally, if possible, please solve this problem for me with a little explanation (using the same code, to keep my learning sequence).

Sure, it will be like this:

f = open("mbox-short.txt")  # or the handle from your own code
count = dict()

for line in f:
    if not line.startswith("From "): continue

    first_colon_index = line.find(":")
    if first_colon_index == -1: # there is no ':'
        continue
    first_char_hour_index = first_colon_index - 2

    # string slicing
    # [a:b] gets the characters from index a up to, but not including, index b
    hour = line[first_char_hour_index:first_char_hour_index+2]

    hour_int = int(hour)

    # if the key exists, increase it by 1. If not, set it to 1
    if hour_int in count:
        count[hour_int] += 1
    else:
        count[hour_int] = 1

# print hour & count, in sorted order
for hour in sorted(count):
    print hour, count[hour]

The part about string slicing can be confusing; you can read more about it in the Python docs.
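As a quick illustration of slicing, here is a short sketch on a time string like the one in the file (the variable names are just for the example). The start index is included and the end index is not.

```python
time_string = "09:14:16"

hour = time_string[0:2]     # characters at index 0 and 1; index 2 is excluded
minute = time_string[3:5]   # characters at index 3 and 4
rest = time_string[2:]      # leaving out the end runs to the end of the string

print(hour)
print(minute)
print(rest)
```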

You also have to be sure that there is no other ":" earlier in the line, or this method will fail, because the first ":" will then not be the one between the hour and the minute.

To make sure it works, it's better to use a regex. Something like:

import re

f = open("mbox-short.txt")  # or the handle from your own code
count = dict()

for line in f:
    if not line.startswith("From "): continue

    # capture the first hh:mm:ss group on the line
    match = re.search(r'^From .*?([0-9]{2}:[0-9]{2}:[0-9]{2})', line)
    if match:
        time = match.group(1) # hh:mm:ss
        hh = int(time.split(":")[0])
        # if the key exists, increase it by 1. If not, set it to 1
        if hh in count:
            count[hh] += 1
        else:
            count[hh] = 1

# print hour & count, in sorted order
for hour in sorted(count):
    print hour, count[hour]
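Another option, closer to the hint in the assignment, is to split twice. The sketch below assumes the time is always the sixth whitespace-separated field, as in the sample 'From ' line; the function name count_hours_by_split and the sample lines are made up for illustration. Note that words[5] picks the single field (words[5:6] would be a one-element list), and that index 0 after the second split is the hour (index 1 is the minutes). print uses parentheses with one argument so it runs on both Python 2 and 3.

```python
def count_hours_by_split(lines):
    """Count 'From ' lines per hour by splitting on whitespace, then on ':'."""
    counts = {}
    for line in lines:
        if not line.startswith("From "):
            continue
        words = line.split()
        if len(words) < 6:        # not the expected shape, skip it
            continue
        time_string = words[5]    # e.g. '09:14:16'
        hour = time_string.split(":")[0]   # kept as a string, e.g. '09'
        counts[hour] = counts.get(hour, 0) + 1
    return counts


# Made-up sample lines in the same shape as the ones in mbox-short.txt
sample = [
    "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n",
    "From: stephen.marquard@uct.ac.za\n",
    "From someone@example.com Fri Jan 4 18:10:48 2008\n",
    "From another@example.com Fri Jan 4 09:05:31 2008\n",
]

split_counts = count_hours_by_split(sample)
for hour, total in sorted(split_counts.items()):
    print("%s %d" % (hour, total))
```

Because the hour is kept as a two-character string here, the keys still sort in hour order and keep their leading zero.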
hunzter
  • Thank you. But I really need it solved without setdefault and using sort, if possible, with output 04 3 06 1 07 1 09 2 10 3 11 6 14 1 15 2 16 4 17 2 18 1 19 1 – Sherif Ezzat Jul 09 '17 at 14:56
  • `setdefault` is just shorthand for if/else. Code updated. And by sort, do you mean printing the result in order? Added to the code too. – hunzter Jul 10 '17 at 16:15
  • This is really perfect for me and I have understood it very well, but I have one last question, if you'll allow me. To get the grade for this assignment I need the output to have zeros on the left, like 04 06 07 and so on, instead of 4 6 7. Why are the leading zeros gone in this solution, and how can I make the output keep them? – Sherif Ezzat Jul 11 '17 at 15:57
  • Thank you. If you are satisfied, please consider marking this answer as correct. What you need is called padding. You can achieve that with something like `'%02d' % number`, which returns a string of length 2, padding the `number` with a leading 0 if it is shorter than 2 characters, i.e. 4 => 04, 12 => 12. There are more details in another answer here: https://stackoverflow.com/a/339013/3444923 – hunzter Jul 12 '17 at 02:03
  • I did mark it as the right answer, and I tried to vote it up as well, but couldn't as I don't have enough reputation. '%02d' % ... worked for me, as it works with integers. Thank you very much for your answers, explanation and patience :) – Sherif Ezzat Jul 12 '17 at 05:17
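For reference, the padding discussed in the comments can be sketched like this. The counts dictionary is made-up data, not the real result for the file. The zeros go missing because int("04") is the number 4, and a number has no leading zeros, so they have to be put back when the number is formatted.

```python
counts = {4: 3, 6: 1, 14: 1}   # made-up counts keyed by integer hour

# '%02d' pads the hour with a leading zero up to a width of 2
lines = ["%02d %d" % (hour, counts[hour]) for hour in sorted(counts)]
for text in lines:
    print(text)
```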