Python: how to parse and check the time?

Question

How do I extract the IP address that occurs 10 times within a one-second time interval?

In the following case:

241.7118.197.10

28.252.8

niemmi · Accepted Answer · 2016-12-25T00:26:41.353

You could collect the data into a dict where each IP is a key and the value holds the timestamps seen for that IP. Then, every time a timestamp is added, you can check whether that IP has three timestamps within one second:

from datetime import datetime, timedelta
from collections import defaultdict, deque
import re

THRESHOLD = timedelta(seconds=1)
COUNT = 3

res = set()
d = defaultdict(deque)

with open('test.txt') as f:
    for line in f:
        # Capture IP and timestamp
        m = re.match(r'(\S*)[^\[]*\[(\S*)', line)
        if m is None:
            continue  # skip lines (e.g. blank ones) that lack the expected layout
        ip, dt = m.groups()

        # Parse timestamp
        dt = datetime.strptime(dt, '%d/%b/%Y:%H:%M:%S:%f')

        # Remove timestamps from deque if they are older than threshold
        que = d[ip]
        while que and (dt - que[0]) > THRESHOLD:
            que.popleft()

        # Add timestamp, update result if there are COUNT or more items
        que.append(dt)
        if len(que) >= COUNT:
            res.add(ip)

print(res)

Result:

{'28.252.89.140'}

The code above reads the log file line by line. For every line, a regular expression captures two groups, the IP and the timestamp, and strptime then parses the timestamp.

The first group (\S*) captures everything up to the first whitespace, which is the IP. Then [^\[]* matches everything that is not [, and \[ matches the literal [ that precedes the timestamp. Finally, (\S*) is used again to capture everything up to the next whitespace, which is the timestamp. See example on regex101.
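To make the regex walkthrough concrete, here is a minimal sketch that applies the same pattern and strptime format to a single line. The log line itself is hypothetical: its layout (IP, then a bracketed timestamp with a microsecond field) is an assumption inferred from the format string in the answer, not taken from the asker's file.

```python
import re
from datetime import datetime

# Hypothetical log line; the exact layout is an assumption for illustration.
line = '28.252.89.140 - - [24/Dec/2016:01:02:03:123456 -0800] "GET / HTTP/1.1" 200 512'

m = re.match(r'(\S*)[^\[]*\[(\S*)', line)
ip, raw = m.groups()

# Group 1 is the IP, group 2 is the timestamp up to the first whitespace.
print(ip)   # 28.252.89.140
print(raw)  # 24/Dec/2016:01:02:03:123456

# %f consumes the trailing microsecond field.
ts = datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S:%f')
print(ts.microsecond)  # 123456
```

Note that the timezone offset after the space is not captured, so the parsed datetime is naive.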

Once we have the IP and the time, they are added to a defaultdict in which the IP is the key and the value is a deque of timestamps. Before a new timestamp is added, old ones are removed if they are more than THRESHOLD older than it. This assumes that the log lines are already sorted by time. After the addition the length is checked, and if there are COUNT or more items in the queue, the IP is added to the result set.
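The sliding-window step can also be tried without a log file. The sketch below wraps the same deque logic in a function and feeds it a few in-memory events; the function name find_bursty_ips and the sample IPs and times are made up for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def find_bursty_ips(events, count=3, window=timedelta(seconds=1)):
    """Return IPs with `count` or more timestamps inside `window`.

    `events` is an iterable of (ip, datetime) pairs, assumed sorted by time.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ip, ts in events:
        que = recent[ip]
        # Drop timestamps that fell out of the window ending at `ts`.
        while que and ts - que[0] > window:
            que.popleft()
        que.append(ts)
        if len(que) >= count:
            flagged.add(ip)
    return flagged

# Hypothetical sample data, already in time order.
base = datetime(2016, 12, 24, 1, 0, 0)
events = [
    ('10.0.0.1', base),
    ('10.0.0.2', base + timedelta(milliseconds=100)),
    ('10.0.0.1', base + timedelta(milliseconds=300)),
    ('10.0.0.1', base + timedelta(milliseconds=900)),
    ('10.0.0.2', base + timedelta(seconds=2)),
    ('10.0.0.2', base + timedelta(seconds=4)),
]
print(find_bursty_ips(events))  # {'10.0.0.1'}
```

Because each timestamp is appended once and popped at most once, the work per log line stays small even for long files.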

Yes, a `deque` of timestamps, from which the oldest items may be removed every time a new timestamp is added. — niemmi, Dec 24 '16 at 01:03
Is it possible to explain the `r'(^[^\s]*)[^\[]*\[([^\s\]]*):(\d+)'`? — Maria, Dec 24 '16 at 01:13
@niemmi Nice solution. I would add a variable with the number of occurrences as a constant, the way you did with `THRESHOLD`. — Fomalhaut, Dec 24 '16 at 02:12
@Fomalhaut Thanks for the suggestion, updated the answer accordingly. — niemmi, Dec 24 '16 at 02:19

Carles Mitjans · Answer 2 · 2016-12-24T01:07:57.080

First step would be to parse the data. You can do so with this:

import re
from datetime import datetime, timedelta

data = [(ip, datetime.strptime(time, '%d/%b/%Y:%H:%M:%S:%f'))
        for (ip, time) in re.findall(r"((?:[0-9]{1,3}\.){3}[0-9]{1,3}).+?\[(.+?) -", text)]

where text is the input text.

This returns a list with one tuple per entry. The first element of each tuple is the IP address, the second is the parsed date.

Next step is to see which entries share the same IP and fall within a 1 second interval:

# Report an IP when three of its entries a > b > c lie less than one second apart
zero, one_sec = timedelta(0), timedelta(seconds=1)
print({a[0] for a in data for b in data for c in data
       if a[0] == b[0] == c[0]
       and zero < a[1] - b[1] < one_sec
       and zero < a[1] - c[1] < one_sec
       and zero < b[1] - c[1] < one_sec})

Output:

{'28.252.89.140'}
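The triple comprehension compares every combination of three entries, so its cost grows cubically with the number of log lines. As a rough sketch of a cheaper variant of the same check, one could group the timestamps by IP, sort them, and compare entries that sit `count - 1` positions apart. The function name ips_with_burst and the sample data are hypothetical. Unlike the comprehension, this version also counts identical timestamps as separate hits.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def ips_with_burst(data, count=3, window=timedelta(seconds=1)):
    """Return IPs that have `count` entries spanning less than `window`.

    `data` is a list of (ip, datetime) tuples in any order.
    """
    by_ip = defaultdict(list)
    for ip, ts in data:
        by_ip[ip].append(ts)

    hits = set()
    for ip, stamps in by_ip.items():
        stamps.sort()
        # In sorted order, the tightest group of `count` entries is contiguous.
        for i in range(len(stamps) - count + 1):
            if stamps[i + count - 1] - stamps[i] < window:
                hits.add(ip)
                break
    return hits

# Hypothetical sample data for illustration.
t0 = datetime(2016, 12, 24, 1, 0, 0)
sample = [
    ('28.0.0.1', t0),
    ('28.0.0.2', t0),
    ('28.0.0.1', t0 + timedelta(milliseconds=200)),
    ('28.0.0.2', t0 + timedelta(milliseconds=500)),
    ('28.0.0.1', t0 + timedelta(milliseconds=700)),
    ('28.0.0.2', t0 + timedelta(seconds=3)),
]
print(ips_with_burst(sample))  # {'28.0.0.1'}
```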

edited Dec 24 '16 at 01:07

answered Dec 24 '16 at 00:48


Which file? You only specified the input text. If that text is in a file, you should first read it. – Carles Mitjans Dec 24 '16 at 00:52
To use these two lines you will need two modules: `re` and `datetime` – Carles Mitjans Dec 24 '16 at 00:53
