8

How do I extract the IP address that occurs 10 times within a one-second time interval?

In the following case:

241.7118.197.10

28.252.8

Maria

2 Answers

5

You could collect the data into a dict where each IP is a key and the value holds the timestamps seen for that IP. Then, every time a timestamp is added, you can check whether that IP has three timestamps within a second (change COUNT to 10 for the threshold in the question):

from datetime import datetime, timedelta
from collections import defaultdict, deque
import re

THRESHOLD = timedelta(seconds=1)
COUNT = 3

res = set()
d = defaultdict(deque)

with open('test.txt') as f:
    for line in f:
        # Capture IP and timestamp; skip lines that do not match
        m = re.match(r'(\S*)[^\[]*\[(\S*)', line)
        if m is None:
            continue
        ip, dt = m.groups()

        # Parse timestamp
        dt = datetime.strptime(dt, '%d/%b/%Y:%H:%M:%S:%f')

        # Remove timestamps from deque if they are older than threshold
        que = d[ip]
        while que and (dt - que[0]) > THRESHOLD:
            que.popleft()

        # Add timestamp, update result if there are COUNT or more items
        que.append(dt)
        if len(que) >= COUNT:
            res.add(ip)

print(res)

Result:

{'28.252.89.140'}

The code above reads the log file line by line. For every line, a regular expression captures two groups, the IP and the timestamp, and strptime then parses the timestamp.

The first group, (\S*), captures everything up to the first whitespace. Then [^\[]* matches everything that is not [, and \[ matches the opening bracket just before the timestamp. Finally, (\S*) is used again to capture everything up to the next whitespace. See the example on regex101.
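To make the two capture groups concrete, here is a minimal sketch that applies the same regex and strptime pattern to a single line. The log line is hypothetical: its layout is an assumption inferred from the regex and the timestamp format, not taken from the question's data.

```python
import re
from datetime import datetime

# Hypothetical log line (assumed layout: IP, filler, then "[timestamp offset]")
line = '28.252.89.140 - - [12/Feb/2016:10:15:30:250000 -0500] "GET / HTTP/1.1" 200 512'

# Group 1: the IP (non-whitespace run at the start of the line)
# Group 2: the non-whitespace run right after the opening bracket
m = re.match(r'(\S*)[^\[]*\[(\S*)', line)
ip, raw_ts = m.groups()

# %f parses the fractional part that follows the seconds
ts = datetime.strptime(raw_ts, '%d/%b/%Y:%H:%M:%S:%f')

print(ip, ts)
```

Because the second group stops at the first whitespace, the UTC offset in this assumed layout is left out of the captured timestamp.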

Once we have the IP and the time, they are added to a defaultdict in which the IP is the key and the value is a deque of timestamps. Before a new timestamp is added, existing ones are removed if they are more than THRESHOLD older than it. This assumes that the log lines are already sorted by time. After the addition, the length is checked, and if there are COUNT or more items in the queue, the IP is added to the result set.
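The sliding-window step can also be tried without a log file. The sketch below wraps the same deque logic in a function; the name frequent_ips, its parameters, and the sample events are all made up for illustration, and the input is assumed to be sorted by time.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def frequent_ips(events, count=3, window=timedelta(seconds=1)):
    """Return IPs with at least `count` events inside any `window`.

    `events` is an iterable of (ip, datetime) pairs, assumed sorted by time.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    found = set()
    for ip, ts in events:
        que = recent[ip]
        # Drop timestamps that are more than `window` older than the new one
        while que and ts - que[0] > window:
            que.popleft()
        que.append(ts)
        if len(que) >= count:
            found.add(ip)
    return found

# Hypothetical events: 10.0.0.1 appears three times within one second
base = datetime(2016, 2, 12, 10, 15, 30)
events = [
    ('10.0.0.1', base),
    ('10.0.0.2', base + timedelta(milliseconds=100)),
    ('10.0.0.1', base + timedelta(milliseconds=400)),
    ('10.0.0.1', base + timedelta(milliseconds=900)),
    ('10.0.0.2', base + timedelta(seconds=2)),
    ('10.0.0.2', base + timedelta(seconds=4)),
]

result = frequent_ips(events)
print(result)
```

Calling frequent_ips(events, count=10) would apply the threshold from the question; on this small sample it returns an empty set.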

niemmi
3

The first step is to parse the data, which you can do like this:

# Assumes: import re; from datetime import datetime
data = [(ip, datetime.strptime(time, '%d/%b/%Y:%H:%M:%S:%f')) for (ip, time) in re.findall(r"((?:[0-9]{1,3}\.){3}[0-9]{1,3}).+?\[(.+?) -", text)]

where text is the input text.

This returns a list with one tuple per log entry. The first element of each tuple is the IP address and the second is the parsed timestamp.
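As a rough illustration of the shape of that list, the sketch below runs the same findall pattern over a two-line sample. The sample text is hypothetical and assumes each timestamp is followed by a space and a negative UTC offset, which is what the trailing " -" in the pattern expects.

```python
import re
from datetime import datetime

# Hypothetical input; the layout is an assumption, not the question's real log
text = (
    '28.252.89.140 - - [12/Feb/2016:10:15:30:250000 -0500] "GET / HTTP/1.1" 200 512\n'
    '10.0.0.7 - - [12/Feb/2016:10:15:31:000000 -0500] "GET /a HTTP/1.1" 200 128\n'
)

# Group 1: dotted-quad IP; group 2: everything between "[" and " -"
pattern = r"((?:[0-9]{1,3}\.){3}[0-9]{1,3}).+?\[(.+?) -"

data = [
    (ip, datetime.strptime(ts, '%d/%b/%Y:%H:%M:%S:%f'))
    for ip, ts in re.findall(pattern, text)
]

for ip, ts in data:
    print(ip, ts)
```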

The next step is to find the entries that share the same IP and fall within a one-second interval:

# Assumes: from datetime import datetime, timedelta
zero, one_sec = timedelta(seconds=0), timedelta(seconds=1)
# Three entries c < b < a with the same IP, all less than one second apart
print(set(a[0] for a in data for b in data for c in data
          if a[0] == b[0] == c[0]
          and zero < a[1] - b[1] < one_sec
          and zero < a[1] - c[1] < one_sec
          and zero < b[1] - c[1] < one_sec))

Output:

set(['28.252.89.140'])
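The triple loop compares every combination of three entries, so it gets slow on large logs and is fixed at three occurrences. One possible alternative, sketched here and not part of the original answer, is to group timestamps by IP, sort them, and compare each timestamp with the one count - 1 positions later. The function name ips_with_burst and the sample data are hypothetical, and unlike the strict comparison above, this sketch treats a span of exactly one second as a match.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def ips_with_burst(data, count=3, window=timedelta(seconds=1)):
    """Return IPs that have `count` timestamps spanning at most `window`.

    `data` is a list of (ip, datetime) tuples in any order.
    """
    by_ip = defaultdict(list)
    for ip, ts in data:
        by_ip[ip].append(ts)

    found = set()
    for ip, stamps in by_ip.items():
        stamps.sort()
        # In sorted order, `count` entries fit in the window exactly when
        # the first and last of some run of `count` entries do
        for i in range(len(stamps) - count + 1):
            if stamps[i + count - 1] - stamps[i] <= window:
                found.add(ip)
                break
    return found

# Hypothetical, deliberately unsorted sample
base = datetime(2016, 2, 12, 10, 15, 30)
data = [
    ('10.0.0.2', base + timedelta(seconds=2)),
    ('10.0.0.1', base + timedelta(milliseconds=900)),
    ('10.0.0.1', base),
    ('10.0.0.2', base),
    ('10.0.0.1', base + timedelta(milliseconds=400)),
    ('10.0.0.2', base + timedelta(seconds=5)),
]

result = ips_with_burst(data)
print(result)
```

Passing count=10 would match the threshold asked about in the question.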
Carles Mitjans