I am trying to write Python code to extract certain fields from ELB logs, but I cannot find a proper regex that covers all ELB log fields, such as user_agent, request, etc.

For example, how can I print the request and user agent "POST https://example.com:443/api/pages/uuids/8ad6e82e-f86b-11ea-a68d-cbc99f85d247/updateUserHeartbeat HTTP/2.0" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36" from the log line below using a generic regex?

The various ELB fields are documented here: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html

Sample regex that I found:

regex = r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'
line_split = re.split(regex, line)

A sample line from the log file:

h2 2021-06-07T23:57:13.300250Z app/megapool-retool-app/dbb257b8adaa87cf 93.107.2.244:59799 - -1 -1 -1 302 - 3087 561 "POST https://example.com:443/api/pages/uuids/8ad6e82e-f86b-11ea-a68d-cbc99f85d247/updateUserHeartbeat HTTP/2.0" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-west-2:752180062774:targetgroup/megapool-retool-app/1665e090211d92fc "Root=1-6089b259-1c8c6bca3b1d7a895a21a694" "xyz.com" "arn:aws:acm:us-west-2:75218123456562774:certificate/b7a45f0c-3009-42c2-97b9-ab81a61d1b25" 0 2021-06-07T23:57:13.299000Z "authenticate" "-" "-" "-" "-" "-" "-"
Ibrahim

2 Answers

I fixed this by updating the fields list in my Python code to match the original regex I was using. That regex captures client_ip and client_port as separate groups, so the fields list needs a separate entry for each. Once the list lined up with the capture groups, everything worked.
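To illustrate the alignment between capture groups and field names, here is a minimal sketch. It assumes a shortened pattern (`prefix_regex`, a hypothetical name) that contains only the first five groups of the full regex, applied to the start of the sample line from the question.

```python
import re

# Hypothetical shortened pattern: only the first five capture groups of the full regex.
prefix_regex = r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) '
# One name per capture group, in the same order as the groups.
prefix_fields = ["type", "time", "elb", "client_ip", "client_port"]

# Start of the sample log line from the question.
line = ('h2 2021-06-07T23:57:13.300250Z '
        'app/megapool-retool-app/dbb257b8adaa87cf 93.107.2.244:59799 - -1 -1 -1')

match = re.match(prefix_regex, line)
# Pair each field name with the text captured by the group at the same position.
record = dict(zip(prefix_fields, match.groups()))
print(record["client_ip"], record["client_port"])  # prints: 93.107.2.244 59799
```

Because the address and port land in two separate groups, a single "client" entry in the list would shift every later field by one position.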

Below is my code snippet. It is useful for analyzing ELB log files and can be modified further as needed.

import re

fields = [ "type",
"time",
"elb",
"client_ip",
"client_port",
"target_ip",
"target_port",
"request_processing_time",
"target_processing_time",
"response_processing_time",
"elb_status_code",
"target_status_code",
"received_bytes",
"sent_bytes",
"request_type",
"request_url",
"request_protocol",
"user_agent_browser",
"ssl_cipher",
"ssl_protocol",
"target_group_arn",
"trace_id",
"domain_name",
"chosen_cert_arn",
"matched_rule_priority",
"request_creation_time",
"actions_executed",
"redirect_url",
"lambda_error_reason",
"target_port_list",
"target_status_code_list",
"classification",
"classification_reason" ]



field = str(input("what is the field needed? "))
regex = r'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"'

def ParseLogFile(file):
    # Count how often each value of the requested field occurs in the log.
    resultDict = {}
    index = fields.index(field)  # position of the requested field among the capture groups

    with open(file, 'r') as log:
        for line in log:
            match = re.match(regex, line)
            if match is None:
                continue  # skip lines that do not fit the expected format
            val = match.group(index + 1)  # group 0 is the whole match
            resultDict[val] = resultDict.get(val, 0) + 1
    return resultDict

if __name__ == '__main__':
    result = ParseLogFile("C:\\HOME\\2.log")
    print(result)

Your sample regex gets a lot of information you don't actually want. Limit it to what you want and use everything you know about your text.

import re

# [A-Z]+ also covers longer methods such as DELETE, PATCH and OPTIONS
finder = re.compile(r"\"([A-Z]+) (\S*) ([^\"]*)\" \"([^\"]*)\"")

with open("testlog.txt", "r") as fp:
    txt = fp.read()

for req_type, url, protocol, browser_details in finder.findall(txt):
    print(f"{req_type=}")
    print(f"{url=}")
    print(f"{protocol=}")
    print(f"{browser_details=}")
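As a quick check without a log file, the same idea can be tried on the sample line from the question. This is only a sketch: the line is shortened here (the trailing fields are trimmed), and the group names and the `request_re` variable are illustrative choices, not part of the ELB format.

```python
import re

# Shortened copy of the sample line from the question (trailing fields trimmed).
line = (
    'h2 2021-06-07T23:57:13.300250Z app/megapool-retool-app/dbb257b8adaa87cf '
    '93.107.2.244:59799 - -1 -1 -1 302 - 3087 561 '
    '"POST https://example.com:443/api/pages/uuids/'
    '8ad6e82e-f86b-11ea-a68d-cbc99f85d247/updateUserHeartbeat HTTP/2.0" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36" '
    'ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2'
)

# Named groups (illustrative names) for the quoted request and the quoted user agent.
request_re = re.compile(
    r'"(?P<request_type>[A-Z]+) (?P<request_url>\S*) (?P<request_protocol>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

m = request_re.search(line)
# groupdict() maps each group name to the text it captured.
parsed = m.groupdict() if m else {}
for name, value in parsed.items():
    print(f"{name}: {value}")
```

Named groups avoid having to keep a separate list of field names in step with the group order.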
Lukas Schmid