
Here is one line of the log file:

41.42.50.xxx - - [09/Oct/2012:00:00:01 +0200] "GET http://www.xxxxxx.com/solutions-ar/solutions-1466.php HTTP/1.1" 200 10 "http://www.google.com.eg/url?dfasdfeaefdf" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/537.4"

I want to parse the IP address, time, URL, Google URL, and browser into a single line. I use (r'^(((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?))') to match the IP address. How can I get the other fields and output them as HTML? Thanks.

2 Answers


Use a library like apachelog to parse the Apache log lines. It will be more robust and safer than trying to write a regex for the lines.

nneonneo

  • IP Address: r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
  • Time: r'\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}'
  • Time (alternate): r'(?<=\[).+?(?=\])', a lazy match that assumes the timestamp is always inside [] and that nothing else on the line is ever inside []
  • URL: r'https?://.+?(?= HTTP)'
  • Google URL: r'(?<=")https?://.*?google\..*?(?=")'
  • Browser: r'(?<=")Mozilla.+?(?=")'
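
As a rough sketch of how the patterns above could be put together and turned into HTML, assuming one table row per log line is an acceptable output format: the names PATTERNS, parse_line, to_html_row, and SAMPLE are made up for this illustration, and the last octet of the sample IP is a placeholder because the original one is masked.

```python
import html
import re

# Patterns copied from the list above; the dictionary keys are illustrative names.
PATTERNS = {
    "ip": re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'),
    "time": re.compile(r'(?<=\[).+?(?=\])'),
    "url": re.compile(r'https?://.+?(?= HTTP)'),
    "google_url": re.compile(r'(?<=")https?://.*?google\..*?(?=")'),
    "browser": re.compile(r'(?<=")Mozilla.+?(?=")'),
}

# Sample line modelled on the one in the question (IP address is a placeholder).
SAMPLE = (
    '192.0.2.10 - - [09/Oct/2012:00:00:01 +0200] '
    '"GET http://www.xxxxxx.com/solutions-ar/solutions-1466.php HTTP/1.1" 200 10 '
    '"http://www.google.com.eg/url?dfasdfeaefdf" '
    '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.4 (KHTML, like Gecko) '
    'Chrome/22.0.1229.79 Safari/537.4"'
)


def parse_line(line):
    """Return a dict of fields; a field is None when its pattern finds nothing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        fields[name] = match.group(0) if match else None
    return fields


def to_html_row(fields):
    """Render the fields as one <tr>, escaping each value for HTML."""
    cells = "".join(
        "<td>{}</td>".format(html.escape(fields[name] or "")) for name in PATTERNS
    )
    return "<tr>{}</tr>".format(cells)


if __name__ == "__main__":
    print(to_html_row(parse_line(SAMPLE)))
```

Each pattern is searched independently, so a line with no Google referrer or a non-Mozilla user agent simply yields an empty cell rather than failing.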

However, as nneonneo pointed out, using a tool like apachelog will be a lot more robust and reliable.

jdotjdot
  • Thank you very much. By the way, I also parse URLs from referrers other than Google, such as Bing, Yahoo, and other search engines, and I want to extract the keyword after ?q= or q=. How can I match the keyword? I use (?<=q=)|(? – Hisone Nightmare Oct 11 '12 at 02:45
  • For a new question, it's best if you open up a new Stack Overflow question. I'd be happy to answer it there. Feel free to message me the link when you've done so so I can answer it. – jdotjdot Oct 11 '12 at 02:51
  • http://stackoverflow.com/questions/12831537/parse-and-match-the-keyword-in-search-engine-url-use-python-re Thank you again : ) – Hisone Nightmare Oct 11 '12 at 03:02