I need to extract the time (e.g. 02/Jan/2015:08:12), article_id, and user_id from each line of a log file.

The line format looks like this:

67.15.143.7 - - [02/Jan/2015:08:12] "GET/click?article_id=25&user_id=104 HTTP/1.1" 200 2327
67.15.143.7 - - [02/Jan/2015:08:12] "GET/click?article_id=211&user_id=9408 HTTP/1.1" 200 380

I'm a beginner. I searched Google and Stack Overflow, but I haven't found a way to solve it. Can anyone help me? Thanks!

timgeb
Ryueisan
  • You probably want to start reading up on Python regular expressions; the re module should be able to pull everything you're after out of each line. Learning to write regexes has a steep learning curve, but it pays off massively in the long run. Log analyser programs like logstash rely heavily on regexes to extract information. – Rumbles Apr 13 '16 at 21:19

2 Answers

1

A simple regex can extract that.

>>> import re
>>> s = '''67.15.143.7 - - [02/Jan/2015:08:12] "GET/click?article_id=25&user_id=104 HTTP/1.1" 200 2327
... 67.15.143.7 - - [02/Jan/2015:08:12] "GET/click?article_id=211&user_id=9408 HTTP/1.1" 200 380'''
>>> re.findall(r'\[(.*?)\].*?article_id=(\d+).*?user_id=(\d+)', s)
[('02/Jan/2015:08:12', '25', '104'), ('02/Jan/2015:08:12', '211', '9408')]

Use re.search instead of re.findall if you want to apply the pattern to individual lines.
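
For the line-by-line case, a minimal sketch could look like the following. It reuses the pattern from above; the names LOG_PATTERN and parse_line are made up for illustration, and the second entry in lines is an invented non-matching line to show the None case.

```python
import re

# Same pattern as above, compiled once so it can be reused for every line.
LOG_PATTERN = re.compile(r'\[(.*?)\].*?article_id=(\d+).*?user_id=(\d+)')

def parse_line(line):
    """Return (time, article_id, user_id) as strings, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    return match.groups() if match else None

lines = [
    '67.15.143.7 - - [02/Jan/2015:08:12] "GET/click?article_id=25&user_id=104 HTTP/1.1" 200 2327',
    'a line that does not match',  # invented example of a malformed line
]
for line in lines:
    print(parse_line(line))
```

Checking for None matters here because re.search returns None when nothing matches, and calling .groups() on it would raise an AttributeError. When reading from a file, you could loop over the open file object instead of the lines list.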

timgeb
1
import re
# Each tuple in the result holds the bracketed time, the article_id, and the user_id of one line.
result = re.findall(r'\[(.+?)\].+?article_id=(\d+)&user_id=(\d+)', your_string)
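
If you need typed values rather than strings, something along these lines might work. This is only a sketch: CLICK_RE and parse_click are hypothetical names, the timestamp format string is inferred from the two sample lines in the question, and %b assumes English month abbreviations in the current locale.

```python
import re
from datetime import datetime

# Named groups make the captured fields self-describing.
CLICK_RE = re.compile(
    r'\[(?P<time>[^\]]+)\].*?article_id=(?P<article_id>\d+)&user_id=(?P<user_id>\d+)'
)

def parse_click(line):
    """Return a dict with a datetime and two ints, or None if the line does not match."""
    m = CLICK_RE.search(line)
    if m is None:
        return None
    return {
        # Format assumed from the sample lines: day/month-abbreviation/year:hour:minute
        'time': datetime.strptime(m.group('time'), '%d/%b/%Y:%H:%M'),
        'article_id': int(m.group('article_id')),
        'user_id': int(m.group('user_id')),
    }

sample = '67.15.143.7 - - [02/Jan/2015:08:12] "GET/click?article_id=25&user_id=104 HTTP/1.1" 200 2327'
print(parse_click(sample))
```

Converting to datetime and int up front makes later sorting or grouping by time or ID simpler than working with the raw strings.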
Ahsanul Haque