
I am doing some practice on weblog parsing and here is a question on regex:

The log file is in the format of:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

I need to get the timestamp, here is what I have now:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}))', 1).alias('timestamp'),

This returns me:

01/Aug/1995:00:00:01 -0400

My question is: what does -0400 mean? Is it the time zone? How do I remove it?

TheSharpieOne
mdivk
  • Do you have any understanding of how regular expressions work? Because it should be very obvious to you which part of the regular expression matches `-0400` in that string. And yes, it's a time zone. – miken32 Jul 28 '16 at 02:47
  • Regex honestly confuses me, but I am willing to learn because it is really powerful. All I want to know is how to get rid of the -0400, and it looks like all I need to do is remove `-\d{4}` – mdivk Jul 28 '16 at 03:07
  • That is exactly right. `\d` matches a digit and `{4}` means there are four of them. – miken32 Jul 28 '16 at 03:19

1 Answer


Yes, that's the time zone: -0400 is an offset of four hours behind UTC.

You can remove it by deleting the ` -\d{4}` part of the pattern (including the space before the dash). So this is what you're looking for:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}))', 1).alias('timestamp'),


As an explanation of the part that was removed:

  • ` -` matches a space followed by a literal dash
  • \d matches a digit
  • {4} repeats the preceding \d exactly 4 times
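
To see the effect of dropping that part, here is a minimal sketch using Python's built-in `re` module as a stand-in for Spark's `regexp_extract` (an assumption made for illustration: the two regex engines differ, but this particular pattern should behave the same in both). The variable names are hypothetical; the sample line is the one from the question.

```python
import re

# Sample log line from the question.
line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

# Original pattern: ends with " -\d{4}" (space, dash, four digits),
# so the UTC offset is part of the captured text.
with_offset = re.search(r'(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})', line)

# Same pattern with " -\d{4}" deleted: the capture stops after the seconds.
without_offset = re.search(r'(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})', line)

print(with_offset.group(1))     # 01/Aug/1995:00:00:01 -0400
print(without_offset.group(1))  # 01/Aug/1995:00:00:01
```

Note that dropping the offset only makes sense if every line in the log uses the same one; otherwise the extracted times are no longer directly comparable.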
Shafizadeh
  • 9,960
  • 12
  • 52
  • 89