How to extract the timestamp and remove the trailing portion from a weblog using regex in PySpark?

Question

I am practicing weblog parsing and have a question about regex.

The log file has the following format:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

I need to get the timestamp. Here is what I have now:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}))', 1).alias('timestamp'),

This returns:

01/Aug/1995:00:00:01 -0400

My question is: what does -0400 mean? Is it a time zone? How do I remove it?

Do you have any understanding of how regular expressions work? Because it should be very obvious to you which part of the regular expression matches `-0400` in that string. And yes, it's a time zone. - miken32, Jul 28 '16 at 02:47
Regex honestly confuses me, but I am willing to learn because it is really powerful. All I want to know is how to get rid of the -0400, and it seems all I need to do is remove `-\d{4}`. - mdivk, Jul 28 '16 at 03:07
That is exactly right. `\d` matches a digit and `{4}` means there are four of them. - miken32, Jul 28 '16 at 03:19

Shafizadeh · Accepted Answer · 2016-07-28T03:21:09.633

Yes - that's a timezone.
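More precisely, -0400 is a UTC offset: the logged time is four hours behind UTC. The following is a minimal sketch in plain Python (not Spark) of what the offset means. The helper name parse_log_timestamp is hypothetical, and %b assumes an English month abbreviation under the default locale.

```python
from datetime import datetime, timezone


def parse_log_timestamp(raw):
    """Parse a log timestamp such as '01/Aug/1995:00:00:01 -0400'.

    Hypothetical helper for illustration. %z reads the UTC offset and
    %b reads the month abbreviation (assumes an English locale).
    """
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")


parsed = parse_log_timestamp("01/Aug/1995:00:00:01 -0400")

# The offset says how far local time is from UTC (here, 4 hours behind).
offset_hours = parsed.utcoffset().total_seconds() / 3600

# Converting to UTC moves the clock forward by those 4 hours.
as_utc = parsed.astimezone(timezone.utc)

print(offset_hours)
print(as_utc.isoformat())
```

Dropping the offset from the extracted string therefore discards information: the remaining value is a local time with no indication of its zone.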

You can remove it by eliminating the -\d{4} part of the pattern, along with the space before it. So this is what you're looking for:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}))', 1).alias('timestamp'),

Also, as an explanation:
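Read left to right, the shortened pattern matches a two-digit day, a three-character month, a four-digit year, and then hour, minute, and second, each preceded by its separator. Here is a hedged sketch that spells this out with Python's re module instead of Spark, on the assumption that this particular pattern behaves the same way in both regex engines. The names LINE and TIMESTAMP are illustrative only.

```python
import re

# Sample log line from the question.
LINE = (
    'in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
    '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839'
)

# Same pattern as in the answer, written in verbose mode so each piece
# can carry a comment. A single pair of parentheses is enough for group 1.
TIMESTAMP = re.compile(
    r"""
    (            # group 1: the text to keep
      \d\d       # two-digit day
      /\w{3}     # slash, then three word characters for the month
      /\d{4}     # slash, then four-digit year
      :\d{2}     # colon, then hour
      :\d{2}     # colon, then minute
      :\d{2}     # colon, then second
    )
    """,
    re.VERBOSE,
)

match = TIMESTAMP.search(LINE)
# Fall back to an empty string when nothing matches.
timestamp = match.group(1) if match else ""
print(timestamp)  # 01/Aug/1995:00:00:01
```

Because the pattern stops after the seconds, the space and -0400 that follow are simply never captured.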

edited Jul 28 '16 at 03:21

answered Jul 28 '16 at 03:11
