How to extract data from string using Python RegEx?

Question

I have file names in this format:

INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK
INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK
INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK

I want to extract the name before the date part. For instance, in the first file name I want everything before _19-May-2014_13-09-59.121._OK, which yields INC_2AB_22BA_1300435674_218.

I tried the lookbehind method, but I am unable to wrap my head around it at the moment.

Essentially, I am trying to match this pattern: _[0-9]-[aA-bB]-*

is there always the same number of items separated by underscores before the date? — Casimir et Hippolyte, Aug 20 '14 at 00:23
is there always the same number of characters before the date? — Casimir et Hippolyte, Aug 20 '14 at 00:53
Internet Explorer was not letting me reply yesterday. It is not fixed width, though the number of underscores is always fixed. — ThinkCode, Aug 20 '14 at 18:01

hwnd · Answer 1 · 2014-08-20T00:40:18.313

3

If your format is consistent you could use the following.

>>> s = 'INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK'
>>> '_'.join(s.split('_')[0:5])
'INC_2AB_22BA_1300435674_218'

edited Aug 20 '14 at 00:40

answered Aug 20 '14 at 00:24

hwnd

Could you please explain? – ThinkCode Aug 20 '14 at 00:35
This splits on an underscore and then joins the first five list items (indices 0 through 4) with an underscore, recreating the part of your original format before the date. – hwnd Aug 20 '14 at 00:38
@ThinkCode: Could you describe exactly the format of your string? – Casimir et Hippolyte Aug 20 '14 at 00:41
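
The split-and-join idea can also be written by counting underscores from the right. This is a minimal sketch, assuming (as the asker says in a comment) that the number of underscores is fixed, so the date, the time and the status are always the last three underscore-separated fields. The helper name `strip_trailing_fields` is made up for illustration.

```python
def strip_trailing_fields(file_name, trailing_fields=3):
    """Return everything before the last `trailing_fields` underscore-separated fields.

    Assumption: the name always ends with date, time and status fields.
    """
    # rsplit with a maxsplit splits from the right, so the underscores
    # inside the prefix are left untouched.
    return file_name.rsplit('_', trailing_fields)[0]


s = 'INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK'
print(strip_trailing_fields(s))      # INC_2AB_22BA_1300435674_218
print('_'.join(s.split('_')[:5]))    # same result, counting fields from the left
```

Counting from the right keeps working if the prefix ever gains or loses a field, as long as the three trailing fields stay the same.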

score 2 · Accepted Answer · answered Aug 20 '14 at 00:23

You could try the code below:

>>> import re
>>> s = """INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK
... INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK
... INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK"""
>>> m = re.findall(r'^.*?(?=_\d{2}-[A-Z][a-z]{2}-\d{4})', s, re.M)
>>> for i in m:
...     print(i)
... 
INC_2AB_22BA_1300435674_218
INC_2EE_22RE_1560343444_119
INC_2CD_22HY_1652323334_312
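
When the file names arrive one at a time rather than as one multi-line string, the same lookahead can be applied per name. This is only a sketch built on the pattern above; the function name `prefix_before_date` and the choice to return `None` when no date is found are assumptions, not part of the answer.

```python
import re

# Lazy .*? grows one character at a time until the lookahead sees _DD-Mon-YYYY.
DATE_AHEAD = re.compile(r'^.*?(?=_\d{2}-[A-Z][a-z]{2}-\d{4})')


def prefix_before_date(file_name):
    """Return the part of file_name before the date, or None if there is no date."""
    m = DATE_AHEAD.match(file_name)
    return m.group(0) if m else None


print(prefix_before_date('INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK'))
# INC_2AB_22BA_1300435674_218
print(prefix_before_date('no_date_here'))
# None
```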

score 2 · Answer 3 · answered Aug 20 '14 at 00:23

Try this:

.*(?=_\d{1,2}-[a-zA-Z]{3})

It uses a lookahead assertion for the _00-Aaa format of the date you have there.
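
One thing to be aware of with this pattern is that `.*` is greedy, so it stops before the last date-like part rather than the first. For the file names in the question that makes no difference. The sketch below uses a made-up name with two date-like parts, purely to show how the greedy and lazy variants differ.

```python
import re

# Hypothetical name with two date-like parts, used only to show the difference.
name = 'A_01-Jan-2014_B_02-Feb-2014_OK'

greedy = re.search(r'.*(?=_\d{1,2}-[a-zA-Z]{3})', name).group(0)
lazy = re.search(r'.*?(?=_\d{1,2}-[a-zA-Z]{3})', name).group(0)

print(greedy)  # A_01-Jan-2014_B  (stops before the last date-like part)
print(lazy)    # A                (stops before the first one)
```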

answered Aug 20 '14 at 00:23

Vasili Syrakis

score 1 · Answer 4 · answered Aug 20 '14 at 00:28

Looks like the lines have a standard size. Just use

offset = len('INC_2AB_22BA_1300435674_218')
for line in lines:  # lines: any iterable of file names
    print(line[:offset])

answered Aug 20 '14 at 00:28

fabrizioM

If the length is always the same, it is the best answer. – Casimir et Hippolyte Aug 20 '14 at 00:38
Note that if you know it in advance, you don't need to use len. But I assume that it is only here to illustrate the idea. – Casimir et Hippolyte Aug 20 '14 at 00:44
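
As the comments point out, slicing is only correct when every prefix has the same length, and the asker later says the names are not fixed width. The sketch below shows both cases. The helper `fixed_width_prefix` and the shorter file name are invented for illustration.

```python
PREFIX_LEN = len('INC_2AB_22BA_1300435674_218')  # 27 characters


def fixed_width_prefix(file_name, width=PREFIX_LEN):
    """Return the first `width` characters; only valid for fixed-width names."""
    return file_name[:width]


names = [
    'INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK',
    'INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK',
    'INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK',
]
for name in names:
    print(fixed_width_prefix(name))

# Hypothetical name with a shorter numeric field: the slice now runs into the date.
shorter = 'INC_2AB_22BA_130043_218_19-May-2014_13-09-59.121._OK'
print(fixed_width_prefix(shorter))
```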

dawg · Answer 5 · 2014-08-20T03:02:25.057

0

Since your desired data is at the start of the line, anchors make a search fairly easy:

^(.*)(?:_\d{2}-[a-zA-Z]{3}-\d{4})

>>> import re
>>> txt='''\
... INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK
... INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK
... INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK'''
>>> 
>>> re.findall(r'^(.*)(?:_\d{2}-[a-zA-Z]{3}-\d{4})', txt, re.M)
['INC_2AB_22BA_1300435674_218', 'INC_2EE_22RE_1560343444_119', 'INC_2CD_22HY_1652323334_312']

If you want to be more specific and match only the month abbreviations 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec' in the date field, you can do:

>>> re.findall(r'^([^-]+)(?:_\d{2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4})', txt, re.M)

... same output

edited Aug 20 '14 at 03:02

answered Aug 20 '14 at 00:33

dawg

It's not very different from Avanish Raj's answer, but it may be a little faster due to the length of the string before and after the date. – Casimir et Hippolyte Aug 20 '14 at 00:36
To reduce the amount of backtracking, you can replace `.` with `[^-]` – Casimir et Hippolyte Aug 20 '14 at 01:10
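
The `[^-]` suggestion works because no hyphen appears before the date, so the engine can stop at the first hyphen and only has to back up a few characters. Here is a sketch of the month-specific pattern assembled from parts. The names `MONTHS` and `PATTERN` are illustrative, and excluding `\n` from the character class is an added assumption so that a match never spans two lines.

```python
import re

MONTHS = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

PATTERN = re.compile(
    r'^([^-\n]+)'                          # prefix: stops at the first hyphen
    r'_\d{2}-(?:' + MONTHS + r')-\d{4}',   # _DD-Mon-YYYY with a real month name
    re.M,
)

txt = '''\
INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK
INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK
INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK'''

# With a single capturing group, findall returns just the captured prefixes.
print(PATTERN.findall(txt))
```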

hex494D49 · Answer 6 · 2014-08-20T11:03:11.330

0

Yet another solution. If the length is always the same, the following regular expression may be used as well:

^([^$]{27})

Or this one

^(.{27})

edited Aug 20 '14 at 11:03

answered Aug 20 '14 at 00:42

hex494D49

You're better off using `^.{27}` – hwnd Aug 20 '14 at 00:44
Using a regex to get the first 27 characters of a string is overkill and slower than using indexes; see fabrizioM's answer. – Casimir et Hippolyte Aug 20 '14 at 00:48
@CasimiretHippolyte You might be right; just wanted to add another solution :) – hex494D49 Aug 20 '14 at 00:48
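
To make the comparison in these comments concrete, here is a small sketch showing that the regex and the slice agree on a full-length name and differ on a shorter one. The string `'short'` is an arbitrary example.

```python
import re

s = 'INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK'

by_regex = re.match(r'.{27}', s).group(0)
by_slice = s[:27]
print(by_regex == by_slice)  # True

# On input shorter than 27 characters the two behave differently:
print(re.match(r'.{27}', 'short'))  # None, because the regex needs exactly 27 characters
print('short'[:27])                 # short, because slicing returns what is there
```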
