I have file names in this format:

INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK
INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK
INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK

I want to extract the part of the name that precedes the date. For the first file, that means everything before _19-May-2014_13-09-59.121._OK, which yields INC_2AB_22BA_1300435674_218.

I tried a lookbehind approach but can't wrap my head around it at the moment.

Essentially, I am trying to match a pattern like _[0-9]-[aA-bB]-*

ThinkCode

6 Answers

If your format is consistent, you could split on underscores and rejoin the first five fields:

>>> s = 'INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK'
>>> '_'.join(s.split('_')[0:5])
'INC_2AB_22BA_1300435674_218'
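
If the number of underscore-separated fields before the date can vary, counting from the right may be more robust. A minimal sketch, assuming the date, time and status are always the last three fields (the helper name `name_before_date` is hypothetical, not part of the original answer):

```python
def name_before_date(filename):
    # Split at most three times from the right: the date, time and status
    # fields are peeled off, and everything else stays in element 0.
    return filename.rsplit('_', 3)[0]


print(name_before_date('INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK'))
```

This keeps working if a name gains or loses a leading field, but it breaks if the trailing part ever contains a different number of underscores.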
hwnd

You could try the code below, which uses a non-greedy match followed by a lookahead for the date:

>>> import re
>>> s = """INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK
... INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK
... INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK"""
>>> m = re.findall(r'^.*?(?=_\d{2}-[A-Z][a-z]{2}-\d{4})', s, re.M)
>>> for i in m:
...     print(i)
... 
INC_2AB_22BA_1300435674_218
INC_2EE_22RE_1560343444_119
INC_2CD_22HY_1652323334_312
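
When the names arrive one at a time (for example from a directory listing) rather than as one multi-line string, the same pattern can be wrapped in a small helper. This is only a sketch; the names `DATE_AHEAD` and `prefix` are made up for illustration:

```python
import re

# Non-greedy prefix followed by a lookahead for "_DD-Mon-YYYY".
# The lookahead is checked but not included in the match.
DATE_AHEAD = re.compile(r'^.*?(?=_\d{2}-[A-Z][a-z]{2}-\d{4})')


def prefix(filename):
    # Return the part before the date, or None if no date is found.
    m = DATE_AHEAD.match(filename)
    return m.group(0) if m else None


print(prefix('INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK'))
print(prefix('no_date_here'))
```

Returning None for names without a date makes it easy to skip files that do not follow the format.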
Avinash Raj

Try this:

.*(?=_\d{1,2}-[a-zA-Z]{3})

It uses a lookahead assertion for the _00-Aaa start of the date you have there (for example _19-May), so the date itself is not part of the match.
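
To see that the lookahead is zero-width, here is a small Python sketch of the pattern above (the variable names are illustrative only). The match stops right before the underscore that starts the date, and the rest of the string is left untouched:

```python
import re

s = 'INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK'

# Greedy .* backtracks until the lookahead succeeds at "_19-May".
m = re.match(r'.*(?=_\d{1,2}-[a-zA-Z]{3})', s)

name = m.group(0)    # the part before the date
rest = s[m.end():]   # everything the lookahead inspected but did not consume

print(name)
print(rest)
```

Because `.*` is greedy, the match ends at the last position where the lookahead succeeds. For these file names only one such position exists.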

Vasili Syrakis

It looks like the names have a fixed length. If so, just slice each line:

offset = len('INC_2AB_22BA_1300435674_218')
# lines: any iterable of file names, e.g. an open file
for line in lines:
    print(line[:offset])
fabrizioM

Since your desired data is at the start of the line, anchors make a search fairly easy:

^(.*)(?:_\d{2}-[a-zA-Z]{3}-\d{4})

>>> import re
>>> txt='''\
... INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK
... INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK
... INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK'''
>>> 
>>> re.findall(r'^(.*)(?:_\d{2}-[a-zA-Z]{3}-\d{4})', txt, re.M)
['INC_2AB_22BA_1300435674_218', 'INC_2EE_22RE_1560343444_119', 'INC_2CD_22HY_1652323334_312']

If you want to be even more specific and match only 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec' in the date field, you can do:

>>> re.findall(r'^([^-]+)(?:_\d{2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4})', txt, re.M)

... same output
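
As an alternative to listing the month names in the regex, the captured date could be validated with `datetime.strptime`. This is a sketch under assumptions: `PATTERN` and `split_name_and_date` are hypothetical names, and `%b` is locale-dependent, so English month abbreviations are assumed to match the current locale:

```python
import re
from datetime import datetime

# Capture the name and the DD-Mon-YYYY date as named groups.
PATTERN = re.compile(r'^(?P<name>.*)_(?P<date>\d{2}-[A-Za-z]{3}-\d{4})_')


def split_name_and_date(filename):
    # Return (name, date), or None if the name has no valid date.
    m = PATTERN.match(filename)
    if m is None:
        return None
    try:
        # %b expects an abbreviated month name in the current locale.
        date = datetime.strptime(m.group('date'), '%d-%b-%Y').date()
    except ValueError:
        return None
    return m.group('name'), date


print(split_name_and_date('INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK'))
```

This also rejects impossible dates such as 31-Feb-2014, which a regex alone would accept.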

dawg

Yet another solution: if the length is always the same, the following regular expression may be used as well.

^([^$]{27})

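Or this one (inside a character class, $ is a literal dollar sign rather than an anchor, so for these names both patterns match the same 27 characters):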

^(.{27}) 
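
A short Python check of the two patterns, under the assumption that the name part is always 27 characters long (the variable names are illustrative). It also shows where the patterns differ: `[^$]` rejects a literal `$` but accepts a newline, while `.` does the opposite by default:

```python
import re

names = [
    'INC_2AB_22BA_1300435674_218_19-May-2014_13-09-59.121._OK',
    'INC_2EE_22RE_1560343444_119_11-Jun-2014_15-21-32.329._OK',
    'INC_2CD_22HY_1652323334_312_21-Jan-2014_11-15-48.291._OK',
]

by_class = [re.match(r'^([^$]{27})', n).group(1) for n in names]
by_dot = [re.match(r'^(.{27})', n).group(1) for n in names]
by_slice = [n[:27] for n in names]  # plain slicing, no regex needed

print(by_class == by_dot == by_slice)

# Where the two patterns differ:
class_takes_newline = re.match(r'^([^$]{3})', 'a\nb') is not None
dot_takes_newline = re.match(r'^(.{3})', 'a\nb') is not None
class_takes_dollar = re.match(r'^([^$]{3})', 'a$b') is not None
dot_takes_dollar = re.match(r'^(.{3})', 'a$b') is not None
```

For fixed-width names, plain slicing gives the same result and is usually easier to read.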

hex494D49