Searching for content between non-tag strings

Question

I'm using Python to pull data out of some old code. The content of interest is not between neat HTML tags but between delimiter strings made up of punctuation and letters. Instead of getting each piece of content, though, I'm getting everything between the first instance of the opening string and the last instance of the closing string. For example:

>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
>>> start1 = '"text:"'
>>> end1 = '",body'
>>> print(q[q.find(start1)+len(start1):q.rfind(end1)])
content_of_interest_1",body, code code "text:":content_of_interest_2

I'm instead looking to get out each instance of content bounded by start1 and end1, i.e.:

content_of_interest_1, content_of_interest_2

How can I re-phrase my code to get each instance of string-bounded content rather than all bounded content as above?

Possible duplicate: http://stackoverflow.com/questions/766372/python-non-greedy-regexes — sgp, Jun 02 '15 at 10:41

Mazdak · Accepted Answer · 2015-06-02T10:52:22.887

To get the first substring, use q.find for end1 instead of rfind; to get the last one, use rfind for both start1 and end1:

>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'
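If you want to stay with find, one option is to call it in a loop and pass the position where the previous match ended as the second argument. The following is only a minimal sketch of that idea; the helper name find_all_between is made up for illustration and is not part of the original code.

```python
def find_all_between(text, start, end):
    """Collect every substring of text that sits between start and end.

    Hypothetical helper for illustration: it scans left to right and
    resumes each search just after the previous closing delimiter.
    """
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no further opening delimiter
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break  # opening delimiter without a closing one
        results.append(text[begin:stop])
        pos = stop + len(end)
    return results


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
print(find_all_between(q, '"text:"', '",body'))
# ['content_of_interest_1', ':content_of_interest_2']
```

The leading colon in the second item is there because the sample string has an extra ':' right after the second "text:" delimiter.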

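But find only gives you the index of the first occurrence of each delimiter (and rfind only the last), so this approach does not go beyond the first and last matches. A more suitable tool for this kind of task is a regular expression with a non-greedy group:

>>> import re
>>> re.findall(r':"(.*?)"',q)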
['content_of_interest_1', ':content_of_interest_2']
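The pattern above is tailored to this particular string. If you would rather build it from start1 and end1 from the question, a possible sketch is to escape both delimiters with re.escape and put a non-greedy group between them. The function name extract_between is an assumption for illustration, and re.DOTALL is added on the assumption that the content may span several lines.

```python
import re


def extract_between(text, start, end):
    """Return every non-greedy match found between the literal start and end."""
    # re.escape makes punctuation in the delimiters match literally.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # DOTALL (an assumption) lets '.' also match newlines inside the content.
    return re.findall(pattern, text, flags=re.DOTALL)


q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
start1 = '"text:"'
end1 = '",body'
print(extract_between(q, start1, end1))
# ['content_of_interest_1', ':content_of_interest_2']
```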

styvane · Answer 2 · 2015-06-02T10:55:08.637

You can use a regular expression with a positive lookbehind:

import re
re.findall(r'(?<="text:"):?\w+', q)
# ['content_of_interest_1', ':content_of_interest_2']
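Note that \w+ stops at the first character that is not a letter, digit or underscore. If the content can contain spaces or punctuation, one possible variant (a sketch, not tested beyond the sample string) is to pair the lookbehind with a lookahead for the closing delimiter and match lazily in between. The names start1 and end1 are taken from the question.

```python
import re

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
start1 = '"text:"'
end1 = '",body'

# Lookbehind for the opening delimiter, lazy match, lookahead for the closing one.
# An escaped literal has a fixed width, which Python's lookbehind requires.
pattern = r'(?<=' + re.escape(start1) + r').*?(?=' + re.escape(end1) + r')'
matches = re.findall(pattern, q)
print(matches)
# ['content_of_interest_1', ':content_of_interest_2']
```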

edited Jun 02 '15 at 10:55

answered Jun 02 '15 at 10:49
