text = ' My uncle is admitted in the hospital. the address of the hospital is \n Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. '

I am currently using the following regex, but it only returns 'Hills' instead of the required output.

re.findall(r'(\w\S+\s+)(?=Hyderabad){3}', text)

My desired output is - ' Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. '

I want to write a regex that extracts the 3 to 4 strings preceding a city name ('Hyderabad' in this case), whether or not special characters are present in the raw string.

  • That's not 3 to 4 strings, that's 6 strings. This is not a job for a regex. You can split the string by words, look for Hyderabad, then back up until you find a word that doesn't start with a capital letter. – Tim Roberts Mar 30 '21 at 06:11
  • What determines how far you want to go back? Is there a rule that says what counts as part of the address? What distinguishes `"is"` from `"Apollo"` in your example? Are you sure that won't give false results on some other input? – Blckknght Mar 30 '21 at 06:43
  • Will 6 digit PINCODE always be there? – anubhava Mar 30 '21 at 07:22
  • As others pointed out, regex is not a good tool for this job; splitting the string on spaces and going from there is the better option. If you insist, something like this should work: `(\b[A-Z]\w+,? ){3,}Hyderabad - \d{3} \d{3}\.` (adjust the digit part at the end) – Onno Rouast Mar 30 '21 at 07:23
  • @anubhava YES 6 DIGIT PIN CODE WILL ALWAYS BE THERE. – guati dibba Mar 31 '21 at 09:16
  • Is there a regex that can get all the strings before the six-digit pincode, just before the \n (newline character)? – guati dibba Apr 01 '21 at 06:04

2 Answers


Why regular expressions are most probably the wrong approach

As Tim Roberts noted above, this is not a problem that is best handled with a regex. It requires a much more powerful tool than a regular expression alone.

You can see the approaches used for identifying addresses and splitting them into elements like street address, city, zip code, etc. in this answer. I hope it can shed some light onto the complexity of this problem.

Your example suggests that you are in fact trying to extract information about entities such as hospitals and/or their addresses. This can be handled using a Named Entity Recognition tool trained to detect such entities in text.

How to construct lookahead regex

If you use the following regex:

r'((\w\S+\s+){1,6})(?=Hyderabad){3}'

it will extract what you want:

Apollo Health City Campus, Jubilee Hills,

Please see a test example here. Please note that the part of interest is the first matching group, not the text matched in its entirety.
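As a rough sketch of how this might be run in Python, the snippet below wraps the same idea in a hypothetical helper, `words_before_city` (the function name and the `max_words` parameter are illustrative assumptions, not part of the regex above). It drops the `{3}` after the lookahead, since repeating a zero-width assertion does not change what is matched, and it makes the inner group non-capturing so that group 1 holds the whole run of words.

```python
import re

text = (' My uncle is admitted in the hospital. the address of the hospital is \n'
        ' Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. ')

def words_before_city(string, city, max_words=6):
    # Up to `max_words` whitespace-separated tokens that are immediately
    # followed (checked by the lookahead, not consumed) by the city name.
    pattern = r'((?:\w\S+\s+){1,%d})(?=%s)' % (max_words, re.escape(city))
    match = re.search(pattern, string)
    # Group 1 is the run of words; strip the trailing whitespace it ends with.
    return match.group(1).strip() if match else None

result = words_before_city(text, 'Hyderabad')
print(result)
```

The window of six words is a guess that happens to fit this example. With a shorter address the same pattern would also pull in the prose words in front of it, which is the weakness pointed out in the comments.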

sophros
  • Thanks a lot, but is there a regex that can get all the strings before the six-digit pincode, just before the \n (newline character)? – guati dibba Apr 01 '21 at 06:04
  • @guatidibba: it is possible to create such a regex but it is a different question to me. Please post it separately and accept this answer as it seems that it answers your initial question. – sophros Apr 01 '21 at 06:09

You could use a deque:

from collections import deque

text = ' My uncle is admitted in the hospital. the address of the hospital is Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. '

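def guess_address(needle, string):
    """Collect the capitalised words directly preceding `needle`, plus `needle` itself."""
    stack, started = [], False
    de = deque(string.split())

    # Walk the words from right to left
    while de:
        word = de.pop()
        if word == needle:
            # Found the city name: start collecting from here
            stack.append(word)
            started = True
        elif started and word[0].isupper():
            # Capitalised word before the city: assume it belongs to the address
            stack.append(word)
        elif started and word[0].islower():
            # The first lowercase word ends the address
            break

    # Words were collected backwards, so restore the original order
    return stack[::-1]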

stack = guess_address('Hyderabad', text)
print(stack)

Which yields

['Apollo', 'Health', 'City', 'Campus,', 'Jubilee', 'Hills,', 'Hyderabad']
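The desired output in the question also contains the pincode after the city. One possible extension of the same backward walk is sketched below; `guess_full_address` is a hypothetical name, and the `- ddd ddd` pincode layout is an assumption taken from the single example in the question.

```python
import re

text = (' My uncle is admitted in the hospital. the address of the hospital is \n'
        ' Apollo Health City Campus, Jubilee Hills, Hyderabad - 500 033. ')

def guess_full_address(city, string):
    # Find the city followed by a six-digit pincode such as "- 500 033."
    # (this layout is assumed from the example, adjust it for other data).
    tail = re.search(re.escape(city) + r'\s*-\s*\d{3}\s?\d{3}\.?', string)
    if tail is None:
        return None

    # Walk backwards over the words before the city and keep them
    # for as long as they start with an uppercase letter.
    kept = []
    for word in reversed(string[:tail.start()].split()):
        if not word[0].isupper():
            break
        kept.append(word)

    # Restore the original word order and append the city plus pincode.
    return ' '.join(kept[::-1] + [tail.group()])

address = guess_full_address('Hyderabad', text)
print(address)
```

Like the version above, this relies on the address words being capitalised and the preceding prose word being lowercase, so it remains a heuristic rather than a general address parser.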
Jan