
I am having trouble extracting text that continues over several lines using regex. I'm trying to get the values under "REQUIRED QUALIFICATIONS:".

If I use:

    pattern = re.compile(r"JOB RESPONSIBILITIES: .*")
    matches = pattern.finditer(gh)

The output is:

    <_sre.SRE_Match object; span=(161, 227), match='JOB DESCRIPTION:   
       Public outreach and strengthen>

But if I use:

    pattern = re.compile(r"REQUIRED QUALIFICATIONS:  .*")

I get:

    match='REQUIRED QUALIFICATIONS:  \r'>  

Here is the text I'm trying to extract from:

JOB RESPONSIBILITIES: \r\n- Working with the Country Director to provide environmental information\r\nto the general public via regular electronic communications and serving\r\nas the primary local contact to Armenian NGOs and businesses and the\r\nArmenian offices of international organizations and agencies;\r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n- Participating in defining the strategy and policy of CENN in Armenia,\r\nthe Caucasus region and abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field, or 5 years relevant\r\nexperience;\r\n- Oral and written fluency in Armenian, Russian and English;\r\n- Knowledge/ experience of working with environmental issues specific to\r\nArmenia is a plus.\r\nREMUNERATION:

How do I solve this problem? Thanks in advance.

sacuL
nabskim
  • Dot, by default, does not match new lines. You'll have to use the `re.DOTALL` modifier if you want such behavior, i.e. `pattern = re.compile(r"REQUIRED QUALIFICATIONS: .*", re.DOTALL)` – zwer Apr 04 '18 at 15:50
  • @zwer I tried using `pattern = re.compile(r"REQUIRED QUALIFICATIONS: .*", re.DOTALL)`, but the output does not show the whole value; it only shows match='REQUIRED QUALIFICATIONS: \r\n- Degree in environ> – nabskim Apr 04 '18 at 16:03
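
A note on the output quoted in the last comment: the repr of a match object shortens the matched text for display, so a cut-off match='...' does not by itself mean the match is short. The sketch below compares the default dot with re.DOTALL and reads the full match from group(0). The `sample` string is a shortened stand-in assumed for illustration, not the asker's full data.

```python
import re

# Shortened stand-in for the job posting (assumed sample, not the full text)
sample = (
    "JOB RESPONSIBILITIES: \r\n- Helping to organize seminars;\r\n"
    "REQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field;\r\n"
    "- Oral and written fluency in Armenian, Russian and English.\r\n"
    "REMUNERATION:"
)

# Default: the dot matches "\r" but stops at "\n", so only the header line matches
default_match = re.search(r"REQUIRED QUALIFICATIONS: .*", sample)
print(repr(default_match.group(0)))

# re.DOTALL: the dot also matches "\n", so .* runs to the end of the string
dotall_match = re.search(r"REQUIRED QUALIFICATIONS: .*", sample, re.DOTALL)
print(repr(dotall_match))           # the repr shortens the matched text
print(len(dotall_match.group(0)))   # the real match is longer than the repr shows
print(dotall_match.group(0))        # full matched text
```

Reading group(0) (or slicing the string with span()) is the reliable way to inspect what was actually matched.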

2 Answers


You can use a positive lookbehind, (?<=REQUIRED QUALIFICATIONS:), so that the match starts right after the header.

Code:

import re

text = """
JOB RESPONSIBILITIES:
- Working with the Country Director to provide environmental information

to the general public via regular electronic communications and serving

as the primary local contact to Armenian NGOs and businesses and the

Armenian offices of international organizations and agencies;

- Helping to organize and prepare CENN seminars/ workshops;

- Participating in defining the strategy and policy of CENN in Armenia,

the Caucasus region and abroad.
REQUIRED QUALIFICATIONS:

- Degree in environmentally related field, or 5 years relevant

experience;

- Oral and written fluency in Armenian, Russian and English;

- Knowledge/ experience of working with environmental issues specific to

Armenia is a plus.

REMUNERATION:
"""

pattern = r'(?<=REQUIRED QUALIFICATIONS:)(\s.+)?REMUNERATION'

print(re.findall(pattern, text, re.DOTALL))

Output:

['\n\n- Degree in environmentally related field, or 5 years relevant\n\nexperience;\n\n- Oral and written fluency in Armenian, Russian and English;\n\n- Knowledge/ experience of working with environmental issues specific to\n\nArmenia is a plus.\n\n']

Regex information:

Positive lookbehind      (?<=REQUIRED QUALIFICATIONS:)
                         Asserts that the characters REQUIRED QUALIFICATIONS:
                         (matched literally, case sensitive) appear immediately
                         before the current position, without including them
                         in the match.
1st capturing group      (\s.+)?
  \s                     Matches one whitespace character (equivalent to
                         [ \t\n\r\f\v]).
  .+                     Matches any character one or more times, as many
                         times as possible, giving back as needed (greedy).
                         With re.DOTALL the dot also matches newlines.
  ? quantifier           Makes the group optional: it matches zero or one
                         time, preferring one (greedy).
REMUNERATION             Matches the characters REMUNERATION literally
                         (case sensitive); the captured group ends just
                         before it.
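
The greedy .+ works here because REMUNERATION occurs only once in the text; if it appeared again further down, the match would run to the last occurrence. A hedged variant, not part of the original answer, uses a lazy quantifier and stops at the next header instead. It assumes every section header is an all-caps line ending in a colon (a body line such as "NOTE:" would be mistaken for a header), and the helper name `section` is hypothetical.

```python
import re

# Shortened sample in the same layout as the posting (assumed for illustration)
text = (
    "JOB RESPONSIBILITIES:\n- Helping to organize seminars;\n"
    "REQUIRED QUALIFICATIONS:\n- Degree in a related field;\n"
    "- Fluency in Armenian, Russian and English.\n"
    "REMUNERATION:\n"
)

def section(header, text):
    """Return the body under `header`, up to the next ALL-CAPS header or the end."""
    pattern = re.compile(
        r"(?<=" + re.escape(header) + r":)"  # lookbehind: header sits just before
        r"(.*?)"                              # lazy: capture as little as possible
        r"(?=^[A-Z][A-Z ]*:|\Z)",             # lookahead: next header line or end of text
        re.DOTALL | re.MULTILINE,
    )
    found = pattern.search(text)
    return found.group(1).strip() if found else None

print(section("REQUIRED QUALIFICATIONS", text))
```

Because the end of the section is found by a lookahead rather than a hard-coded REMUNERATION, the same helper can be reused for the other headers.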
Aaditya Ura

You can use the same regex as yours with an inline modifier, (?s), added at the front. This is the single-line (dot-all) modifier: it lets the dot (.) match every character, including the newline \n that it skips by default, so a multi-line text can be handled like a single-line string. (The dot already matches \r, which is why your match ended with \r.)

(?s)JOB RESPONSIBILITIES: .*

I used the compiled pattern's match() method and read the full matched string from group(0), as follows:

import re

ss = """JOB RESPONSIBILITIES: \r\n- Working with the Country Director to provide environmental information\r\nto the general public via regular electronic communications and serving\r\nas the primary local contact to Armenian NGOs and businesses and the\r\nArmenian offices of international organizations and agencies;\r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n- Participating in defining the strategy and policy of CENN in Armenia,\r\nthe Caucasus region and abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field, or 5 years relevant\r\nexperience;\r\n- Oral and written fluency in Armenian, Russian and English;\r\n- Knowledge/ experience of working with environmental issues specific to\r\nArmenia is a plus.\r\nREMUNERATION:"""

# (?s) lets the dot match "\n", so .* runs to the end of the string
pattern = re.compile(r"(?s)JOB RESPONSIBILITIES: .*")
print(pattern.match(ss).group(0))

The output is:

JOB RESPONSIBILITIES: 
- Working with the Country Director to provide environmental information
to the general public via regular electronic communications and serving
as the primary local contact to Armenian NGOs and businesses and the
Armenian offices of international organizations and agencies;
- Helping to organize and prepare CENN seminars/ workshops;
- Participating in defining the strategy and policy of CENN in Armenia,
the Caucasus region and abroad.
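REQUIRED QUALIFICATIONS: 
- Degree in environmentally related field, or 5 years relevant
experience;
- Oral and written fluency in Armenian, Russian and English;
- Knowledge/ experience of working with environmental issues specific to
Armenia is a plus.
REMUNERATION: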
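
One caveat worth spelling out: match() only succeeds when the pattern matches at the very start of the string, which is why it works for JOB RESPONSIBILITIES here. For a header further down, such as REQUIRED QUALIFICATIONS, search() is needed. A small sketch on a shortened sample (the `sample` string is an assumption made for brevity):

```python
import re

# Shortened stand-in for the posting (assumed sample)
sample = (
    "JOB RESPONSIBILITIES: \r\n- Organizing seminars;\r\n"
    "REQUIRED QUALIFICATIONS: \r\n- Degree;\r\n"
    "REMUNERATION:"
)

pattern = re.compile(r"(?s)REQUIRED QUALIFICATIONS: .*")

# match() anchors at position 0, where the text starts with another header
print(pattern.match(sample))

# search() scans forward until the pattern matches somewhere in the string
print(pattern.search(sample).group(0))
```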

Alternatively, you can set the dot-all (single-line) modifier through the re module's re.S flag (an alias of re.DOTALL), as follows:

pattern = re.compile(r"JOB RESPONSIBILITIES: .*", re.S)
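
As a quick check, the inline (?s) modifier and the flag argument should produce identical matches. A minimal sketch on a shortened sample string (assumed for illustration, not the full posting):

```python
import re

# Shortened stand-in for the posting (assumed sample)
sample = (
    "JOB RESPONSIBILITIES: \r\n- Organizing seminars;\r\n"
    "REQUIRED QUALIFICATIONS: \r\n- Degree;"
)

inline = re.compile(r"(?s)JOB RESPONSIBILITIES: .*")      # modifier inside the pattern
flagged = re.compile(r"JOB RESPONSIBILITIES: .*", re.S)   # modifier passed as a flag

inline_text = inline.match(sample).group(0)
flagged_text = flagged.match(sample).group(0)

print(re.S is re.DOTALL)             # the two flag names refer to the same flag
print(inline_text == flagged_text)   # both spellings match the same text
```

The inline form is handy when only the pattern string can be supplied; the flag form keeps the pattern itself shorter.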

For more information, please refer to re — Regular expression operations

Thm Lee