1

I am new to using regex.

I have a string in the form

                Waco, Texas     

                Unit Dose 13 and 





           SECTION 011100       SUMMARY OF WORK





    INDEX   PAGE



PART 1. - GENERAL   1

1.1.    RELATED DOCUMENTS   1

1.2.    PROJECT DESCRIPTION 1

1.3.    OWNER   1

1.4.    ARCHITECT/ENGINEER  2

1.5.    PURCHASE CONTRACTS  2

1.6.    OWNER-FURNISHED ITEMS   2

1.7.    CONTRACTOR-FURNISHED ITEMS  3

1.8.    CONTRACTOR USE OF PREMISES  3

1.9.    OWNER OCCUPANCY 3

1.10.   WORK RESTRICTIONS   4

PART 2. - PRODUCTS - NOT APPLICABLE 4

PART 3. - EXECUTION - NOT APPLICABLE    4

I apologize for the extra white space, but this is the form of the Word document I parsed to obtain the string.

I need to capture all of the headings between PART 1, PART 2, and PART 3 and store each group in a separate list. So far I have:

matchedtext = re.findall('(?<=PART) (.*?) (?=PART)', text, re.DOTALL)

If I understand correctly, these lookarounds should use PART as a sort of anchor point and grab the text in between. However, matchedtext is empty when I run the code.

The second part of my problem: once I have the text between the different occurrences of PART, how can I save just the capitalized headings in a list, with one string per heading? Some of my strings from the Word documents contain lowercase words, but I only want the words that are all in caps.

To summarize: how can I grab the text between specific words in a string, and once I have it, how can I save the headings as individual strings in a list?

Thanks for the help! :D

inbinder
Jstuff

3 Answers

4

You don't even need regex; just use the string method split. If s is the name of your string, it would be:

s.split('PART')

This will include the text before the first PART, so don't use the first element of the list:

texts_between_parts = s.split('PART')[1:]

You can later check if a word is all upper case using the string method isupper.
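As a rough sketch of the split approach, the snippet below also shows one likely reason the lookaround pattern from the question returns nothing: it demands a literal space right before the next PART, but in the sample every PART follows a newline. The `text` value is a shortened, hypothetical stand-in for the parsed document, and the `fixed` pattern is only one possible adjustment, not the only way to repair it.

```python
import re

# Hypothetical, shortened stand-in for the string parsed from the Word document.
text = (
    "PART 1. - GENERAL   1\n\n"
    "1.1.    RELATED DOCUMENTS   1\n\n"
    "PART 2. - PRODUCTS - NOT APPLICABLE 4\n\n"
    "PART 3. - EXECUTION - NOT APPLICABLE    4\n"
)

# Pattern from the question: the literal space before (?=PART) never matches
# here, because each PART is preceded by a newline rather than a space.
original = re.findall('(?<=PART) (.*?) (?=PART)', text, re.DOTALL)

# split() needs no lookarounds; the first chunk (text before PART 1) is dropped.
texts_between_parts = text.split('PART')[1:]

# Assumed variant: no literal spaces, and the end of the text (\Z) is accepted
# as a stopping point so the last PART is captured as well.
fixed = re.findall(r'(?<=PART)(.*?)(?=PART|\Z)', text, re.DOTALL)

print(original)
print(len(texts_between_parts))
print(fixed == texts_between_parts)
```

Both approaches would also split inside any word that happens to contain PART (DEPARTMENT, for example), so a real document may need a stricter delimiter.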

Suzana
  • Okay, split is a cool trick, but I'm not sure isupper will work, for a couple of reasons. Sometimes my string contains subtext with sentences, so isupper will grab the capitalized word at the beginning of the sentence. I tried using isupper real quick with upper = ''.join([c for c in text_between_parts if c.isupper()]) and it didn't omit the numbers either. This is why I was trying to use regex. – Jstuff May 31 '16 at 14:59
  • Try `[s for s in c.split() for c in texts_between_parts if s.isupper() and s.isalpha()]` – Suzana May 31 '16 at 15:09
  • I'm sorry, this is a newbie question, but that code doesn't quite work because it returns the text three times over. I am trying to understand what happens in it using this http://stackoverflow.com/questions/17006641/single-line-nested-for-loops but I can't seem to follow it. Can you explain it to me? Thanks – Jstuff May 31 '16 at 15:28
  • Use for loops instead of the double list comprehension. – Suzana May 31 '16 at 15:40
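A minimal sketch of the explicit loops suggested in the last comment, using a hypothetical two-element `texts_between_parts`. The names `chunk`, `word`, and `upper_words` are made up for illustration. Written this way, the order of the loops is easy to see: the outer loop walks over the chunks, and the inner loop walks over the words of the current chunk.

```python
# Hypothetical chunks, as they might come out of s.split('PART')[1:].
texts_between_parts = [
    " 1. - GENERAL   1\n\n1.1.    RELATED DOCUMENTS   1\n\n1.3.    OWNER   1\n\n",
    " 2. - PRODUCTS - NOT APPLICABLE 4\n\n",
]

upper_words = []
for chunk in texts_between_parts:       # outer loop: one chunk per PART
    for word in chunk.split():          # inner loop: whitespace-separated tokens
        # isupper() rejects tokens without uppercase letters (numbers, dashes);
        # isalpha() rejects tokens that mix in digits or punctuation.
        if word.isupper() and word.isalpha():
            upper_words.append(word)

# Equivalent comprehension: the for clauses appear in the same order as the
# nested loops above (outer first, inner second).
same_words = [word
              for chunk in texts_between_parts
              for word in chunk.split()
              if word.isupper() and word.isalpha()]

print(upper_words)
```

Note that this filter works word by word, so tokens such as ARCHITECT/ENGINEER or OWNER-FURNISHED fail isalpha() and are dropped, and multi-word headings come back as separate words.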
1

I would forget about grabbing everything between PART 1 and PART 2, etc. Instead, I would parse each line with the following regex and use group 1 to determine the grouping of the headings.

^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$

Group 1 is the Part Number/Section

Group 2 is the Sub Section (because the group is repeated, only the last character it matched is kept)

Group 3 is the Heading

import re

# Raw string, so the regex backslashes are not treated as string escapes
p = re.compile(r'^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$')
m = p.match('1.4.    ARCHITECT/ENGINEER  2')

if m:
    print('Match found: ', m.groups())
else:
    print('No match')

Match found: ('1', '.', 'ARCHITECT/ENGINEER')
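To illustrate how group 1 could drive the grouping, here is a small sketch that applies the same pattern line by line and collects the headings per section number. The sample `lines` list, the `headings_by_part` name, and the use of `defaultdict` and `strip()` are assumptions for the example, not part of the original answer.

```python
import re
from collections import defaultdict

# Hypothetical lines, e.g. obtained with text.splitlines().
lines = [
    "PART 1. - GENERAL   1",
    "1.1.    RELATED DOCUMENTS   1",
    "    1.4.    ARCHITECT/ENGINEER  2   ",
    "PART 2. - PRODUCTS - NOT APPLICABLE 4",
]

p = re.compile(r'^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$')

headings_by_part = defaultdict(list)
for line in lines:
    # strip() removes stray leading/trailing whitespace that would otherwise
    # stop ^ and $ from lining up with the numbering and the page number.
    m = p.match(line.strip())
    if m:
        # group(1) is the section number, group(3) the heading text.
        headings_by_part[m.group(1)].append(m.group(3))

print(dict(headings_by_part))
```

The PART lines themselves do not match this pattern, since they do not start with a digit, so only the numbered headings are collected.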

tanuki505
  • Would you be able to explain what this regex expression does? It is a little above my current ability. Also should the [^a-z] be [^A-Z] since it is looking for capitalized words? Would it just be implemented like so matches = re.search('^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$', text) ? Thanks – Jstuff May 31 '16 at 15:58
  • `^` anchors at the beginning of the line; `(\d)` captures the first digit (the section number) as group 1; `(\.|\d)+` matches one or more dots or digits (the subsection numbering, e.g. .1.) and, being a repeated group, keeps only its last match as group 2; `\s+` matches the spaces; `([^a-z]+?)` captures the heading as group 3, i.e. a run of characters containing no lowercase letters; `\s+` matches the remaining spaces; `\d$` matches the page number at the end of the line. – tanuki505 May 31 '16 at 16:04
  • Okay, this makes sense but when I run it it does not find any matches. I'm currently trying to find out why. – Jstuff May 31 '16 at 17:59
  • Any ideas as to why would be awesome. Thanks – Jstuff May 31 '16 at 18:10
  • Post some realdata.txt and I will take a look. – tanuki505 May 31 '16 at 18:54
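Without the real data this is only a guess, but one likely cause of "no matches" is running the pattern against the whole multi-line string: by default `^` and `$` match only at the very start and end of the string, not at each line. A minimal sketch with a hypothetical `text`, comparing the default behaviour with `re.MULTILINE`:

```python
import re

# Hypothetical excerpt of the parsed document.
text = (
    "PART 1. - GENERAL   1\n\n"
    "1.1.    RELATED DOCUMENTS   1\n\n"
    "1.4.    ARCHITECT/ENGINEER  2\n\n"
    "1.10.   WORK RESTRICTIONS   4\n"
)
pattern = r'^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$'

# Default flags: ^ only matches at position 0, where the text starts with
# "PART", so nothing is found.
without_flag = re.search(pattern, text)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
with_flag = [(m.group(1), m.group(3))
             for m in re.finditer(pattern, text, re.MULTILINE)]

print(without_flag)
print(with_flag)
```

Leading or trailing whitespace on the lines, which seems plausible for text extracted from a Word document, would also prevent a match; stripping each line first is one way around that.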
0
import re
p = re.compile(r'^(\d)(\.|\d)+\s+([^a-z]+?)\s+\d$')
m = p.match( '1.4.    ARCHITECT/ENGINEER  2' )
if m:
    print('Section: ', m.group(1))
    print('Heading: ', m.group(3))
else:
    print('No match')

# Output 
# Section:  1
# Heading:  ARCHITECT/ENGINEER
tanuki505