I have a file with content something like this:

SUBJECT COMPANY:    

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         MISCELLANEOUS SUBJECT CORP
        CENTRAL INDEX KEY:          0000000000
        STANDARD INDUSTRIAL CLASSIFICATION:  []
        IRS NUMBER:             123456789
        STATE OF INCORPORATION:         DE
        FISCAL YEAR END:            1231

Then later in the file, it has something like this:

<REPORTING-OWNER>

COMPANY DATA:   
    COMPANY CONFORMED NAME:         MISCELLANEOUS OWNER CORP
    CENTRAL INDEX KEY:          0101010101
    STANDARD INDUSTRIAL CLASSIFICATION:  []

What I need to do is capture the company conformed name, central index key, IRS number, fiscal year end, or whatever else I am looking to extract, but only from the subject company section, not the reporting owner section. These lines may appear in any order, or be absent, but I want to capture their values when they are present.

The regex I was trying to build looks like this:

(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,@`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))

The desired results would be as follows:

conformed_name = "MISCELLANEOUS SUBJECT CORP"
CIK = "0000000000"
IRS_number = "123456789"
fiscal_year_end = "1231"

Any flavor of regex is acceptable for this, as I'll adapt to whatever works best for the scenario. Thank you for reading about my quandary and for any guidance you can offer.

Matthew

2 Answers

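I ended up figuring it out on my own. The pattern below anchors on the SUBJECT COMPANY header and then repeats an alternation over the fields, one per line: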

/SUBJECT COMPANY:\s+COMPANY DATA:(?:\s+(?:(?:COMPANY CONFORMED NAME:\s+(?'conformed_name'[^\n]+))|(?:CENTRAL INDEX KEY:\s+(?'CIK'\d{10}))|(?:STANDARD INDUSTRIAL CLASSIFICATION:\s+(?'assigned_SIC'[^\n]+))|(?:IRS NUMBER:\s+?(?'IRS_number'\w{2}-?\w{7,8}))|(?:STATE OF INCORPORATION:\s+(?'state_of_incorporation'\w{2}))|(?:FISCAL YEAR END:\s+(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))\n))+/s
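
For anyone adapting this to Python, here is a minimal sketch of the same pattern, assuming the standard `re` module. Python spells named groups `(?P<name>...)` rather than `(?'name'...)`. The `SAMPLE` text and the function name `extract_subject_company` are illustrative, not part of the original file or pattern.

```python
import re

# Illustrative sample modeled on the question; real files may differ.
SAMPLE = """\
SUBJECT COMPANY:

    COMPANY DATA:
        COMPANY CONFORMED NAME:         MISCELLANEOUS SUBJECT CORP
        CENTRAL INDEX KEY:          0000000000
        STANDARD INDUSTRIAL CLASSIFICATION:  []
        IRS NUMBER:             123456789
        STATE OF INCORPORATION:         DE
        FISCAL YEAR END:            1231

<REPORTING-OWNER>

COMPANY DATA:
    COMPANY CONFORMED NAME:         MISCELLANEOUS OWNER CORP
    CENTRAL INDEX KEY:          0101010101
    STANDARD INDUSTRIAL CLASSIFICATION:  []
"""

# Same structure as the pattern above, written in verbose mode.
# Literal spaces are escaped because re.VERBOSE ignores bare whitespace.
SUBJECT_COMPANY = re.compile(
    r"""
    SUBJECT\ COMPANY:\s+COMPANY\ DATA:
    (?:\s+(?:
        COMPANY\ CONFORMED\ NAME:\s+(?P<conformed_name>[^\n]+)
      | CENTRAL\ INDEX\ KEY:\s+(?P<CIK>\d{10})
      | STANDARD\ INDUSTRIAL\ CLASSIFICATION:\s+(?P<assigned_SIC>[^\n]+)
      | IRS\ NUMBER:\s+?(?P<IRS_number>\w{2}-?\w{7,8})
      | STATE\ OF\ INCORPORATION:\s+(?P<state_of_incorporation>\w{2})
      | FISCAL\ YEAR\ END:\s+(?P<fiscal_year_end>(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1]))\n
    ))+
    """,
    re.VERBOSE,
)


def extract_subject_company(text):
    """Return the fields captured from the subject company block, if any."""
    match = SUBJECT_COMPANY.search(text)
    if match is None:
        return {}
    # Each named group keeps the value from the iteration in which it matched;
    # groups for lines that were absent stay None and are dropped here.
    return {name: value for name, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    print(extract_subject_company(SAMPLE))
```

The repetition stops at the first line that is not one of the listed fields, which is what keeps the reporting owner block out of the result.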
Matthew

To match only the company section, and only when preceded by “SUBJECT COMPANY”, use a lookbehind:

(?<=SUBJECT COMPANY:\t\n     \n     )(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,@`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
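
Several engines, Python's `re` among them, require a fixed-width lookbehind, and the literal whitespace inside it has to match the file exactly. A different way to anchor on the same header, sketched here in Python as a swapped-in technique rather than the lookbehind itself, is to slice out the subject section first and then run the question's alternation over that slice. The assumption that the section ends at the next line starting with `<` (or at end of text) is mine, as are the names `SECTION`, `FIELDS`, and `extract_subject_fields`.

```python
import re

# Illustrative sample modeled on the question; real files may differ.
SAMPLE = """\
SUBJECT COMPANY:

    COMPANY DATA:
        COMPANY CONFORMED NAME:         MISCELLANEOUS SUBJECT CORP
        CENTRAL INDEX KEY:          0000000000
        STANDARD INDUSTRIAL CLASSIFICATION:  []
        IRS NUMBER:             123456789
        STATE OF INCORPORATION:         DE
        FISCAL YEAR END:            1231

<REPORTING-OWNER>

COMPANY DATA:
    COMPANY CONFORMED NAME:         MISCELLANEOUS OWNER CORP
    CENTRAL INDEX KEY:          0101010101
    STANDARD INDUSTRIAL CLASSIFICATION:  []
"""

# Assumption: the subject section runs from its header to the next line
# that starts with "<", or to the end of the text if there is none.
SECTION = re.compile(r"SUBJECT COMPANY:(.*?)(?=^<|\Z)", re.DOTALL | re.MULTILINE)

# The alternation from the question, with Python-style named groups.
FIELDS = re.compile(
    r"COMPANY CONFORMED NAME:\s*(?P<conformed_name>(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,@`. ]+)"
    r"|CENTRAL INDEX KEY:\s*(?P<cik>\d{10})"
    r"|IRS NUMBER:\s*(?P<IRS_number>\w{2}-?\w{7,8})"
    r"|FISCAL YEAR END:\s*(?P<fiscal_year_end>(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1]))"
)


def extract_subject_fields(text):
    """Collect whichever fields appear in the subject section, in any order."""
    section = SECTION.search(text)
    if section is None:
        return {}
    found = {}
    for match in FIELDS.finditer(section.group(1)):
        # Exactly one named group is set per match; keep only that one.
        found.update({k: v for k, v in match.groupdict().items() if v is not None})
    return found


if __name__ == "__main__":
    print(extract_subject_fields(SAMPLE))
```

Because each field is found independently inside the slice, missing or reordered lines do not affect the others.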
Bohemian