I have a file with content something like this:
SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
Then later in the file, it has something like this:
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
What I need to do is capture the company conformed name, central index key, IRS number, fiscal year end, or whatever I am looking to extract, but only in the subject company section--not the reporting owner section. These lines may be in any order, or not present, but I want to capture their values if they are present.
The regex I was trying to build looks like this:
(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z\-\/\\=|&!#$(){}:;,@`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
The desired results would be as follows:
conformed_name = "MISCELLANEOUS SUBJECT CORP"
CIK = "000000000"
IRS_number = "123456789"
fiscal_year_end = "1231"
Any flavor of regex is acceptable for this, as I'll adapt to whatever works best for the scenario. Thank you for reading about my quandary and for any guidance you can offer.