How to use a "for loop" in Python to extract the year and firm name (for earnings call transcripts) from a txt file

Question

I have a txt file of this type:

Thomson Reuters StreetEvents Event Transcript
E D I T E D   V E R S I O N

Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT

================================================================================
Corporate Participants
================================================================================

My txt file is saved at C:\sam\2003-Sep-10-ABM.N-140985434256-Transcript.txt.

I want to extract only the transcript year (2003) and the firm name (ABM Industries). I used the code below, but it printed every year in the file.

Code:

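import re

# Read the whole transcript; the with-block closes the file afterwards.
with open(r"C:\sam\2003-Sep-10-ABM.N-140985434256-Transcript.txt", "r") as f:
    content = f.read()

pattern = r"\d{4}"  # matches every four-digit run, not only the title year
years = re.findall(pattern, content)
for year in years:
    print(year)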

My Output: 2003 2003 2003 2003 2002 2003 2002 2003 2003 2002 2003 2002 2002 2003 2002 2002 2002 2002 2002 2003 2003 2003 2004 2003 2003 2003 2004 2019

Expected Output: 2003 ABM Industries

It seems like your expected output is just the first entry? Then why are you using a loop? — Guy, Jan 29 '23 at 07:01
You match four digits, you iterate `for year in years` - why do you expect this to provide an output like "2003 *ABM Industries*" then? — MisterMiyagi, Jan 29 '23 at 07:02
Thanks for the comments. Yes, I am looking for code that extracts "2003 ABM Industries". — Janz, Jan 29 '23 at 07:12
I need to collect information on thousands of firms, so I have to extract the firm name and the year in code (hence the loop). — Janz, Jan 29 '23 at 07:15
Well, how do you define a "firm name"? How would you know it's an "ABC Incorporated"-"Earnings Conference Call" and not an "ABC Incorporated Earnings"-"Conference Call"? So far the only clear criterion your code or description has is "four digits" – which is obviously unsuitable to match anything but the year, and perhaps not even that. What is the exact format you are parsing, not just one example? — MisterMiyagi, Jan 29 '23 at 07:24
Extracting the whole text "ABM Industries Earnings Conference Call" would also be fine. In the end, I want to save thousands of transcript names to a CSV file. — Janz, Jan 29 '23 at 07:29

score -1 · Accepted Answer · answered Jan 30 '23 at 09:49

If I understand you correctly, this should work:

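import re

content = """Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT"""
# Four digits, then two whitespace-separated words.
# Note: "{4}+" is a possessive quantifier that needs Python 3.11+;
# older versions raise re.error ("multiple repeat").
pattern = r"\d{4}+\s\w+\s\w+"
years = re.findall(pattern, content)[0]  # first match only
print(years)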

Output: "2003 ABM Industries"
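
The two-word limit can be avoided if every title line ends with the same suffix. A minimal sketch, assuming each title has the form "Q<quarter> <year> <firm name> Earnings Conference Call" as in the sample (other event types would need their own suffix); `TITLE_RE` and the group names `year` and `firm` are illustrative names, not something from the question:

```python
import re

content = """Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT"""

# Assumed title format: "Q<quarter> <year> <firm name> Earnings Conference Call".
# The lazy (.+?) takes everything between the year and the fixed suffix,
# so the firm name may contain any number of words or symbols.
TITLE_RE = re.compile(
    r"^Q[1-4]\s+(?P<year>\d{4})\s+(?P<firm>.+?)\s+Earnings Conference Call\s*$",
    re.MULTILINE,
)

match = TITLE_RE.search(content)
if match:
    print(match["year"], match["firm"])  # 2003 ABM Industries
```

Because the pattern anchors on the fixed suffix rather than on a word count, a title such as "Q1 2004 Jan & Sons Earnings Conference Call" would yield "Jan & Sons" as the firm name.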

OrelS


Note that this will only work for companies with a two-word name. It won't work for "ABC Inc.", "ABC Industries Corporation", or "Jan & Sons", for example. (The `\d{4}+` is an RE syntax and when written "correctly" also matches more than just years...) – MisterMiyagi Jan 30 '23 at 10:47
The author didn't mention boundaries or give enough examples, so I gave a solution for the given case. If you want to select the whole line starting from the year, the pattern should be "\d{4}.*" – OrelS Jan 30 '23 at 10:54
That should have been "The `\d{4}+` is an RE syntax *error*". – MisterMiyagi Jan 30 '23 at 10:55
With the information given so far, there is no way to identify where the company name ends and the other words start – OrelS Jan 30 '23 at 10:59
The given code worked, and that's what I was looking for. Thank you so much. – Janz Jan 30 '23 at 12:32
@Janz you're welcome! If you don't mind, it would be nice if you marked my answer as useful and accepted it, since you found it helpful :) – OrelS Jan 30 '23 at 12:45
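
For the stated goal of processing thousands of transcripts and saving the results to a CSV file, iterating over the files is where a `for` loop actually helps. A rough sketch, assuming all transcripts sit in one folder as `.txt` files, can be read as UTF-8, and carry a title line like the sample's; the function names `extract_year_and_firm` and `transcripts_to_csv` and the output name `transcripts.csv` are hypothetical:

```python
import csv
import re
from pathlib import Path

# Assumed title format, as in the sample transcript:
# "Q<quarter> <year> <firm name> Earnings Conference Call"
TITLE_RE = re.compile(
    r"^Q[1-4]\s+(\d{4})\s+(.+?)\s+Earnings Conference Call\s*$",
    re.MULTILINE,
)


def extract_year_and_firm(text):
    """Return (year, firm) from the first matching title line, or None."""
    match = TITLE_RE.search(text)
    return match.groups() if match else None


def transcripts_to_csv(folder, out_csv):
    """Write one CSV row per *.txt transcript in folder; return the row count."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Encoding is an assumption; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        found = extract_year_and_firm(text)
        if found is None:
            continue  # title line did not fit the assumed format
        year, firm = found
        rows.append([path.name, year, firm])
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "year", "firm"])
        writer.writerows(rows)
    return len(rows)
```

A call such as `transcripts_to_csv(r"C:\sam", "transcripts.csv")` would then produce one row per transcript. Files whose title does not match are skipped silently here; logging their names instead would make it easier to spot formats the pattern misses.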
