I have a txt file of this type:

Thomson Reuters StreetEvents Event Transcript
E D I T E D   V E R S I O N

Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT

================================================================================
Corporate Participants
================================================================================

My txt file is saved at C:\sam\2003-Sep-10-ABM.N-140985434256-Transcript.txt.

I want to extract only the transcript year (2003) and the firm name (ABM Industries). I used the code below, but it returned every year that appears in the file.

Code:

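import re

# Raw strings keep the backslashes in the path and the pattern literal
with open(r"C:\sam\2003-Sep-10-ABM.N-140985434256-Transcript.txt", "r") as f:
    content = f.read()
pattern = r"\d{4}"
years = re.findall(pattern, content)  # every four-digit run in the file
for year in years:
    print(year)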

My Output: 2003 2003 2003 2003 2002 2003 2002 2003 2003 2002 2003 2002 2002 2003 2002 2002 2002 2002 2002 2003 2003 2003 2004 2003 2003 2003 2004 2019

Expected Output: 2003 ABM Industries

Guy
Janz
  • It seems like your expected output is just the first entry. Then why are you using a loop? – Guy Jan 29 '23 at 07:01
  • You match four digits, you iterate `for year in years` - why do you expect this to provide an output like "2003 *ABM Industries*" then? – MisterMiyagi Jan 29 '23 at 07:02
  • Thanks for the comments. Yes, I want code that extracts "2003 ABM Industries" as well. – Janz Jan 29 '23 at 07:12
  • I need to collect information on thousands of firms, so I have to extract the firm name and year in code (hence the loop). – Janz Jan 29 '23 at 07:15
  • Well, how do you define a "firm name"? How would you know it's an "ABC Incorporated"-"Earnings Conference Call" and not an "ABC Incorporated Earnings"-"Conference Call"? So far the only clear criteria your code or description has is "four digits" – which is obviously unsuitable to match anything but the year, and perhaps not even that. What is the exact format you are parsing, not just one example? – MisterMiyagi Jan 29 '23 at 07:24
  • Extracting the whole text "ABM Industries Earnings Conference Call" would also be fine. In the end, I want to save thousands of transcript names to a CSV file. – Janz Jan 29 '23 at 07:29
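
For the batch step described in the comments above, a minimal sketch might look like this. It assumes each transcript contains a title line shaped like "Q<n> <year> <rest of title>", as in the one example shown; `extract_row`, `write_titles`, and the CSV column names are illustrative and not from the original post.

```python
import csv
import re
from pathlib import Path

# Assumption: the title is the first line of the form "Q<n> <year> <title>",
# matching the single example in the question. [ \t]+ keeps the match on one line.
TITLE_LINE = re.compile(r"^Q\d[ \t]+(\d{4})[ \t]+(.*\S)\s*$", re.MULTILINE)


def extract_row(text):
    """Return (year, rest_of_title) for the first title-like line, or None."""
    match = TITLE_LINE.search(text)
    return match.groups() if match else None


def write_titles(folder, out_csv):
    """Scan every .txt file in `folder` and write one CSV row per matching file."""
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "year", "title"])
        for path in sorted(Path(folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            row = extract_row(text)
            if row is not None:  # skip files without a recognizable title
                writer.writerow([path.name, *row])
```

Calling `write_titles(r"C:\sam", "titles.csv")` (the output filename is hypothetical) would write one row per transcript. Files without a title-like line are skipped rather than guessed at, so they can be inspected separately.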

1 Answer

If I understand you correctly, this should work:

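import re

content = """Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT"""
# Four digits, then two whitespace-separated words. No "+" after {4}:
# "\d{4}+" raises re.error before Python 3.11.
pattern = r"\d{4}\s\w+\s\w+"
first_match = re.findall(pattern, content)[0]
print(first_match)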

Output: "2003 ABM Industries"

OrelS
  • Note that this will only work for companies with a two-word name. It won't work for "ABC Inc.", "ABC Industries Corporation", or "Jan & Sons", for example. (The `\d{4}+` is an RE syntax and when written "correctly" also matches more than just years...) – MisterMiyagi Jan 30 '23 at 10:47
  • The author didn't mention boundaries or give enough examples, so I gave a solution for the given situation. If you want to select the whole line starting from the year, the pattern should be "\d{4}.*" – OrelS Jan 30 '23 at 10:54
  • That should have been "The `\d{4}+` is an RE syntax *error*". – MisterMiyagi Jan 30 '23 at 10:55
  • With the information given so far, there is no way to identify where the company name ends and the other words start – OrelS Jan 30 '23 at 10:59
  • The given code has worked, and that's what I expected to find out. Thank you so much. – Janz Jan 30 '23 at 12:32
  • @Janz you're welcome! If you don't mind, it would be nice if you marked my answer as useful and accepted it, since you found it helpful :) – OrelS Jan 30 '23 at 12:45
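
On the two-word limitation raised in the comments: one possible way around it is to anchor on the fixed words that follow the firm name instead of counting words. The sketch below assumes every title ends in "Earnings Conference Call", which is a guess based on the single example in the question; `TITLE_RE` and `parse_title` are illustrative names.

```python
import re

# Assumed title shape: "Q<quarter> <year> <firm name> Earnings Conference Call".
# The suffix comes from the one example in the question; other event types
# would need their own alternatives.
TITLE_RE = re.compile(
    r"^Q[1-4]\s+(?P<year>\d{4})\s+(?P<firm>.+?)\s+Earnings Conference Call\s*$",
    re.MULTILINE,
)


def parse_title(text):
    """Return (year, firm) from the first matching title line, or None."""
    match = TITLE_RE.search(text)
    if match is None:
        return None
    return match.group("year"), match.group("firm")


sample = """Thomson Reuters StreetEvents Event Transcript
E D I T E D   V E R S I O N

Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT
"""
print(parse_title(sample))  # ('2003', 'ABM Industries')
```

Because the firm name is captured lazily up to the fixed suffix, names such as "ABC Inc." or "Jan & Sons" are handled the same way as two-word names. Titles that do not end in the assumed suffix return `None` instead of a partial match.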