I need to extract the uid from a .sgm file. I tried the code below, but it doesn't work. Can anybody help?

Sample .sgm file content:

<miscdoc n='1863099' uid='0001863099_20220120' type='seccomlett' t='frm' mdy='01/20/2022'><rname>Kimbell Tiger Acquisition Corp, 01/20/2022</rname>

<table col='2' type='txt'>
<colspec col='1' colwidth='*'>
<colspec col='2' colwidth='2*'>
<tname>Meta-data</tname>
<tbody>
<row><entry>SEC-HEADER</entry><entry>0001104659-22-005920.hdr.sgml : 20220304</entry></row>
<row><entry>ACCEPTANCE-DATETIME</entry><entry>20220120160231</entry></row>
<row><entry>PRIVATE-TO-PUBLIC</entry></row>
<row><entry>ACCESSION-NUMBER</entry><entry>0001104659-22-005920</entry></row>
<row><entry>TYPE</entry><entry>CORRESP</entry></row>
<row><entry>PUBLIC-DOCUMENT-COUNT</entry><entry>1</entry></row>
<row><entry>FILING-DATE</entry><entry>20220120</entry></row>
<row><entry>FILER</entry></row>

Code I tried:

import os  
# Folder Path
path = "Enter Folder Path" 
# Change the directory
os.chdir(path) 
# Read text File  
def read_file(file_path):
    with open(file_path, 'r') as f:
        print(f.read())  
# iterate through all file
for file in os.listdir():
    # Check whether file is in text format or not
    if file.endswith(".sgm"):
        if 'uid' in file:
            print("true")
        file_path = f"{path}\{file}"
        # call read text file function
        read_file(file_path)

I need to extract the uid value from the above .sgm file. Is there any other way I could do this? What should I change in my code?

  • I *think* SGM files are just a specialised form of XML. If that's the case then there are various XML handling modules available in Python. If it isn't XML and the format is as you show then just open the file, read it line by line looking for a line that begins with ' – DarkKnight May 08 '22 at 11:41

1 Answer

SGM format may just be an XML superset. If it isn't, then for this particular case (and if one can rely on the format being as shown in the question):

import re

def get_uid(filename):
  with open(filename) as infile:
    for line in map(str.strip, infile):
      if line.startswith('<miscdoc'):
        if uid := re.findall("uid='(.*?)'", line):
          return uid[0]
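
To apply the same idea to every .sgm file in a folder, as the question's loop attempts, the lookup can be wrapped in a small helper. The following is only a sketch under stated assumptions: the names `UID_PATTERN`, `extract_uid`, and `uids_in_folder` are hypothetical, the uid is assumed to sit in the opening `<miscdoc ...>` tag as in the sample, and the file encoding is not known. Note that it searches the file contents, not the file name.

```python
import re
from pathlib import Path

# Matches uid='...' or uid="..." inside a <miscdoc ...> opening tag.
# Assumption: the uid attribute lives on <miscdoc>, as in the sample file.
UID_PATTERN = re.compile(r"""<miscdoc\b[^>]*?\buid=(['"])(.*?)\1""")


def extract_uid(text):
    """Return the uid attribute of the first <miscdoc> tag, or None."""
    match = UID_PATTERN.search(text)
    return match.group(2) if match else None


def uids_in_folder(folder):
    """Map each .sgm file name in folder to its uid (None if not found)."""
    result = {}
    for sgm_path in sorted(Path(folder).glob("*.sgm")):
        # errors="replace" is an assumption: the real encoding is unknown.
        text = sgm_path.read_text(errors="replace")
        result[sgm_path.name] = extract_uid(text)
    return result
```

Calling `uids_in_folder("Enter Folder Path")` would then return a dictionary keyed by file name. A regular expression is used here instead of an XML parser because the sample contains unclosed tags such as `<colspec>`, which a strict XML parser would likely reject.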
DarkKnight