I am trying to parse a series of XML files with Beautiful Soup and extract certain values embedded in tags, using these functions:

case_feature_keys = ['year', 'offenceCategory', 'offenceSubcategory']
person_feature_keys = ['gender', 'age', 'occupation', 'given']
outcome_key = 'verdictCategory'


def get_person_features(trial_account, person_type: str):
    person_features = {}
    for key in person_feature_keys:
        matches = [
            x for x in trial_account.find_all(type=key)
            if person_type in x.parent.attrs.get("type", "")
        ]
        if matches:
            person_features[person_type + "_" + key] = matches[0]
    return person_features

def process_trial_account(trial_account) -> dict:
    """
    Takes in a single account and returns a dictionary representing a row of the table.
    """

    case_features = {key: trial_account.find(type=key) for key in case_feature_keys}
    defendant_features = get_person_features(trial_account, 'defendant')
    victim_features = get_person_features(trial_account, 'victim')
    outcome = trial_account.find(type=outcome_key)


    features = {**case_features, **defendant_features, **victim_features, "outcome": outcome or {}}
    return {key: value.get("value") for key, value in features.items()}

The XML file looks like this:

</persName>
                
             .</p>
 <p>
 <persName id="t18100221-1-person52">
                   GEORGE 
                   ROSS
                <interp inst="t18100221-1-person52" type="surname" value="ROSS"/>
 <interp inst="t18100221-1-person52" type="given" value="GEORGE"/>
 <interp inst="t18100221-1-person52" type="gender" value="male"/>
 </persName>
             . Q. Were you in trade - A. Yes, as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
 <join result="persNameOccupation" targOrder="Y" targets="t18100221-1-victim51 t18100221-1-viclabel3"/>; I lived in <placeName id="t18100221-1-crimeloc4">New Basinghall-street</placeName>
 <interp inst="t18100221-1-crimeloc4" type="placeName" value="New Basinghall-street"/>
 <interp inst="t18100221-1-crimeloc4" type="type" value="crimeLocation"/>
 <join result="offencePlace" targOrder="Y" targets="t18100221-1-off1 t18100221-1-crimeloc4"/>; the prisoner was my <rs id="t18100221-1-deflabel5" type="occupation">clerk</rs>

The problem I am having is that the occupation is not stored in a 'value' attribute like most of the other features. As the sample above shows, the occupation is the text content of the tag itself, as in '<rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>'. Gender, by contrast, is stored in an attribute, as in '<interp inst="t18100221-1-person52" type="gender" value="male"/>', so the functions above can retrieve it through its type/value pair.

Does anyone know how I can retrieve the occupation for victims and defendants?

Yaz
    Using Beautiful Soup sounds like a long-winded way to parse XML. Would it not be easier just to use a regular XML parser? – halfer May 13 '21 at 20:01

1 Answer

To get occupation and person type you can use this example:

from bs4 import BeautifulSoup

html_data = """
 <persName id="t18100221-1-person52">
                   GEORGE 
                   ROSS
                <interp inst="t18100221-1-person52" type="surname" value="ROSS"/>
 <interp inst="t18100221-1-person52" type="given" value="GEORGE"/>
 <interp inst="t18100221-1-person52" type="gender" value="male"/>
 </persName>
             . Q. Were you in trade - A. Yes, as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
 <join result="persNameOccupation" targOrder="Y" targets="t18100221-1-victim51 t18100221-1-viclabel3"/>; I lived in <placeName id="t18100221-1-crimeloc4">New Basinghall-street</placeName>
 <interp inst="t18100221-1-crimeloc4" type="placeName" value="New Basinghall-street"/>
 <interp inst="t18100221-1-crimeloc4" type="type" value="crimeLocation"/>
 <join result="offencePlace" targOrder="Y" targets="t18100221-1-off1 t18100221-1-crimeloc4"/>; the prisoner was my <rs id="t18100221-1-deflabel5" type="occupation">clerk</rs>
 """

soup = BeautifulSoup(html_data, "html.parser")

for occupation in soup.select('[type="occupation"]'):
    id_ = occupation["id"]
    o = occupation.text
    person_type = "victim" if "vic" in id_ else "defendant"
    print("ID: {} Occupation: {} Person type: {}".format(id_, o, person_type))

Prints:

ID: t18100221-1-viclabel3 Occupation: merchant Person type: victim
ID: t18100221-1-deflabel5 Occupation: clerk Person type: defendant
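
The same idea could be folded back into the question's functions. The following is only a sketch: the helper names tag_value and get_occupation and the ID_MARKERS mapping are hypothetical, and it assumes that occupation labels always carry "viclabel" or "deflabel" in their id, as they do in the sample. Files that follow a different id scheme would need another way to link a label to a person, for example the targets attribute of the <join> elements.

```python
from bs4 import BeautifulSoup

# Assumption: occupation labels carry "viclabel" / "deflabel" in their id,
# as in the sample above; other files may use a different scheme.
ID_MARKERS = {"victim": "viclabel", "defendant": "deflabel"}


def tag_value(tag):
    """Return the 'value' attribute if present, else the tag's stripped text."""
    return tag.get("value") or tag.get_text(strip=True)


def get_occupation(trial_account, person_type):
    """Return the first occupation found for the given person type, or None."""
    marker = ID_MARKERS[person_type]
    for tag in trial_account.find_all("rs", attrs={"type": "occupation"}):
        if marker in tag.get("id", ""):
            return tag_value(tag)
    return None


# Shortened sample built from the tags shown in the question.
sample = """
<p>Yes, as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>;
the prisoner was my <rs id="t18100221-1-deflabel5" type="occupation">clerk</rs>
<interp inst="t18100221-1-person52" type="gender" value="male"/></p>
"""
soup = BeautifulSoup(sample, "html.parser")

victim_occupation = get_occupation(soup, "victim")
defendant_occupation = get_occupation(soup, "defendant")
gender = tag_value(soup.find(type="gender"))  # attribute-based tags still work
```

Because tag_value falls back to the tag text only when there is no 'value' attribute, the final dictionary comprehension in process_trial_account could call it in place of value.get("value") and handle both kinds of tag, provided missing tags (None) are filtered out first.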
Andrej Kesely