I am trying to parse a series of XML files and use Beautiful Soup to extract certain values that are embedded in tags, using these functions:
case_feature_keys = ['year', 'offenceCategory', 'offenceSubcategory']
person_feature_keys = ['gender', 'age', 'occupation', 'given']
outcome_key = 'verdictCategory'
def get_person_features(trial_account, person_type: str):
    person_features = {}
    for key in person_feature_keys:
        matches = [x for x in trial_account.find_all(type=key) if person_type in
                   x.parent.attrs.get("type", "")]
        if matches:
            person_features[person_type + "_" + key] = matches[0]
    return person_features

def process_trial_account(trial_account) -> dict:
    """
    Takes in a single account and returns a dictionary representing a row of the table.
    """
    case_features = {key: trial_account.find(type=key) for key in case_feature_keys}
    defendant_features = get_person_features(trial_account, 'defendant')
    victim_features = get_person_features(trial_account, 'victim')
    outcome = trial_account.find(type=outcome_key)
    features = {**case_features, **defendant_features, **victim_features, "outcome": outcome or {}}
    return {key: value.get("value") for key, value in features.items()}
The XML file looks like this:
</persName>
.</p>
<p>
<persName id="t18100221-1-person52">
GEORGE
ROSS
<interp inst="t18100221-1-person52" type="surname" value="ROSS"/>
<interp inst="t18100221-1-person52" type="given" value="GEORGE"/>
<interp inst="t18100221-1-person52" type="gender" value="male"/>
</persName>
. Q. Were you in trade - A. Yes, as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
<join result="persNameOccupation" targOrder="Y" targets="t18100221-1-victim51 t18100221-1-viclabel3"/>; I lived in <placeName id="t18100221-1-crimeloc4">New Basinghall-street</placeName>
<interp inst="t18100221-1-crimeloc4" type="placeName" value="New Basinghall-street"/>
<interp inst="t18100221-1-crimeloc4" type="type" value="crimeLocation"/>
<join result="offencePlace" targOrder="Y" targets="t18100221-1-off1 t18100221-1-crimeloc4"/>; the prisoner was my <rs id="t18100221-1-deflabel5" type="occupation">clerk</rs>
The problem I am having is that the occupation is not stored in a 'value=' attribute like most of the other features are. If you look at the excerpt above, the occupation is the text inside the tag itself, like so: 'id="t18100221-1-viclabel3" type="occupation">merchant'. Gender, by contrast, appears in a line like this: 'type="gender" value="male"/>', so I can use the function above to get it, because it is stored as a type/value pair. That does not work for occupation.
Does anyone know how I can retrieve the occupation for victims and defendants?
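One possible approach, sketched under some assumptions: the occupation text can be read with get_text() rather than get("value"), and the <join result="persNameOccupation"> element appears to link a person id to an occupation label id through its targets attribute. The sample below is a simplified, hypothetical fragment, not the real file. In particular, the type="victimName" attribute on <persName> is an assumption based on how get_person_features checks the parent's type, and get_occupations is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment modelled on the excerpt in the question.
# The type="victimName" attribute on <persName> is an assumption.
SAMPLE = """
<p>
<persName id="t18100221-1-victim51" type="victimName">
GEORGE ROSS
<interp inst="t18100221-1-victim51" type="gender" value="male"/>
</persName>
as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
<join result="persNameOccupation" targOrder="Y"
      targets="t18100221-1-victim51 t18100221-1-viclabel3"/>
</p>
"""


def get_occupations(trial_account, person_type):
    """Collect occupation labels joined to persons of the given type."""
    occupations = []
    for join in trial_account.find_all("join", result="persNameOccupation"):
        # Resolve every id listed in the join's targets attribute.
        target_ids = join.get("targets", "").split()
        targets = [trial_account.find(id=target_id) for target_id in target_ids]
        targets = [t for t in targets if t is not None]

        # Keep the join only if one target looks like the requested person type.
        has_person = any(person_type in t.get("type", "") for t in targets)
        if not has_person:
            continue

        # The occupation is the element's text, not a value attribute.
        for t in targets:
            if t.get("type") == "occupation":
                occupations.append(t.get_text(strip=True))
    return occupations


soup = BeautifulSoup(SAMPLE, "html.parser")
print(get_occupations(soup, "victim"))  # ['merchant']
```

If this holds for the real files, the first item of the returned list could be added to the row dictionary under a key such as "victim_occupation", separately from the features that are read through get("value").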