I have an XML file that looks like this:

<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
      please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->

<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">

    <!-- The crowd-classifier element will create a tool for the Worker to
 select the correct answer to your question.
          Your image file URLs will be substituted for the "image_url" variable below

          when you publish a batch with a CSV input file containing multiple image file URLs.

          To preview the element with an example image, try setting the src attribute to

          "https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier\n        
src= "https://someone@example.com/abcd.jpg"\n        
categories="[\'Yes\', \'No\']"\n        
header="abcd"\n        
name="image-contains">\n\n       
<!-- Use the short-instructions section for quick instructions that the Worker\n
will see while working on the task. Including some basic examples of\n              
good and bad answers here can help get good results. You can include\n              
any HTML here. -->\n        
<short-instructions>\n\n        
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->

I want to extract the line:

src = https://someone@example.com/abcd.jpg

and assign it to a variable in Python. I'm a bit new to XML parsing.

I tried:

hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']

Error:

    image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']
TypeError: string indices must be integers

If I drop the ['crowd-image-classifier'] lookup and limit myself to

hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']

then I get the complete XML content back.

How can I access the image src?

  • It looks like `hit_doc['HTMLQuestion']['HTMLContent']` returns a list with multiple `crowd-image-classifier`, not a dict. Try `image_url = hit_doc['HTMLQuestion']['HTMLContent'][0]['crowd-image-classifier']`. – luigibertaco Nov 26 '19 at 06:00
  • Have edited my question now. Hope it helps better. PS, still got the same error. – Ajay Bhagchandani Nov 26 '19 at 06:07

2 Answers

You can use BeautifulSoup. See the working code below.

from bs4 import BeautifulSoup


html = '''<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
      please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->

<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">

    <!-- The crowd-classifier element will create a tool for the Worker to
 select the correct answer to your question.
          Your image file URLs will be substituted for the "image_url" variable below

          when you publish a batch with a CSV input file containing multiple image file URLs.

          To preview the element with an example image, try setting the src attribute to

          "https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier\n        
src= "https://someone@example.com/abcd.jpg"\n        
categories="[\'Yes\', \'No\']"\n        
header="abcd"\n        
name="image-contains">\n\n       
<!-- Use the short-instructions section for quick instructions that the Worker\n
will see while working on the task. Including some basic examples of\n              
good and bad answers here can help get good results. You can include\n              
any HTML here. -->\n        
<short-instructions>\n\n        
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->'''

soup = BeautifulSoup(html, 'html.parser')
element = soup.find('crowd-image-classifier')
print(element['src'])

Output:

https://someone@example.com/abcd.jpg
balderman
  • The line ```hit_doc = xmltodict.parse(get_hit['HIT']['Question'])['HTMLQuestion']['HTMLContent']``` is giving me the same xml content with type **str**. However, on using the statements ```soup = BeautifulSoup(hit_doc, 'html.parser') element = soup.find('crowd-image-classifier') print(element)``` , it's returning me the element as None. :( – Ajay Bhagchandani Nov 27 '19 at 08:02
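One possible cause of the None result, judging only from the literal \n and \' sequences visible in the question's markup: if the real string contains those backslash sequences as plain characters, the tag name is no longer exactly crowd-image-classifier and the lookup can miss it. The sketch below is a standard-library illustration under that assumption, not a confirmed diagnosis. SrcFinder, find_src and sample are hypothetical names, and the sample is a shortened copy of the question's markup.

```python
from html.parser import HTMLParser


class SrcFinder(HTMLParser):
    """Records the src attribute of the first start tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == self.tag_name and self.src is None:
            self.src = dict(attrs).get('src')


def find_src(raw, tag_name='crowd-image-classifier'):
    # Assumption: the string holds literal backslash sequences, so turn them
    # into the characters they stand for before parsing.
    cleaned = raw.replace('\\n', '\n').replace("\\'", "'")
    finder = SrcFinder(tag_name)
    finder.feed(cleaned)
    finder.close()
    return finder.src


# Raw string, so \n and \' stay as literal backslash sequences.
sample = r'''<crowd-form answer-format="flatten-objects">
<crowd-image-classifier\n
src= "https://someone@example.com/abcd.jpg"\n
categories="[\'Yes\', \'No\']"\n
header="abcd"\n
name="image-contains">\n\n
<short-instructions>\n\n
</crowd-image-classifier>
</crowd-form>'''

print(find_src(sample))
```

The same cleanup step could be applied to the string before handing it to BeautifulSoup. Note that the blanket replace would also alter any backslash-n sequence that legitimately belongs to the content.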

I switched to using xml.etree.ElementTree.

The syntax I ended up with is roughly:

import xml.etree.ElementTree as ET

root = ET.fromstring(hit_doc)
for child in root:
    # Each child is assumed to hold a name in its first sub-element
    # and the matching value in its second.
    if child[0].text == 'crowd-image-classifier':
        image_data = child[1].text
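For context on the TypeError in the question: an MTurk HTMLQuestion normally wraps the task HTML in a CDATA section, so after XML parsing the HTMLContent value is one plain string rather than a nested structure, and indexing it with 'crowd-form' fails. The sketch below illustrates this with the standard library only. The question_xml envelope is a hypothetical stand-in for get_hit['HIT']['Question'], and extract_html_content is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for get_hit['HIT']['Question']; the real envelope
# is assumed to wrap the HTML in a CDATA section like this.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
<crowd-form answer-format="flatten-objects">
  <crowd-image-classifier src="https://someone@example.com/abcd.jpg" name="image-contains">
    <short-instructions>
  </crowd-image-classifier>
</crowd-form>
]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""


def extract_html_content(xml_text):
    """Return the text of the first HTMLContent element, or None."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Tags look like '{namespace}HTMLContent'; compare the local name only.
        if el.tag.rsplit('}', 1)[-1] == 'HTMLContent':
            return el.text
    return None


html_content = extract_html_content(question_xml)

# The CDATA payload comes back as a single str, markup included.
print(type(html_content).__name__)

try:
    html_content['crowd-form']
except TypeError as exc:
    # Same kind of failure as in the question: a str cannot take a string key.
    print('TypeError:', exc)
```

Because the inner markup is HTML rather than well-formed XML (short-instructions is never closed in the sample), the extracted string is better handed to an HTML parser than to a second XML parse.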