I want to extract the value of the "archivo" key from something like this:

...
<applet name="bla" code="Any.class" archive="Any.jar">
<param name="abc" value="space='1' archivo='bla.jpg'" </param>
<param name="def" value="space='2' archivo='bli.jpg'" </param>
<param name="jkl" value="space='3' archivo='blu.jpg'" </param>
</applet>
...

I want to end up with a list like [bla.jpg, bli.jpg, ...], so I tried options like:

inputTag = soup.findAll("param",{'value':'archivo'})

or

inputTag = soup.findAll(attrs={"value" : "archivo"})

or

inputTag = soup.findAll("archivo")

and I always get an empty list: []

Other unsuccessful options:

inputTag = soup.findAll("param",{"value" : "archivo"}.contents)

This raises an error along the lines of: 'dict' object has no attribute 'contents'

inputTag = unicode(getattr(soup.findAll('archivo'), 'string', ''))

I get nothing.

Finally, I looked at the question "Difference between attrMap and attrs in beautifulSoup" and tried:

for tag in soup.recursiveChildGenerator():
    print tag['archivo']

This finds nothing; the key apparently has to be an actual tag attribute such as name, code or archive.

And lastly:

tag.attrs = [(key,value) for key,value in tag.attrs if key == 'archivo']

but tag.attrs comes back empty.


OK, with jcollado's help I was able to build the list this way:

import re

imageslist = []
patron = re.compile(r"archivo='([\w\./]+)'")
for tag in soup.findAll('param'):
    # Search once and reuse the match; skip params without an archivo key.
    match = patron.search(tag['value'])
    if match:
        imageslist.append(match.group(1))
Antonio

1 Answer

The problem here is that archivo isn't an attribute of param, but something inside the value attribute. To extract archivo from value, I suggest using a regular expression as follows:

>>> import re
>>> archivo_regex = re.compile(r"archivo='([\w\./]+)'")
>>> [archivo_regex.search(tag['value']).group(1)
... for tag in soup.findAll('param')]
[u'bla.jpg', u'bli.jpg', u'blu.jpg']
jcollado
  • If I try to create the list with [archivo_regex.search(tag['value']).group(1) for tag in soup.findAll('param')], I get: 'NoneType' object has no attribute 'group' – Antonio Feb 16 '12 at 16:58
  • That means that the regular expression didn't match as expected. From the comment above, I see you also need the slash character. I've edited my answer to include it. – jcollado Feb 16 '12 at 17:04
  • The problem persists. Perhaps I simplified the code too much; here is the site (it has several image files, as you will see): http://descartes.cnice.mec.es/heda/ASIPISA/ASIPISA_M/unidades/escalera/escalera.html' . The code stopped showing tag = (thanks for your time) – Antonio Feb 16 '12 at 17:19
  • Perhaps the error is simply because "value" does not always contain "archivo". – Antonio Feb 16 '12 at 17:28
  • Nor is "archivo" always in the same position; I see my example was very bad. – Antonio Feb 16 '12 at 17:41
  • I'm not able to open the web page because of permissions problems (even after removing the trailing `'`), so unless you provide a more complete example I won't be able to help. Sorry. – jcollado Feb 16 '12 at 18:26
  • Sorry... http://descartes.cnice.mec.es/heda/ASIPISA/ASIPISA_M/unidades/escalera/escalera.html, is this OK? – Antonio Feb 17 '12 at 14:21
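
As the comments point out, the 'NoneType' error appears when a value string has no archivo key, and archivo may sit anywhere inside the string. A minimal sketch of a tolerant version follows, assuming the value strings have already been pulled out of the param tags. The helper name extract_archivos and the sample data are hypothetical, written for Python 3, and only the regular expression comes from the answer above.

```python
import re

# Same pattern as in the answer: captures the file name between the quotes.
ARCHIVO_RE = re.compile(r"archivo='([\w\./]+)'")


def extract_archivos(values):
    """Return the archivo file names found in an iterable of value strings.

    Strings without an archivo key (and None entries) are skipped
    instead of raising an AttributeError on .group().
    """
    found = []
    for value in values:
        if not value:
            continue
        match = ARCHIVO_RE.search(value)
        if match:
            found.append(match.group(1))
    return found


# Hypothetical sample values, standing in for what the param tags might hold.
sample = [
    "space='1' archivo='bla.jpg'",
    "color='red'",                      # no archivo key
    "archivo='img/bli.jpg' space='2'",  # archivo in a different position
    None,                               # param without a value attribute
]
print(extract_archivos(sample))  # ['bla.jpg', 'img/bli.jpg']
```

With BeautifulSoup this could presumably be called as extract_archivos(tag.get('value') for tag in soup.findAll('param')), since tag.get returns None rather than raising when the attribute is missing.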