I wrote some code that searches for all tags whose text matches any value in a list and, when one matches, collects the text of a sibling tag. When I search for the values one at a time the output is correct, but when I search for all of them together some results are missing. I suppose the problem is in how I build the pattern for re.compile(), but I can't tell what it is.

Any help will be appreciated, Thanks in advance!

link_economics=[]
number_contracts=len(soup.find_all('entry'))
for i in range(0,number_contracts):
    try: 
        link_list = list()
        economic_list=['Apertura econ(o|ó)mica','criterios evaluables mediante f(o|ó)rmulas']
        eco_list=re.compile('(.*{0}.*)'.format('|'.join(economic_list)),re.I)
        for link_1_tags in soup.find_all('entry')[i].find('cac-place-ext:ContractFolderStatus').find_all('cac-place-ext:GeneralDocument'):
            if eco_list.match(link_1_tags.find('cac-place-ext:GeneralDocumentDocumentReference').find('cac:Attachment').find('cac:ExternalReference').find('cbc:FileName').get_text()):
                link_1_tags_1=link_1_tags.find('cac-place-ext:GeneralDocumentDocumentReference').find('cac:Attachment').find('cac:ExternalReference').find('cbc:URI').get_text()
                link_list.append(link_1_tags_1)
            else:
                continue
        link_economics.append(link_list)
    except:
        link_economics.append('NaN') 

An example of the file structure would be:

<entry>
    <cac-place-ext:ContractFolderStatus> 
        <cac-place-ext:GeneralDocument> 
            <cac-place-ext:GeneralDocumentDocumentReference>
                <cac:Attachment>
                    <cac:ExternalReference>
                        <cbc:URI>https://...</cbc:URI>
                        <cbc:FileName>Informe valoración criterios evaluables mediante fórmulas</cbc:FileName>

An extended example (zip file from the Spanish Treasury) can be found here:

https://contrataciondelestado.es/sindicacion/sindicacion_643/licitacionesPerfilesContratanteCompleto3_202012.zip

PMig

1 Answer

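You could be more concise by using select with a CSS selector, then find_previous_sibling to step from a matching cbc:FileName back to its cbc:URI (note the lower-case uri: the lxml HTML parser lower-cases tag names). I also switched to re.search. re.match only matches at the start of the string, and in your combined pattern the leading .* belongs to the first alternative only, so a file name that merely contains the second phrase is never matched.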

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('''
    <entry>
    <cac-place-ext:ContractFolderStatus> 
        <cac-place-ext:GeneralDocument> 
            <cac-place-ext:GeneralDocumentDocumentReference>
                <cac:Attachment>
                    <cac:ExternalReference>
                        <cbc:URI>https://...</cbc:URI>
                        <cbc:FileName>Informe valoración criterios evaluables mediante fórmulas</cbc:FileName>''', "lxml")


link_economics = []
economic_list = ['Apertura econ(o|ó)mica', 'criterios evaluables mediante f(o|ó)rmulas']
eco_list = re.compile('(.*{0}.*)'.format('|'.join(economic_list)), re.I)

# Search inside each entry so every contract gets its own list of links
for entry in soup.find_all('entry'):
    link_list = []
    try:
        # Raw string so the backslash that escapes the colon reaches the selector intact
        for file_name_tag in entry.select(r'cbc\:FileName'):
            # search() scans the whole text; match() would only try position 0
            if eco_list.search(file_name_tag.get_text()):
                link_list.append(file_name_tag.find_previous_sibling('cbc:uri').text)
        link_economics.append(link_list)
    except AttributeError:
        # A matching cbc:FileName had no preceding cbc:URI sibling
        link_economics.append('NaN')

link_economics
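
To see why the combined pattern misses some file names under re.match, here is a minimal sketch that uses only the re module and the file name from the example. The second file name, and the variable names original, grouped and single, are made up for illustration. The | operator splits everything inside the outer parentheses, so the pattern reads as ".*Apertura econ(o|ó)mica" OR "criterios evaluables mediante f(o|ó)rmulas.*", and the second branch has no leading wildcard.

```python
import re

economic_list = ['Apertura econ(o|ó)mica',
                 'criterios evaluables mediante f(o|ó)rmulas']
file_name = 'Informe valoración criterios evaluables mediante fórmulas'
other_name = 'Acta apertura económica'  # hypothetical second file name

# Pattern from the question: '|' has the lowest precedence, so the leading
# '.*' belongs to the first alternative and the trailing '.*' to the second.
original = re.compile('(.*{0}.*)'.format('|'.join(economic_list)), re.I)

# Alternatives wrapped in their own non-capturing group; no wildcards needed
# when the pattern is used with search().
grouped = re.compile('(?:{0})'.format('|'.join(economic_list)), re.I)

# Single-term pattern, as when the values are searched one by one.
single = re.compile('(.*{0}.*)'.format(economic_list[1]), re.I)

match_original = original.match(file_name)    # None: text starts with "Informe"
match_other = original.match(other_name)      # matches via the first alternative
search_original = original.search(file_name)  # matches: search scans the string
search_grouped = grouped.search(file_name)    # matches
match_single = single.match(file_name)        # matches: '.*' now leads the term

print(match_original, bool(match_other), bool(search_original),
      bool(search_grouped), bool(match_single))
```

This reproduces the symptom from the question: each term matches on its own, but once the terms are joined only the first one keeps its leading wildcard. Either switching to search or grouping the alternatives avoids it.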
QHarr
  • Thanks. One question, please: why did you switch from re.match to re.search, and from the last .get_text() to .text? What's the difference? – PMig Apr 13 '21 at 19:21
  • https://stackoverflow.com/questions/35496332, and re.search scans the whole string (including multi-line text) for a match, whereas re.match only matches at the start – QHarr Apr 13 '21 at 19:28