I wrote some code that searches for all tags whose text matches any of a list of values and, when one matches, gets a sibling tag. When I search for the values one at a time, the output is fine, but when I search for all of them together, some matches are missing. I suppose the error is in the pattern I pass to re.compile(), but I can't tell what it is.
Any help will be appreciated. Thanks in advance!
link_economics=[]
number_contracts=len(soup.find_all('entry'))
for i in range(0,number_contracts):
    try:
        link_list = list()
        economic_list=['Apertura econ(o|ó)mica','criterios evaluables mediante f(o|ó)rmulas']
        eco_list=re.compile('(.*{0}.*)'.format('|'.join(economic_list)),re.I)
        for link_1_tags in soup.find_all('entry')[i].find('cac-place-ext:ContractFolderStatus').find_all('cac-place-ext:GeneralDocument'):
            if eco_list.match(link_1_tags.find('cac-place-ext:GeneralDocumentDocumentReference').find('cac:Attachment').find('cac:ExternalReference').find('cbc:FileName').get_text()):
                link_1_tags_1=link_1_tags.find('cac-place-ext:GeneralDocumentDocumentReference').find('cac:Attachment').find('cac:ExternalReference').find('cbc:URI').get_text()
                link_list.append(link_1_tags_1)
            else:
                continue
        link_economics.append(link_list)
    except:
        link_economics.append('NaN')
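
The symptom can be reproduced without BeautifulSoup, which may help narrow it down. The following is only a minimal sketch: the first file name is a made-up sample, and `grouped` is a hypothetical variant that is not part of the code above. It assumes the cause is that the `|` produced by `join()` sits at the top level of the group, so the leading `.*` applies only to the first alternative and the trailing `.*` only to the last one. Because `match()` anchors at the start of the string, a file name that has text before the second value would then be skipped.

```python
import re

economic_list = ['Apertura econ(o|ó)mica',
                 'criterios evaluables mediante f(o|ó)rmulas']

# Same construction as in the loop above. The resulting pattern is
# (.*Apertura econ(o|ó)mica|criterios evaluables mediante f(o|ó)rmulas.*)
joined = re.compile('(.*{0}.*)'.format('|'.join(economic_list)), re.I)

# Hypothetical variant: wrap the alternatives in a non-capturing group
# so that both '.*' parts apply to every alternative.
grouped = re.compile('.*(?:{0}).*'.format('|'.join(economic_list)), re.I)

# Sample file names: the first is made up for illustration,
# the second is the one from the file structure example.
file_names = ['Acta apertura económica',
              'Informe valoración criterios evaluables mediante fórmulas']

joined_hits = [name for name in file_names if joined.match(name)]
grouped_hits = [name for name in file_names if grouped.match(name)]

print(joined.pattern)
print(joined_hits)   # only the first file name
print(grouped_hits)  # both file names
```

If that assumption holds, an equivalent option would be to drop the `.*` parts entirely and call `search()` instead of `match()` on `re.compile('|'.join(economic_list), re.I)`.
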
An example of the file structure would be:
<entry>
  <cac-place-ext:ContractFolderStatus>
    <cac-place-ext:GeneralDocument>
      <cac-place-ext:GeneralDocumentDocumentReference>
        <cac:Attachment>
          <cac:ExternalReference>
            <cbc:URI>https://...</cbc:URI>
            <cbc:FileName>Informe valoración criterios evaluables mediante fórmulas</cbc:FileName>
An extended example (zip file from the Spanish Treasury) can be found here: