I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. The page has many links and changes frequently. The HTML I'm scraping is full of sections that look something like this:

<section class="onecol habonecol">
 <a href="https://longGibberishDownloadURL" title="Download">
  <img src="\azure_storage_blob\includes\download_for_windows.png"/>
 </a>
 sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>

The second-to-last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search with a partial string to pull out the correct download URL. The code I have attempted so far looks something like this:

import requests, re
from bs4 import BeautifulSoup

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))

I've been searching StackOverflow for a long time now, and while there are similar questions and answers, none of the proposed solutions (re.compile, lambdas, etc.) have worked for me so far. I can pull up a section if I remove the string argument, but when I include a partial matching string I get None as my result. I'm unsure what to put for the string argument (the ? above) to find a match based on partial text, for example to find the filename that has "CIcyano" somewhere in it (see the second-to-last line of the HTML example at the top).

I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of them really works. I was able to pull up other sections from the HTML using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.

Is the filename perhaps considered part of the section id, so that the string argument can't find it? An example of a section on the page that I AM able to find has HTML like the one below. I can easily find it using the string argument and re.compile with "Name", "^N", etc.

<section class="onecol habonecol">
 <h3>
  Name
 </h3>
</section>

Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.

Here is the full HTML of the page I'm scraping, if that helps clarify the structure I'm working against.

Koelker12

3 Answers

0

I believe you are overthinking this. Just remove the regular expression part, take the text, and you will be fine.

import requests
from bs4 import BeautifulSoup

reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
DoMajor7th
  • Thank you for the response, but I'm not sure this addresses my issue. I don't need to just see the text from whatever first section was returned, I need to find the correct section out of hundreds on the page. But perhaps I'm misunderstanding your solution. Should I try using find_all() and then loop through each item in the list, using .text? I think I might end up with the same issue, and that also seems less efficient than just grabbing the correct section from the get-go using the string argument in find(). – Koelker12 Dec 01 '22 at 23:21
  • Can you elaborate which part you would like to get exactly? – DoMajor7th Dec 02 '22 at 17:05
  • There are many sections that contain text with something that looks like this: sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif . I want to grab the first instance of a filename that has 'CIcyano' in it. I actually ended up getting what I needed using find_all() and a for loop that breaks once it finds a match, using the .text method you suggested. I've added it as an answer. – Koelker12 Dec 02 '22 at 21:39
0

You can query inside every section for the string you want. Like so:

soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))

This regular expression matches any text that has sentinel in it. Be careful: the filename may be preceded by other characters such as spaces, which is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here: https://regex101.com/
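One possible explanation for the None results reported in the question: according to the BeautifulSoup documentation, combining a tag name with the string argument compares the pattern against the tag's .string, and .string is None whenever a tag has more than one child. The download sections hold both an <a> tag and a bare filename, so they would never match that way. A minimal sketch of an alternative is to search the text nodes themselves and then climb to the enclosing section. The sample HTML and the example.com URL below are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the structure shown in the question.
html = """
<section class="onecol habonecol">
 <h3>Name</h3>
</section>
<section class="onecol habonecol">
 <a href="https://example.com/download/abc" title="Download">
  <img src="download_for_windows.png"/>
 </a>
 sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# The call from the question: tag name plus string. The question reports
# that this returns None for the download sections.
tag_match = soup.find("section", string=re.compile("CIcyano"))
print(tag_match)

# With no tag name, find() searches the text nodes directly.
text_node = soup.find(string=re.compile("CIcyano"))

# Climb from the matching text node to the section that contains it.
section = text_node.find_parent("section")

# The section has several children, so its .string is None.
print(section.string)

download_url = section.find("a")["href"]
print(download_url)
```

If no text node matches, find() returns None, so real code should check text_node before calling find_parent() on it.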

  • Thanks for the response. I'm seeing the same result though. It works for other sections, for example when I search for 'Name' within the re.compile(), and I will get the section shown in the second HTML example in my post. But when searching for sections with the sentinel filenames, like your suggestion, I still get a None result. Using https://regex101.com/ (cool resource btw, thanks) it is showing that I should be able to get it using the regular expression r'sentinel.*' , among many others, but when I run it in my code I still get None. – Koelker12 Dec 01 '22 at 23:13
0

I ended up finding another method that doesn't use the string argument in find(). Instead I use something like the code below, which pulls the first section that contains a partial text match.

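sections = soup.find_all('section', attrs={'class': 'onecol habonecol'})

dwn_url = None  # stays None if no section matches
for s in sections:
    # s.text includes the bare filename that follows the <a> tag
    if 'CIcyano' in s.text:
        print(s)
        # Pull the download link out of the matching section only
        dwn_url = s.find('a').get('href')
        break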

This works for my purposes: it fetches the first instance of the matching filename and grabs the URL.
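The same loop-and-check idea can also be handed to find() as a filter function, which is roughly what the lambda-based suggestions mentioned in the question do. The sketch below is only one way to package it. The helper name find_download_url, the sample HTML, and the example.com URL are all invented for illustration.

```python
from bs4 import BeautifulSoup


def find_download_url(soup, fragment):
    """Return the href of the first link inside the first matching
    section, or None if nothing matches. Hypothetical helper."""

    def is_match(tag):
        # BeautifulSoup calls this once per tag and keeps the first
        # tag for which it returns True.
        return (
            tag.name == "section"
            and "habonecol" in tag.get("class", [])
            and fragment in tag.get_text()
        )

    section = soup.find(is_match)
    if section is None:
        return None
    link = section.find("a", href=True)
    return link["href"] if link is not None else None


# Hypothetical sample that mirrors the structure shown in the question.
html = """
<section class="onecol habonecol">
 <h3>Name</h3>
</section>
<section class="onecol habonecol">
 <a href="https://example.com/download/abc" title="Download">
  <img src="download_for_windows.png"/>
 </a>
 sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
"""

soup = BeautifulSoup(html, "html.parser")
print(find_download_url(soup, "CIcyano"))
print(find_download_url(soup, "no-such-file"))
```

Unlike the string argument, get_text() gathers all text inside the section, so the filter works no matter how many child tags sit next to the filename.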

Koelker12