Scrape text after blockquote bs4

Question

I have something like this in HTML:

<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>

My code in Python:

import requests
from bs4 import BeautifulSoup

# `site` holds the URL of the page being scraped
page = requests.get(site)
soup = BeautifulSoup(page.content, 'html.parser')
rounds = soup.find('p', align="left")
matches_links = rounds.find_all('a')

I get every link up to the commented-out one, plus the text after it, but I can't get anything after </blockquote></blockquote>. These two blockquotes don't show up in the page source; I only see them in soup when I debug my Python code. soup contains all of the HTML, but rounds ends at <tt>text after comment</tt></p>.

Is there any way to get "link i want" and "text i want"?

Looks like the data you are looking for is dynamically added to the DOM. You should consider scraping with a headless browser, using a tool like Selenium — balderman, Aug 19 '20 at 09:58
"but in rounds code ends on text after comment": this is because your `<p>` tag ends there — Gagan T K, Aug 19 '20 at 10:01

score 1 · Accepted Answer · answered Aug 19 '20 at 10:14

If you look at the HTML, you will see that there is a </p> before </blockquote></blockquote>. That means your variable rounds doesn't contain the link you want. Search for the next <a> after this <p> tag:

from bs4 import BeautifulSoup


txt = '''
<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''

soup = BeautifulSoup(txt, 'html.parser')

matched_link = soup.select_one('p[align="left"] ~ a')
print(matched_link)

Prints:

<a href="link i want"><tt>text i want</tt></a>
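
For comparison, here is a minimal sketch of the same lookup written with find_next_sibling instead of a CSS selector, starting from the rounds variable in the question. It assumes the same sample HTML and html.parser as above. If the link is not a sibling of the <p> on the real page, it returns None, just as the selector would. Note that find_next('a') would not help here, because it also walks into the <p> itself and returns the first link inside it.

```python
from bs4 import BeautifulSoup

# Same sample HTML as in the answer above
txt = '''
<p align="left"><strong><tt>
        some text:</tt></strong><tt> (8/4)</tt><a href="some link"><tt>some other text</tt></a><tt>, (9/4)</tt><a href="some other link"><tt><br/>
        some text:</tt></strong><tt>, (19/6)</tt><!--a href="some link in comment"--><tt>text after comment</tt></p></blockquote></blockquote><tt>, </tt><a href="link i want"><tt>text i want</tt></a><strong><tt><br/>
...
</p>
'''

soup = BeautifulSoup(txt, 'html.parser')
rounds = soup.find('p', align='left')

# First <a> that follows the <p> at the same nesting level
link = rounds.find_next_sibling('a')

if link is not None:
    print(link['href'], '->', link.get_text(strip=True))
```

With the sample above this prints: link i want -> text i want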

Thank you, of course you are right. But your solution doesn't work for me, `matched_link` is empty. Could you please explain to me what exactly `'p[align="left"] ~ a'` means? — dylo, Aug 19 '20 at 10:34
@dylo `p[align="left"] ~ a` is a CSS selector; it selects the next `<a>` tag preceded by a `<p align="left">` element. You can try `print(soup.find_all('a'))` to see whether the desired `<a>` tag is really there. — Andrej Kesely, Aug 19 '20 at 10:37
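
A rough sketch of that check, for anyone in the same situation: list every link in soup and report whether it sits inside or outside the matched <p>. The HTML below is a trimmed-down stand-in for the question's markup and is_inside is a hypothetical helper, not part of bs4. If the wanted link shows up outside the <p> but under a different parent, the sibling selector cannot match it, and picking it from find_all('a') directly may be the simpler route.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the markup in the question (assumption)
txt = (
    '<p align="left"><tt>some text:</tt>'
    '<a href="some link"><tt>some other text</tt></a>'
    '<tt>text after comment</tt></p></blockquote></blockquote>'
    '<tt>, </tt><a href="link i want"><tt>text i want</tt></a>'
)

soup = BeautifulSoup(txt, 'html.parser')
rounds = soup.find('p', align='left')


def is_inside(tag, ancestor):
    """Hypothetical helper: True if `ancestor` is one of `tag`'s parents."""
    return any(parent is ancestor for parent in tag.parents)


# Show where each link ended up in the parsed tree
for a in soup.find_all('a'):
    where = 'inside <p>' if is_inside(a, rounds) else 'outside <p>'
    print(a.get('href'), '|', where, '| parent:', a.parent.name)

# Links that rounds.find_all('a') can never return
outside = [a['href'] for a in soup.find_all('a') if not is_inside(a, rounds)]
```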
