
I am trying to scrape parts of this page: https://markets.businessinsider.com/stocks/bp-stock using BeautifulSoup, by searching for text contained in the h2 titles of tables.

When I do:

data_table = soup.find('h2', text=re.compile('RELATED STOCKS')).find_parent('div').find('table')

It correctly gets the table I am after.

When I try to get the "Analyst Opinion" table using a similar line, it returns None:

data_table = soup.find('h2', text=re.compile('ANALYST OPINIONS')).find_parent('div').find('table')

I am guessing that there might be some special characters in the HTML that prevent re from working as expected. I also tried this:

data_table = soup.find('h2', text=re.compile('.*?STOCK.*?INFORMATION.*?', re.DOTALL))

without success.

I would like to get the table that contains the text "Analyst Opinion" by checking for that text directly, rather than by finding all tables first.

Any ideas would be highly appreciated. Best

Je Je

1 Answer

You can use a CSS selector to locate the <table>:

import requests
from bs4 import BeautifulSoup

url = 'https://markets.businessinsider.com/stocks/bp-stock'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

table = soup.select_one('div:has(> h2:contains("Analyst Opinions")) table')

for tr in table.select('tr'):
    print(tr.get_text(strip=True, separator=' '))

Prints:

2/26/2018 BP Outperform RBC Capital Markets
9/22/2017 BP Outperform BMO Capital Markets

More about CSS selectors here.
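
A side note, as far as I know: newer Soup Sieve releases (2.1 and later) deprecate :contains() in favour of :-soup-contains(). The sketch below shows the same selector with the newer spelling. It runs against a made-up HTML fragment (SAMPLE_HTML is hypothetical and only mimics the heading-plus-table layout), so it is an illustration rather than a check against the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption for illustration)
SAMPLE_HTML = """
<div>
  <h2>Related Stocks</h2>
  <table><tr><td>Shell</td></tr></table>
</div>
<div>
  <h2>Analyst Opinions</h2>
  <table>
    <tr><td>2/26/2018</td><td>BP</td><td>Outperform</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Pick the <div> whose direct child <h2> contains the text, then its <table>
table = soup.select_one('div:has(> h2:-soup-contains("Analyst Opinions")) table')

rows = [tr.get_text(strip=True, separator=" ") for tr in table.select("tr")]
print(rows)
```

Like :contains(), this match is case sensitive, so the text has to be written as it appears in the HTML source, not as CSS renders it.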


EDIT: For a case-insensitive method, you can use the bs4 API with regular expressions (note the flags=re.I). This is the equivalent of the .select_one() call above:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://markets.businessinsider.com/stocks/bp-stock'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

h2 = soup.find(lambda t: t.name=='h2' and re.findall('analyst opinions', t.text, flags=re.I))
table = h2.find_parent('div').find('table')

for tr in table.select('tr'):
    print(tr.get_text(strip=True, separator=' '))
Andrej Kesely
  • Great, thanks. Out of curiosity, I wondered why re.compile was working for some but not all tables... But looking at your code, it seems much more elegant anyway. – Je Je Aug 20 '19 at 17:49
  • @NonoLondon The `"Analyst Opinions"` is not all upper case - it's only styled by CSS to appear all uppercase I think. You can try to add `flags=re.I` to your regex to ignore case. – Andrej Kesely Aug 20 '19 at 17:50
  • so I tried: ```data_table = soup.find('h2', text=re.compile('Analyst.*?Opinions', flags=re.I))``` and ```data_table = soup.find('h2', text=re.compile('Analyst Opinions', flags=re.I))``` and both return None – Je Je Aug 20 '19 at 18:14
  • @NonoLondon I was able to find the `<h2>` tag with this command: `data_table = soup.find(lambda tag: tag.name=='h2' and re.findall(r'analyst opinion', tag.text, flags=re.I))` – Andrej Kesely Aug 20 '19 at 18:25
  • I have noticed that the contains method is case sensitive. Is there a way to make it case-insensitive, please? – Je Je Aug 28 '19 at 18:11
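
One possible explanation for why the text=re.compile(...) search returned None even with flags=re.I, while the lambda version worked: the text= (or string=) argument is matched against the tag's .string, which is None whenever the tag has more than one child, whereas .text / get_text() joins all nested strings. The thread does not confirm that this is what happens on the real page, so the markup below is a hypothetical fragment built only to show the difference:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading wraps part of its text in a child tag
SAMPLE_HTML = """
<div><h2>Related Stocks</h2><table id="related"></table></div>
<div><h2>Analyst <span>Opinions</span></h2><table id="analyst"></table></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile("analyst opinions", flags=re.I)

# string= is compared with tag.string, which is None for the second <h2>
by_string = soup.find("h2", string=pattern)

# get_text() concatenates every nested string, so the pattern can match
by_text = soup.find(lambda tag: tag.name == "h2" and pattern.search(tag.get_text()))

print(by_string)                       # no match via .string
table = by_text.find_parent("div").find("table")
print(table["id"])
```

If the heading on the live page has nested tags, this would explain why the plain "Related Stocks" heading matched and the other did not.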