I have to use requests_html because the page content is generated by JavaScript. The relevant HTML:

<td class="text-left worker-col truncated"><a href="/account/0x58e0ff2eb3addd3ce75cc3fbdac3ac3f4e21fa/38-G1x" style="color:red">38-G1</a></td>

I want to find all names (38-G1 in this case) that are shown in red. I want to search for them by style="color:red". Is this possible with requests_html? How can I do this?

Dixy

2 Answers

Edit: in this case the styling is added by JavaScript after the page loads, so you have to wait for the whole page to render before scraping it. Selenium is the way to go.

You can grab the page this way, just as Fazlul did:

from bs4 import BeautifulSoup as bs
import time
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

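# options= replaces the deprecated chrome_options= keyword; chromedriver is looked up on PATH
driver = webdriver.Chrome(options=chrome_options)

driver.get("URL")  # placeholder: replace with the page address
time.sleep(5)  # crude fixed wait so the JavaScript can finish rendering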

html = bs(driver.page_source, 'html.parser')

Then you can either use a CSS attribute substring selector (*=) and print the text of each matching anchor:

anchors = html.select('a[style*="color:red"]')

print([a.text for a in anchors])

OR

You could loop over all <a> tags and collect the ones whose style attribute contains color:red.

anchors = html.select('a')

names = []
for a in anchors:
    if 'style' in a.attrs and "color:red" in a.attrs['style']:
        names.append(a.text)
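
Both variants above match the literal substring color:red, so an inline style written as color: red (with a space) or COLOR:RED would be missed, and background-color:red would be picked up by mistake. Below is a minimal sketch of a more tolerant match, assuming the page source has already been fetched. The sample HTML, the RED_STYLE pattern, and the red_names helper are made up for illustration and are not taken from the original page:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: a "color" declaration (not "background-color")
# whose value is "red", tolerating whitespace and letter case.
RED_STYLE = re.compile(r"(?:^|;)\s*color\s*:\s*red\b", re.IGNORECASE)


def red_names(page_source):
    """Return the text of every <a> whose inline style sets color to red."""
    soup = BeautifulSoup(page_source, "html.parser")
    # Passing a compiled regex as an attribute filter keeps only tags
    # whose style attribute matches it; tags without style are skipped.
    return [a.get_text(strip=True) for a in soup.find_all("a", style=RED_STYLE)]


# Made-up sample rows for illustration only.
sample = """
<table>
  <tr><td><a href="/x/1" style="color:red">38-G1</a></td></tr>
  <tr><td><a href="/x/2" style="font-weight:bold; color: RED;">47-G1</a></td></tr>
  <tr><td><a href="/x/3" style="background-color:red">49-O15</a></td></tr>
  <tr><td><a href="/x/4">90_GGX</a></td></tr>
</table>
"""

print(red_names(sample))
```

With the Selenium snippet above, the same helper could be called as red_names(driver.page_source).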

Edit: I see another user gave you a solution with BeautifulSoup. If you're new to web scraping but plan on learning more, I'd also recommend learning BeautifulSoup. It's not only more powerful, but its user base is much larger, so it's easier to find solutions to your problems.

zoltankundi

I use both HTMLSession and Selenium with bs4. Selenium works fine, but the HTMLSession version below does not render the JavaScript.

Code with Selenium (works):

from bs4 import BeautifulSoup
import time
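from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service('chromedriver.exe'))
url = URL  # placeholder: replace with the page address as a string
driver.get(url)
time.sleep(8)  # fixed wait so the JavaScript can finish rendering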

soup = BeautifulSoup(driver.page_source, 'html.parser')
for t in soup.select('table.table.table-bordered.table-hover.table-responsive tr'):
    txt= t.select_one('td:nth-child(2) > a')
    text= txt.text if txt else None
    print(text)

Output:

38-G15
47_G15_2   
47-G1      
49-O15     
90_GGX     
91_ASF     
105_MGPM_3 
112-GG3    
121-APRO   
188-MGPM1  
198-AP     
248_MGPM_1 
262-GUD    
265_ASF    
302-AD     
355-GUD.2  
Rig_3471855
rigEdge    
107_MGPM_3 
None
None

Code with HTMLSession (does not render JavaScript):

from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get(URL)
# response.content is the HTML as served, before any JavaScript runs
soup = BeautifulSoup(response.content, 'html.parser')

for t in soup.select('table.table.table-bordered.table-hover.table-responsive tr'):
    txt= t.select_one('td:nth-child(2) > a')
    text= txt.text if txt else None
    print(text)
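
The HTMLSession snippet above never calls render(), so it parses only the HTML as served. requests_html can execute the page's JavaScript through response.html.render(), which drives a headless Chromium that is downloaded on first use. Below is a hedged sketch of how the red names might be collected that way. The function names, the 5-second sleep, and the 20-second timeout are assumptions, and whether rendering succeeds depends on the site:

```python
import re

# Hypothetical pattern: inline "color: red", ignoring whitespace and case,
# without matching "background-color: red".
RED_STYLE = re.compile(r"(?:^|;)\s*color\s*:\s*red\b", re.IGNORECASE)


def is_red(style):
    """True if an inline style string sets the text color to red."""
    return bool(style) and RED_STYLE.search(style) is not None


def red_names_rendered(url, wait_seconds=5):
    """Render the page with requests_html and return the red link texts."""
    # Imported here so the helper above works without requests_html installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    # Run the page's JavaScript in headless Chromium; sleep gives the
    # scripts time to apply the styles before the HTML is re-read.
    response.html.render(sleep=wait_seconds, timeout=20)
    anchors = response.html.find("a[style]")
    return [a.text for a in anchors if is_red(a.attrs.get("style"))]
```

It would be called as red_names_rendered(URL), with URL being the same placeholder used in the snippets above.
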
Md. Fazlul Hoque