I am trying to scrape this site using the `requests` library and `BeautifulSoup`. So here's the deal, folks: I simply do a `requests.get()` and convert its `.content` to a bs4 object. (My parser is `"html5lib"`.)
Before that, as you can see in the image here, there is a `class` attribute on this tag.
You can inspect the element on the site yourself to see what I am talking about: visit the site, right-click the boxed word, and choose Inspect Element.
Now when I get the `.content`, convert it to a bs4 object, and print out the `soup` variable, the `<p>` tag I am pointing out with the arrows no longer has the `class` attribute; it is simply a plain `<p>`. Check it out here. I already tried searching for the class value itself in Sublime Text, but there are no results, so the `class` attribute is indeed NOT INCLUDED.
(I cannot put the whole value of `soup` here since it is too long; I suggest you print it out too.)
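One way to double-check this without searching the printout by hand is to ask BeautifulSoup which class values are actually present on the parsed tags. The sketch below is only an illustration: `classes_on_tags` is a hypothetical helper, the sample string stands in for `r.content`, and it uses the built-in `"html.parser"` instead of `"html5lib"` just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def classes_on_tags(html, tag_name):
    """Return the set of class values found on every <tag_name> in html."""
    soup = BeautifulSoup(html, "html.parser")
    found = set()
    for tag in soup.find_all(tag_name):
        # tag.get("class") is a list of class names, or None if the attribute is absent
        found.update(tag.get("class") or [])
    return found


# Made-up stand-in for r.content: one plain <p> and one <p> that carries a class.
sample = '<div><p>plain</p><p class="example-class">marked</p></div>'
print(classes_on_tags(sample, "p"))
```

If the class seen in the browser inspector is not in the set returned for the real response, it was never in the HTML that the server sent.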
You might be wondering why I need the `class` attribute. I need it to gather the relevant data, and it has to be SPECIFICALLY based on that class: I cannot just `.find()` a `p` tag, since in many cases there are other `p` tags that are not the data I am trying to get. So I am just being precise.
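To illustrate the difference, here is a minimal sketch on made-up HTML. The class name `"target-class"` is a placeholder, not the real value from the site, and `"html.parser"` is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Made-up HTML: several <p> tags, only one of which is the data of interest.
html = """
<div>
  <p>unrelated paragraph</p>
  <p class="target-class">the data I want</p>
  <p>another unrelated paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A bare find("p") returns the first <p>, whatever it happens to be.
first_p = soup.find("p")
# Filtering on class_ picks out only the paragraph carrying that class.
wanted = soup.find("p", class_="target-class")

print(first_p.get_text())
print(wanted.get_text())

# When the class is not present in the HTML, find() returns None.
missing = soup.find("p", class_="not-in-the-html")
print(missing)
```

The last lookup is the situation described above: with the class absent from the response, the precise search has nothing to match.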
Here is the simple code I made. Note that I have already tried adding a User-Agent header to imitate a browser, since my searching suggested it, but still no luck :( Can someone help and explain why this is happening? Thank you!
```python
import requests
from bs4 import BeautifulSoup

word = "opierać"
url = f"https://pl.wiktionary.org/wiki/{word}#{word}_(j%C4%99zyk_polski)"
# Browser-like User-Agent, in case the default one is served different HTML
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

r = requests.get(url, headers=headers)
print("Status code:", r.status_code)

# Parse the raw response body; scripts on the page are never executed here
soup = BeautifulSoup(r.content, "html5lib")
print(soup)
```

Those class attributes are being generated by JavaScript code, so you won't be able to get them using **BeautifulSoup** or **requests**; instead you could use **Selenium**. Or try to take that info without relying on the class.
– Gealber Nov 30 '20 at 05:03
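
A minimal sketch of the Selenium route suggested in the comment, assuming the class really is added by scripts after the page loads. It also assumes the `selenium` package (version 4) and a local Chrome are installed. `fetch_rendered_html` and `paragraphs_with_class` are hypothetical helper names, and `"target-class"` is a placeholder for the real class value.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_seconds=10):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait for the document to finish loading. Scripts that run later may
        # need a more specific wait condition than this one.
        WebDriverWait(driver, wait_seconds).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        return driver.page_source
    finally:
        # Always close the browser, even if loading fails.
        driver.quit()


def paragraphs_with_class(html, class_name):
    """Return the text of every <p> in html that carries class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p", class_=class_name)]


# Usage (needs Chrome and the selenium package; "target-class" is a placeholder):
# html = fetch_rendered_html("https://pl.wiktionary.org/wiki/opierać")
# print(paragraphs_with_class(html, "target-class"))
```

The browser's `page_source` reflects the DOM after scripts have run, which is why a class seen in the inspector can show up here even though it is missing from `r.content`.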