
I am trying to scrape this site using the requests library and BeautifulSoup. I simply call requests.get() and convert its .content to a bs4 object (my parser is "html5lib").

Before that, as you can see in the image here, there is a class attribute on this tag. You can check it yourself: visit the site, right-click on the boxed word, and choose Inspect Element.

img1

Now, when I get the .content, convert it to a bs4 object, and print out the soup variable, the <p> tag I am pointing to with the arrows no longer has the class attribute; it is simply a <p>. I already tried searching for the class value itself in Sublime Text, but there are no results, so the class attribute is indeed NOT INCLUDED.

(I cannot include the whole value of soup here since it is too long; I suggest you print it out too.)

You might be wondering why I need the class attribute. I need it to gather the relevant data SPECIFICALLY based on that class: I cannot just .find() a p tag, because in many cases there are other p tags that are not the data I am trying to get, so I am just being precise.

img2

Also, here is the simple code I made. Note that I have already tried setting a User-Agent, since I read that this makes the request look like it comes from a browser, but still no luck :( Can someone help me and explain why this is happening? Thank you!

import requests
from bs4 import BeautifulSoup

word = "opierać"

url = f"https://pl.wiktionary.org/wiki/{word}#{word}_(j%C4%99zyk_polski)"
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
r = requests.get(url,headers=headers)

print("Status code: ",r.status_code)
soup = BeautifulSoup(r.content,"html5lib")
print(soup)
Ice Bear
  • This might be helpful : https://stackoverflow.com/questions/3364279/has-anyone-parsed-wiktionary – Harshana Nov 30 '20 at 03:56
  • Thanks but the API doesn't seem to have the task to gather data from a particular page like that one I have provided... I think it seems different :( – Ice Bear Nov 30 '20 at 04:32
  • Would help if this was in English. I don't get it. You want `p` elements from page which have no `class` , but you want `class`? – Abhishek Rai Nov 30 '20 at 04:53
  • It seems like the classes of that tag are being generated by JavaScript code, so you won't be able to get them using **BeautifulSoup** or **requests**; instead you could use **Selenium**. Or try to get that info without relying on the class. – Gealber Nov 30 '20 at 05:03
  • we don't need it to be in english cause I also don't know how to speak polish :D Yeess I am wondering the same thing.. it could be because of javascript.. is there a way to have it? as you can see on the image the class attribute is there but when I get it using requests and parse it ... the class attribute is gone. – Ice Bear Nov 30 '20 at 05:04
  • as @Gealber said, you could use selenium to open it and then pass the code to beautifulsoup, also i think its not necessary to have headers=headers in line 8 – Timeler Nov 30 '20 at 07:26
  • Yuup I've already thought of that but to be honest I can't have another third party program in it since I'll be using chromedriver.exe, but thanks anyways! I'll look onto it if I really have to – Ice Bear Nov 30 '20 at 07:48
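
The diagnosis in the comments (classes added by JavaScript after the page loads) can be checked without a browser by listing the class attribute of each <p> tag in the raw response. This is only a minimal sketch: it uses the standard library's html.parser instead of BeautifulSoup to stay dependency-free, and `ClassCollector`, `p_classes_in`, and `static_html` are hypothetical names, with `static_html` standing in for `r.text` from the question.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the class attribute of every <p> start tag in raw HTML."""

    def __init__(self):
        super().__init__()
        self.p_classes = []  # one entry per <p>; None when it has no class

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "p":
            self.p_classes.append(dict(attrs).get("class"))


def p_classes_in(html):
    """Return the class value (or None) of each <p> tag, in document order."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.p_classes


# Hypothetical stand-in for r.text: the markup as sent by the server,
# before any script has had a chance to modify it.
static_html = "<div><p><i>czasownik</i></p><p class='note'>x</p></div>"
print(p_classes_in(static_html))
```

If the class seen in the browser's inspector never shows up in this list for the real response, it was most likely added client-side, and plain requests will not see it.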

2 Answers


As mentioned by @Gealber, I would also recommend Selenium to perform your task.

Example

from selenium import webdriver
from bs4 import BeautifulSoup

word = "opierać"
url = f"https://pl.wiktionary.org/wiki/{word}#{word}_(j%C4%99zyk_polski)"

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
driver.implicitly_wait(3) 

soup = BeautifulSoup(driver.page_source,"html5lib")
p = soup.find('p', class_='lang-pl fldt-znaczenia')
print(p)
driver.close()

The implicit wait is meant to give the website time to load fully; note that it only applies to subsequent element lookups (more about Selenium waits):

driver.implicitly_wait(3) 

Output

<p class="lang-pl fldt-znaczenia"><i class="lang-pl fldt-znaczenia">czasownik przechodni niedokonany</i> (<link class="lang-pl fldt-znaczenia" href="mw-data:TemplateStyles:r6240524" rel="mw-deduplicated-inline-style"/><span class="short-container lang-pl fldt-znaczenia"><a class="mw-redirect lang-pl fldt-znaczenia" href="/wiki/Aneks:Skr%C3%B3ty_u%C5%BCywane_w_Wikis%C5%82owniku#D" title="Aneks:Skróty używane w Wikisłowniku"><span class="short-wrapper lang-pl fldt-znaczenia" data-expanded="aspekt dokonany" title="aspekt dokonany"><span class="short-content lang-pl fldt-znaczenia">dk.</span></span></a></span> <a class="lang-pl fldt-znaczenia" href="/wiki/opra%C4%87#pl" title="oprać">oprać</a>)
</p>

You can also go for all of the p tags with that special class and loop through:

soup.find_all('p', class_='lang-pl fldt-znaczenia')
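
A minimal sketch of that loop, assuming the rendered page looks roughly like the output above. The `rendered_html` string is a made-up stand-in for `driver.page_source` (including the `other` class), and "html.parser" is used here only so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup standing in for driver.page_source.
rendered_html = """
<p class="lang-pl fldt-znaczenia"><i>czasownik przechodni niedokonany</i></p>
<p class="lang-pl other">not wanted</p>
<p class="lang-pl fldt-znaczenia"><i>czasownik zwrotny</i></p>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# A class_ string containing a space is compared against the exact value
# of the class attribute, so only the first and third <p> match here.
texts = [
    p.get_text(strip=True)
    for p in soup.find_all("p", class_="lang-pl fldt-znaczenia")
]

for text in texts:
    print(text)
```

Because the multi-class string must match the attribute value exactly, a tag whose classes appear in a different order would be missed; a CSS selector such as `soup.select("p.lang-pl.fldt-znaczenia")` does not depend on the order.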
HedgeHog
  • Thanks! actually my last choice to do this is using Selenium just like what you have done... but I'll still do more research and other ways I could think of... since I am trying to do this without any other 3rd party software. – Ice Bear Nov 30 '20 at 07:50
  • I'll check this once I decide to use Selenium. – Ice Bear Nov 30 '20 at 07:50
  • I like the answer of @QHarr! but for the question I am referring I think the best way is to have it on selenium.. Thanks a lot guys! – Ice Bear Dec 01 '20 at 03:22

You can select the p tag you want on the basis that it has a child i tag (requires bs4 4.7.1+) and is the first to match this pattern.

import requests
from bs4 import BeautifulSoup as bs

soup = bs(requests.get('https://pl.wiktionary.org/wiki/opiera%C4%87#opiera%C4%87(j%C4%99zyk_polski').content, 'lxml')
print(soup.select_one('p:has(i)').text)
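
To illustrate how `:has()` behaves, here is a small sketch on made-up markup (the `html` string is hypothetical, not the real Wiktionary page). It assumes bs4 4.7.1+ so that Soup Sieve provides the selector, and uses "html.parser" only to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> without <i>, one with a direct <i> child,
# and one where the <i> is nested deeper inside a <span>.
html = """
<div>
  <p>intro without italics</p>
  <p><i>czasownik</i> first match</p>
  <p><span><i>nested</i></span> second match</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# p:has(i) matches any <p> with an <i> anywhere inside it;
# select_one returns the first such <p> in document order.
first = soup.select_one("p:has(i)")
print(first.get_text(" ", strip=True))

# p:has(> i) is stricter: the <i> must be a direct child of the <p>.
any_depth = soup.select("p:has(i)")
direct_only = soup.select("p:has(> i)")
print(len(any_depth), len(direct_only))
```

Since `select_one` stops at the first match, this approach relies on the wanted paragraph being the first p containing an i tag on the page.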
QHarr
  • Wooow! Thank you! I did not know that this is possible! this is a good way to make my code precise too.... may I ask can I put anything I mean anything inside the `has`? for example for ` – Ice Bear Dec 01 '20 at 03:21
  • This one I consider to be a checked answer too! – Ice Bear Dec 01 '20 at 03:23
  • But there could only be one answer... but this I consider to be `another` alternative solution for my problem. Thanks! – Ice Bear Dec 01 '20 at 04:16
  • @QHarr Great solution for working without any `id`, `name` or `class` attributes - I'll also take a closer look. – HedgeHog Dec 01 '20 at 06:57
  • @StackOffended Yes you can put label inside. You can nest css selectors inside :has – QHarr Dec 01 '20 at 07:53
  • Thanks! So I can put any possible tag I want there? Great! – Ice Bear Dec 01 '20 at 09:25