-2

I am new to web scraping. I want to scrape the data (comments and their dates) from this web page: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938 The topic is split across several pages. This is what I have so far:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

SEARCH_URL = ('https://search.donanimhaber.com/?q=vodafone&p={page_no}&token=-1'
              '&hash=56BB9D1746DBCDA94D0B1E5825EFF47D&order=date&in=all&type=both&scope=all&range=all')
TOPIC_URL = 'https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()

    # Collect the result links from the first search page.
    page.goto(SEARCH_URL.format(page_no=1), timeout=0)
    soup = BeautifulSoup(page.inner_html('div.results'), 'html.parser')
    links = [a.get('href') for a in soup.find_all('a')]

    # Only the first link is processed for now; the topic URL is hardcoded for testing.
    for href in links[:1]:
        if href and href.startswith('/'):
            # page.goto('https://search.donanimhaber.com' + href, timeout=0)
            page.goto(TOPIC_URL)
            soup = BeautifulSoup(page.inner_html('div.kl-icerik'), 'html.parser')

            for reply in soup.find_all('div', {'class': 'ki-cevapicerigi'}):
                # Info line of the reply (this is where the date is printed from).
                for info in reply.find_all('span', {'class': 'mButon info'}):
                    print(info.text)

                # Text of the reply.
                for msg in reply.find_all('span', {'class': 'msg'}):
                    for cell in msg.find_all('td'):
                        print(cell.text)
                    for para in msg.find_all('p'):
                        print(para.text)

    browser.close()

This code works only for the first page: it prints all the comments and their dates. It does not get pages 2, 3, 4, ..., which appear as I scroll to the bottom.

How can I do that? Thank you.

Samuel Liew

2 Answers

0

In your case, each page has its own link: the base link followed by a hyphen (-) and the page number.

You can see this by clicking through to the second page and comparing the base link with the link you get: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938-2

(notice the -2 at the end)

One way to do it is to change the URL in a for loop, iterating up to page 24 and scraping each of those pages individually.
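
A minimal sketch of that loop, assuming page 1 is the bare topic link and every later page appends `-<number>` as observed above. `page_urls` is a hypothetical helper name, not part of Playwright or any other library:

```python
BASE_URL = (
    "https://forum.donanimhaber.com/"
    "turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938"
)


def page_urls(base_url, max_page):
    """Yield the URL of every page of a topic, from page 1 up to max_page."""
    for number in range(1, max_page + 1):
        # Assumption: page 1 is the base link itself; later pages get "-<number>".
        yield base_url if number == 1 else f"{base_url}-{number}"


if __name__ == "__main__":
    for url in page_urls(BASE_URL, 24):
        # In the scraper, call page.goto(url) here and reuse the parsing code.
        print(url)
```

Keeping the URL building in its own function means the parsing code from the question can stay unchanged and simply run once per yielded URL.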

Dugnom
  • The problem with this approach is that this is just one topic. What I actually want to scrape is https://search.donanimhaber.com/?q=vodafone&p=1&token=-1&hash=56BB9D1746DBCDA94D0B1E5825EFF47D&order=date&in=all&type=both&scope=all&range=all and the page above is only one of its many results. To use your approach I need a way to get the number of pages each topic has (like the 1 to 24 you mentioned). How can I get that? –  Mar 25 '22 at 06:34
  • How did you find out that it has 24 pages? Is that in the HTML of the first page? –  Mar 25 '22 at 06:38
  • What happens if you go to a higher page number than the last one? You get redirected to the highest one that exists. This gives me three ideas: 1. You could enter a really high number every time and check the page number you get redirected to. 2. You could increase the page number until your output repeats. 3. You could increase the page number until the page you get redirected to no longer matches the page you tried to open. – Dugnom Mar 25 '22 at 06:45
  • I searched the page source for 24 and found ' data-maxpage="24" ' in the HTML; you could probably read and use that as well. – Dugnom Mar 25 '22 at 06:49
  • Where is it? How did you search for it? Can you share an image? –  Mar 25 '22 at 07:01
  • You can open the page source by pressing Ctrl+U or by right-clicking the page and choosing to view the source. Then press Ctrl+F and search for data-maxpage. – Dugnom Mar 25 '22 at 07:04
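
The `data-maxpage` idea from the comments could be sketched roughly like this. It is only an illustration: the attribute name comes from the comment above, which element carries it is not known here, so the sketch searches the whole page source with a regular expression. `find_max_page` is a hypothetical helper name.

```python
import re


def find_max_page(html, default=1):
    """Return the first data-maxpage value found in html, or default if absent."""
    # Assumption: the attribute is written as data-maxpage="<digits>".
    match = re.search(r'data-maxpage="(\d+)"', html)
    return int(match.group(1)) if match else default


# Stand-in HTML for demonstration; the real markup around the attribute may differ.
sample = '<div data-maxpage="24"></div>'
print(find_max_page(sample))
```

With Playwright, the string passed in could be the result of `page.content()` after opening the first page of a topic.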
-1

There is an easier way. You do not need to open a browser at all; just send a simple POST request to this website's API.
The API request URI is: https://search.donanimhaber.com/api/search/messages/?q=vodafone&p=1&order=date&in=all&type=both&scope=all&daterange=all
You can change these parameters:
q = the search term
p = the page number
With Playwright 1.20.0 you can also call the API directly:
https://playwright.dev/python/docs/api/class-apiresponse
It returns a JSON response like this:

    {
        "forumId": 600,
        "id": 152365149,
        "topicId": 102657976,
        "newsId": 0,
        "dateCreated": "2022-03-25T22:51:42.407304+03:00",
        "dateString": "3 dakika önce",
        "body": "Yenilenen 9 GB var o da 44 den 58 olmuş. Son durumda hangi tarifler var fiyatları neler güncel bir tablo olsa içinden seçsek güzel olur",
        "forumTitle": "Cep Telefonu ve Operatörler",
        "subject": "<span class='highlight'>Vodafone</span>dan Gizli Tarifeler! (İlle de <span class='highlight'>Vodafone</span> kullanacağım diyenlere.)",
        "imageUrl": null,
        "subResults": [
            {
                "forumId": 0,
                "id": 152364084,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T21:03:18.1226493+03:00",
                "dateString": "1 saat önce",
                "body": "Benim 26 liralık saçma güzel 2+ tarifem 37 lira olmuş.Daha düşük fiyata bir şeyler var mı ? 1 gb int...",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152362724,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T18:17:14.8390711+03:00",
                "dateString": "4 saat önce",
                "body": "Allah aşkına daha yeni geçtik bi dur be..",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152362447,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T17:35:51.9730334+03:00",
                "dateString": "5 saat önce",
                "body": "Olay 5 gb da 41 lira olacakmış yuh artık .",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152360755,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T13:49:20.9403583+03:00",
                "dateString": "9 saat önce",
                "body": "demin mesaj geldi olay15gb 65 tl olarak güncellenecektir. son 3 ayda rahat 25 tl zam geldi tarifeme ...",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152360644,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T13:31:47.8799036+03:00",
                "dateString": "9 saat önce",
                "body": "Kazançlı 7 GB tarifesinin fiyatı 31.03'ten sonra 55 TL olacakmış. Bu nasıl zam, yazıklar olsun.",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            },
            {
                "forumId": 0,
                "id": 152359613,
                "topicId": 0,
                "newsId": 0,
                "dateCreated": "2022-03-25T11:18:20.4112563+03:00",
                "dateString": "11 saat önce",
                "body": "Kolay gelsin.\nTaahhütümün bitmesine 20 gün kala dijital asistanın bana önerdiği tarifeye geçiş yaptı...",
                "forumTitle": null,
                "subject": null,
                "imageUrl": null,
                "subResults": null
            }
        ]
    },
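
As a rough sketch of how this could be used, the snippet below builds the request URL for a given search term and page, and walks a list of result objects shaped like the one above. It makes no network call. Two things are assumptions: that the results are available as a list of such objects (the excerpt does not show the enclosing structure), and the helper names `build_search_url` and `iter_messages`, which are made up for illustration.

```python
from urllib.parse import urlencode

API_URL = "https://search.donanimhaber.com/api/search/messages/"


def build_search_url(query, page=1):
    """Build the search API URL for one page of results."""
    params = {
        "q": query, "p": page, "order": "date", "in": "all",
        "type": "both", "scope": "all", "daterange": "all",
    }
    return f"{API_URL}?{urlencode(params)}"


def iter_messages(results):
    """Yield (dateCreated, body) for each result and for each of its subResults."""
    for item in results:
        yield item["dateCreated"], item["body"]
        # "subResults" is null (None in Python) on the nested entries.
        yield from iter_messages(item.get("subResults") or [])
```

The HTTP call itself is left out on purpose; the JSON could be fetched with any HTTP client or through Playwright's API request support mentioned above, then passed to `iter_messages`.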
MeT