
I have a list of about 300 URLs from a website that I want to scrape once a day.

So far I get blocked roughly 5 minutes after the first page is scraped.

How can I clear cookies and temporary Internet files each time I navigate to a URL?

Is there a way to simulate more human-like behavior?

With ie
    Set wks = wb.Sheets("TESTINGS")

    For j = 22 To 45
        'highlight the row currently being processed
        wks.Range(wks.Cells(j, 1), wks.Cells(j, 5)).Interior.ColorIndex = 38
        counter = counter + 1

        mylink1 = wks.Cells(j, 2).Value

        .Visible = True
        .Navigate mylink1

        While .Busy Or .ReadyState < 4: DoEvents: Wend

        Set products = .Document.querySelectorAll(".card.js-product-card")
        t = Timer

        'scroll until every product card shows a final price, or give up after MAX_WAIT_SEC
        Do
            DoEvents
            .Document.parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
            Set finalPrices = .Document.querySelectorAll(".card.js-product-card span.final-price")
            Application.Wait Now + TimeSerial(0, 0, 3)
            If Timer - t > MAX_WAIT_SEC Then Exit Do
        Loop Until finalPrices.Length = products.Length

        Set sellers = .Document.querySelectorAll(".card.js-product-card .shop.cf a[title]")
        Set availability = .Document.querySelectorAll(".card.js-product-card span.availability")
        Set pname = .Document.querySelectorAll(".location-tab")

        With ThisWorkbook.Worksheets("TESTINGS")
            For i = 0 To sellers.Length - 1
                'Greek text: "Immediate pickup / Delivery in 1 to 3 days"
                If availability.Item(i).innerText = "Άμεση παραλαβή / Παράδοση 1 έως 3 ημέρες" Then
                    .Cells(j, 4) = sellers.Item(i).innerText
                    .Cells(j, 5) = finalPrices.Item(i).innerText
                    .Cells(j, 6) = availability.Item(i).innerText
                    .Cells(j, 7) = pname.Item(i).innerText
                    Exit For
                End If
            Next

            'MsgBox ("Link Number " & j & "They left " & MaxNumber - counter)
            Application.StatusBar = "Current row " & j & " Progress: " & counter & " of " & MaxNumber & " " & Format(j / MaxNumber, "0%")
        End With

        'clear the highlight once the row is done
        wks.Range(wks.Cells(j, 1), wks.Cells(j, 5)).Interior.ColorIndex = 0
    Next

    Call TransferDataFromColumnE17(ThisWorkbook.Worksheets("TESTINGS"))
    .Quit
End With
Set ie = Nothing
pgSystemTester
Maria Georgali

2 Answers


Edit: I just noticed that this is an older post, but better late than never I suppose.

If you want to stick with VBA, you can call the sub below:

Public Sub Clear_Browser()
    'clears temporary Internet files
    Shell "RunDll32.exe InetCpl.cpl,ClearMyTracksByProcess 8"
    'deletes everything (all categories at once)
    Shell "RunDll32.exe InetCpl.cpl,ClearMyTracksByProcess 255"
    'deletes cookies
    Shell "RunDll32.exe InetCpl.cpl,ClearMyTracksByProcess 2"
    'deletes form data
    Shell "RunDll32.exe InetCpl.cpl,ClearMyTracksByProcess 16"
    'deletes saved passwords
    Shell "RunDll32.exe InetCpl.cpl,ClearMyTracksByProcess 32"
    'deletes browsing history
    Shell "RunDll32.exe InetCpl.cpl,ClearMyTracksByProcess 1"
End Sub

You can call it inside your For loop, for example once every 30 iterations of j, like this:

  If j Mod 30 = 0 Then
         Clear_Browser
  End If

However, automating IE is really, really slow, so I would recommend an adaptation of what I did in this post, where I used MSXML2.ServerXMLHTTP.6.0. This approach is very fast and is essentially an invisible browser that holds no cache, history, or cookies: it sends an HTTP request to the server and receives the desired data in the response.
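For illustration, here is a minimal sketch of that approach (this is not the exact code from the linked post; the URL and the elements being pulled out are placeholders to be replaced with your own, e.g. the product-card selectors from the question):

Public Sub GetPageWithXHR()
    Dim xhr As Object, html As Object, anchors As Object, i As Long
    Dim url As String

    url = "https://www.example.com/some-product-page"    'placeholder URL

    Set xhr = CreateObject("MSXML2.ServerXMLHTTP.6.0")
    xhr.Open "GET", url, False                            'synchronous GET request
    xhr.setRequestHeader "User-Agent", "Mozilla/5.0"      'bare requests are sometimes rejected
    xhr.send

    If xhr.Status = 200 Then
        Set html = CreateObject("htmlfile")               'in-memory HTML document
        html.body.innerHTML = xhr.responseText

        'htmlfile runs in an old document mode, so stick to the getElementsBy* methods
        Set anchors = html.getElementsByTagName("a")
        For i = 0 To anchors.Length - 1
            Debug.Print anchors(i).innerText
        Next i
    Else
        Debug.Print "Request failed with status " & xhr.Status
    End If
End Sub

Because nothing is rendered and no browser profile is kept, there is no cache or cookie jar to clear between requests.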

If you need something really fast, though, and all-around better, forget VBA and use Python with the BeautifulSoup library, as suggested by @ryguy72.

rickmanalexander

I don't see any particular URL in your post, but I suspect the site detects you as a bot and blocks you. I don't think clearing cookies or the cache will do anything at all. Try doing it in Python.

from bs4 import BeautifulSoup
from urllib.request import urlopen   # Python 3 replacement for the old urllib2

url = "https://www.pythonforbeginners.com"
content = urlopen(url).read()                  # fetch the raw HTML
soup = BeautifulSoup(content, "html.parser")   # parse it
for link in soup.find_all('a'):                # print every hyperlink on the page
    print(link.get('href'))

Export the results to Excel and you will have everything you need. It will run MUCH faster and be MUCH more stable as well.

ASH
  • What if there is no Python interpreter installed and the user lacks admin rights? What if there is a group policy whitelisting only a predefined list of .exes and Python isn't among them? – Feb 15 '19 at 15:14