I want to get the number of indexed pages for certain domains. To do that, I want to run a "site:" query and extract the number of results from the search results page.

I tried it with a Google Apps Script for Google Sheets:

function sampleFormula_4() {
  const url = "https://www.google.com/search?q=site%3Abenedikt-sahlmueller.de";

  // Fetch the results page and extract the text of the "result-stats" element.
  const fetchResultStats = () => {
    const html = UrlFetchApp.fetch(url).getContentText();
    return html.match(/<div id="result-stats">(.+?)nobr>/)[1].trim();
  };

  try {
    return fetchResultStats();
  } catch (e) {
    // Wait 5 seconds and retry once if the first attempt fails.
    Utilities.sleep(5000);
    return fetchResultStats();
  }
}

Google Sheets gives me an error 429 (Too Many Requests). I added a sleep of 5000 ms, but Google Search still returns error 429.

All I need is the number of pages for certain URLs in Google's search results. Maybe there is a better way. I can't use the Search API for this, as those pages are not part of my Google Search Console (GSC) properties.

Vin
s.Panse

1 Answer

Most likely, Google Search treats requests coming from UrlFetch as automated traffic and blocks them. From the official docs:

What Google considers automated traffic

  • Sending searches from a robot, computer program, automated service, or search scraper

The same behaviour happens when using tools like wget or curl, for example.

Using the Custom Search JSON API instead is recommended.
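
As a rough illustration of that recommendation, the sketch below builds a "site:" request for the Custom Search JSON API and fetches it from Apps Script. `API_KEY`, `SEARCH_ENGINE_ID`, and both function names are hypothetical placeholders, not part of the original post, and the code assumes a Programmable Search Engine has already been created. `UrlFetchApp` only exists inside Apps Script, so only the URL builder runs elsewhere.

```javascript
// Hypothetical placeholders: replace with your own API key and search engine ID.
const API_KEY = "YOUR_API_KEY";
const SEARCH_ENGINE_ID = "YOUR_SEARCH_ENGINE_ID";

// Build the request URL for a "site:" query against the Custom Search JSON API.
function buildSiteQueryUrl(domain, apiKey, cx) {
  const params = {
    key: apiKey,
    cx: cx,
    q: "site:" + domain,
    num: 1, // only the result count is of interest, so request a single item
  };
  const query = Object.keys(params)
    .map((name) => encodeURIComponent(name) + "=" + encodeURIComponent(params[name]))
    .join("&");
  return "https://www.googleapis.com/customsearch/v1?" + query;
}

// Fetch the API response and return it as a parsed object (Apps Script only).
function fetchSiteQuery(domain) {
  const url = buildSiteQueryUrl(domain, API_KEY, SEARCH_ENGINE_ID);
  const response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
  if (response.getResponseCode() !== 200) {
    throw new Error("Custom Search request failed with HTTP " + response.getResponseCode());
  }
  return JSON.parse(response.getContentText());
}
```

Setting `muteHttpExceptions` lets the script report the HTTP status itself instead of failing inside `fetch`. The result count in the response is an estimate and may differ from what the Google Search results page shows.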

Iamblichus
  • Thanks for your reply @Iamblichus. Do you know if there is a way I can use the Search API to access data for domains I don't "own" as a property in GSC? – s.Panse Mar 02 '21 at 16:26
  • @s.Panse I don't think you need to "own" a domain for the API I mentioned. Just follow the steps mentioned in [this answer](https://stackoverflow.com/a/30041104/) and, once you have the search engine ID, use it to call [cse.list](https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list) or [cse.siterestrict.list](https://developers.google.com/custom-search/v1/reference/rest/v1/cse.siterestrict/list), setting `cx` to this ID. – Iamblichus Mar 03 '21 at 11:25
  • thank you very much, I'll give it a try asap! – s.Panse Mar 03 '21 at 20:06
  • Hi @Iamblichus! I created a Google Apps Script project and a new Programmable Search Engine, and added the ID and API key to the script. Then I called the function from Google Sheets (=functionname(query)), but I don't get any results back in the spreadsheet. There is console output when I run the script directly, and it looks okay to me. Can you give me a hint on how I could use the output in the spreadsheet? – s.Panse Mar 06 '21 at 17:13
  • Okay, I used `return (JSON.stringify(result));` to get some data into the spreadsheet. As far as I understand, I can't use "site:" in CSE to obtain the number of pages for a certain domain or host? – s.Panse Mar 07 '21 at 16:37
  • I think I got it: `aresult = JSON.stringify(result).match(/"totalResults":"(.+?)"/)[1].trim();` gives me the number of search results from CSE. I also had to update CSE settings in order to get the index-coverage from "site:": "All languages", "all countries", "search the whole web". – s.Panse Mar 07 '21 at 21:55
  • Hi @Iamblichus! Do you by any chance have an idea how I could save some of my budget for API calls? Right now I have to fetch a lot of data for every domain I enter, but all I need is the number of page results from a "page:" query. – s.Panse Mar 16 '21 at 18:47
  • @s.Panse Can you provide the code related to the request you are making? Also, I'd suggest posting a new question for this, in order to give it more visibility. – Iamblichus Mar 17 '21 at 08:35
  • Hi @Iamblichus, thanks again for your help! I opened a new question: https://stackoverflow.com/q/66709930/. Let's see what happens. Have a nice weekend! – s.Panse Mar 19 '21 at 14:25
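
The comments above extract `totalResults` by running a regular expression over `JSON.stringify(result)`. A possible alternative, sketched here, reads the field from the parsed object directly. The helper name `getTotalResults` is hypothetical, and the sketch assumes the response follows the documented Custom Search JSON API shape, where `searchInformation.totalResults` is a string.

```javascript
// Read the estimated result count from a Custom Search JSON API response.
// Accepts either the parsed response object or its JSON text.
function getTotalResults(result) {
  const data = typeof result === "string" ? JSON.parse(result) : result;
  const info = data && data.searchInformation;
  if (!info || info.totalResults === undefined) {
    return null; // field missing, e.g. an error payload
  }
  // The API reports the count as a string, so convert it to a number.
  return Number(info.totalResults);
}
```

Returning a number rather than a JSON string means a custom function in the spreadsheet yields a value that can be used directly in formulas.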