
How can I extract a featured snippet from a Google search results page?

  • Do you mean Custom Search? Web search doesn't have an API anymore, and extracting content automatically is forbidden by their terms... "You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)..." – samiles May 15 '17 at 10:35
  • Yes, I meant to the Custom Search. – Yifat Biezuner May 15 '17 at 12:59
  • In that case, yes, the XML result returned by Custom Search can include all metadata you want. [The full docs are here.](https://developers.google.com/custom-search/docs/snippets) - You essentially need to define your own response format for Google to start returning what you need. [This page of the docs](https://developers.google.com/custom-search/docs/structured_data#viewing-extracted-structured-data) explains how to add data and test it. – samiles May 15 '17 at 13:35
  • I tried with cx value of "thing" and it gave me irrelevant results. Do you have recommended entity types to use in the generated cx? – Yifat Biezuner May 15 '17 at 13:48
  • @YifatBiezuner got anything on this? – Arpit Suthar Dec 26 '17 at 08:34
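
For the Custom Search route discussed in these comments, a minimal sketch of querying the Custom Search JSON API with `requests` might look like this. The `API_KEY` and `CX` values are placeholders you would replace with your own credentials:

```python
import requests

# Placeholder credentials -- substitute your own API key and search engine ID (cx)
API_KEY = "YOUR_API_KEY"
CX = "YOUR_CX"

def build_cse_url(query: str, api_key: str = API_KEY, cx: str = CX) -> str:
    """Build a request URL for the Custom Search JSON API."""
    req = requests.Request(
        "GET",
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": cx, "q": query},
    )
    return req.prepare().url

# Uncomment to run a real search (requires valid credentials):
# items = requests.get(build_cse_url("python")).json().get("items", [])
# snippets = [item.get("snippet") for item in items]
```

Each item in the returned `items` list carries a `snippet` field, which is the closest equivalent to the result snippet shown on a results page.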

1 Answer


If you want to scrape Google search result snippets, you can use the BeautifulSoup web scraping library. However, with this approach, problems can arise once you make a lot of requests, because Google may start blocking them.

You can try to mitigate the blocking issue by adding headers that specify a `User-Agent`. This makes the request look like it comes from a real browser rather than a bot, so Google is less likely to block it:

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

An additional step could be to rotate user-agents.
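
As a sketch, rotation can be as simple as picking a random `User-Agent` from a small pool before each request (the strings below are just illustrative examples; you would maintain your own list):

```python
import random

# Illustrative pool of desktop user-agent strings (swap in your own)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]

def random_headers() -> dict:
    """Return a headers dict with a freshly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

You would then pass `headers=random_headers()` to each `requests.get()` call instead of a fixed `headers` dictionary.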

The code example below uses pagination to collect results from every page, via an infinite while loop. Pagination continues as long as the "next" button exists on the page, which is detected by the presence of the CSS selector ".d6cvqb a[id=pnnext]". If it is present, increase the value of params["start"] by 10 to access the next page; otherwise, exit the while loop:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break

Check the full code in an online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python",       # query example
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # number page by default up to 0
    #"num": 100          # parameter defines the maximum number of results to return.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")
        
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf").text
        except AttributeError:  # this result has no snippet element
            snippet = None

        website_data.append({
            "title": title,
            "snippet": snippet
        })
      
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Welcome to Python.org",
    "snippet": "The official home of the Python Programming Language."
  },
  {
    "title": "Python (programming language) - Wikipedia",
    "snippet": "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
  },
  {
    "title": "Python Courses & Tutorials - Codecademy",
    "snippet": "Python is a general-purpose, versatile, and powerful programming language. It's a great first language because Python code is concise and easy to read."
  },
  {
    "title": "Python - GitHub",
    "snippet": "Repositories related to the Python Programming language - Python. ... Collection of library stubs for Python, with static types. Python 3.3k 1.4k."
  },
  {
    "title": "Learn Python - Free Interactive Python Tutorial",
    "snippet": "learnpython.org is a free interactive Python tutorial for people who want to learn Python, fast."
  },
  # ...
]

You can also use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses Google's blocks (including CAPTCHA), so there is no need to create and maintain your own parser.

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
  "api_key": os.getenv("API_KEY"), # serpapi key
  "engine": "google",              # serpapi parser engine
  "q": "python",                   # search query
  "num": "100"                     # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []

while True:
    results = search.get_dict()    # JSON -> Python dictionary
    
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")   
        })
    
    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

The output is the same as in the BeautifulSoup example above.

Denis Skopa