I am very new to Python programming and I am trying to make a simple application.

What I'm trying to do is search for a text on Google and return the links, and my program does this fine. The other thing is that when Google shows a quick answer box, like in the photo below, I want to grab it too, and this is where my problem lies. I searched online and found very few topics on this, and none of their code works.

Google quick answer box: [screenshot]

By examining the source of many pages, I noticed that the answer is always in a class called _XWk, but when I get the page's code in Python and search for this class, it isn't found. I have tried many ways of scraping the page in Python, but the result never contains this class, and I think the HTML I get is less than what the browser shows me when I open the page source.

Class _XWk: [screenshot]

Code:

import requests, lxml
from bs4 import BeautifulSoup

url = 'https://www.google.com/search?q=when%20was%20trump%20born'
h = {"User-Agent":"Chrome/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

r = requests.get(url, headers=h).text
soup = BeautifulSoup(r,'lxml')

soup.find_all("div", class_="_xwk")
print (soup)

Any help is appreciated.

Alaa_kamal
  • Have you tried the answer on this similar question? https://stackoverflow.com/questions/31798009/is-there-an-api-for-the-google-answer-boxes – pastaleg Mar 14 '18 at 01:25

3 Answers

The line soup.find_all("div", class_="_xwk") has no effect in your code: find_all() returns a list of the tags that match the given parameters, so you need to save that result in a variable.

But since you need only one such tag, you can use find(), which returns the first matching tag.

Finally, to get the text inside a tag, use its .text attribute.

Also, the class name is case-sensitive: in the inspector it is _XWk, not _xwk. With these changes, the code becomes:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r = requests.get('https://www.google.com/search?q=when%20was%20trump%20born', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Use the exact class name (_XWk) and save the returned tag
result = soup.find('div', class_='_XWk')
print(result.text)
# 14 June 1946 (age 71)
Keyur Potdar
  • Thank you so much, it's working! But I have another problem now which I've been trying to solve for the past couple of hours: requests.get opens Google in the local language, so the result comes out in Arabic, and sometimes the box never appears at all, even though it appears just fine when I search in my English-language Chrome. The first line of the soup variable contains lang="ar", and I want to get rid of it. – Alaa_kamal Mar 14 '18 at 06:16
  • You need to add the `hl=en` param to your link: `https://www.google.com/search?q=when trump is born&hl=en` will give you what you want; see the sketch below. – Dmitriy Zub Apr 07 '21 at 12:58
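A minimal sketch of that suggestion, passing hl=en through the requests params argument (the query string and user-agent here are illustrative, not from the original comment):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
# hl=en asks Google for the English interface, so the answer box text comes back in English
params = {'q': 'when was trump born', 'hl': 'en'}
r = requests.get('https://www.google.com/search', params=params, headers=headers)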

The most common reason you don't see the same results as in your browser is that no user-agent is passed in the request headers. When none is specified, the requests library defaults to python-requests; Google recognizes this as a bot/script and blocks the request (or serves something different).

That's why you receive different HTML, with different CSS selectors, or some sort of error. Check what your user-agent is.
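One way to check is to ask an echo service what header actually goes out; a minimal sketch using the public httpbin.org service:

import requests

# httpbin.org/headers echoes back the headers it received from us
r = requests.get('https://httpbin.org/headers')
print(r.json()['headers']['User-Agent'])
# With no headers argument, this prints something like: python-requests/2.25.1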


If you only need one specific element, you can use SelectorGadget to find its CSS selector and select_one() to extract it:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=when trump is born', headers=headers).text

soup = BeautifulSoup(html, 'lxml')

# CSS selectors found with SelectorGadget; Google changes these class names often
born = soup.select_one('.XcVN5d').text
age = soup.select_one('.kZ91ed').text

print(born, age, sep='\n')

Output:

June 14, 1946
age 74 years

Alternatively, you can do the same thing using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.

The biggest difference is that you only need to focus on the data you want to extract, rather than figuring out what selectors to use and how to bypass blocks from Google or other search engines.

Code to integrate:

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "when trump is born",
  "google_domain": "google.com",
}

search = GoogleSearch(params)
results = search.get_dict()

date_born = results['answer_box']['answer']
print(date_born)

Output:

June 14, 1946

Disclaimer: I work for SerpApi.

Dmitriy Zub

SerpApi does not yet support knowledge graph direct answers, but in your case you can use the knowledge graph directly:

$ curl "https://serpapi.com/search.json?q=When+trump+was+born"
...
  "knowledge_graph": {
    "title": "Donald Trump",
    "description": "Donald John Trump is the 45th and current President of the United States. Before entering politics, he was a businessman and television personality.\nTrump was born and raised in the New York City borough of Queens.",
    "source": {
      "name": "Wikipedia",
      "link": "https://en.wikipedia.org/wiki/Donald_Trump"
    },
    "born": "June 14, 1946 (age 72 years), Jamaica Hospital Medical Center, New York City, NY",
    "height": "6′ 3″ Trending",
    "full_name": "Donald John Trump",
    "net_worth": "3.1 billion USD (2019)",
    "parents": "Fred Trump, Mary Anne MacLeod Trump",
    "education": "Fordham University (1964–1966), New York Military Academy (1964), The Kew-Forest School"
  },
...
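For reference, a minimal sketch of reading the born field from that JSON in Python (YOUR_API_KEY is a placeholder; serpapi.com may require an api_key parameter, which the curl example above omits):

import requests

# Query the same endpoint as the curl example and pull one field out of the JSON
params = {'q': 'When trump was born', 'api_key': 'YOUR_API_KEY'}
r = requests.get('https://serpapi.com/search.json', params=params)
print(r.json()['knowledge_graph']['born'])
# June 14, 1946 (age 72 years), Jamaica Hospital Medical Center, New York City, NY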
Hartator
  • If you want to suggest moving to an API, a commercial one by the way, you should also detail why the user should consider it, what its potential benefits are, and why they outweigh alternative implementations. Not everyone is willing to pay for an API just to extract this information. – Marc Sances Oct 15 '20 at 06:29