
I have this function which is meant to return how many search results were found for a specific word. It was working at one point, however now it only ever returns a None value. Wondering if anybody has some insight into this issue? Edit: sorry, the url is set to "https://www.google.com/search?q="

import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q="

def pyGoogleSearch(userInput):  # returns the total-results text for a search term
    newWord = url + userInput  # combine the base url and the search term into one string
    page = requests.get(newWord)  # fetch the Google results page
    soup = BeautifulSoup(page.content, 'lxml')  # create a soup object which parses the HTML
    search = soup.find('div', id="resultStats").text  # grab the result-count text
    [int(s) for s in search.split() if s.isdigit()]  # extract the digit tokens (note: this result is never assigned)
    print(search)  # debug
    return search
Skummm

2 Answers


As others have mentioned in the comments, we don't know what your url is set to, and it's likely that it's either not set or set to the wrong URL.

If you are looking to query sites such as Wikipedia, then the solution below is a much simpler approach. It takes the URL and appends the search word to the request. Once the response is fetched and decoded, we can iterate through it and count how many times the word occurs. You can modify this and apply it to your problem.

import urllib.request

def getTopicCount(topic):
    url = "https://en.wikipedia.org/w/api.php?action=parse&section=0&prop=text&format=json&page="
    contents = urllib.request.urlopen(url + topic).read().decode('utf-8')
    count = 0
    pos = contents.find(topic)  # position of the first occurrence; -1 if the word isn't there
    while pos != -1:  # find() returns -1 when there are no more matches
        count += 1
        pos = contents.find(topic, pos + 1)  # continue searching the returned JSON from the next position
    return count


print(getTopicCount("pizza"))  # prints 146
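Note that the find() loop above counts overlapping occurrences because it resumes searching at pos + 1; for ordinary words, the built-in str.count, which counts non-overlapping matches, gives the same answer with less code. A minimal sketch (the sample strings here are made up for illustration):

```python
def count_word(text, word):
    # str.count counts non-overlapping occurrences; for normal words this
    # matches the manual find() loop above
    return text.count(word)

print(count_word("pizza with extra pizza", "pizza"))  # prints 2
print(count_word("aaaa", "aa"))  # prints 2 (the find() loop would report 3)
```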
AzyCrw4282
  • From your updated question, this demonstrates roughly what you want to achieve https://stackoverflow.com/questions/29377504/perform-a-google-search-and-return-the-number-of-results – AzyCrw4282 Feb 14 '20 at 02:13

It's because you haven't specified a user-agent in the HTTP request headers. Learn more about the user-agent header and request headers in general.

Basically, the user-agent string identifies the browser, its version number, and its host operating system. It represents the person (browser) making the request, and servers and network peers use it to tell whether a request comes from a bot. Check what your user-agent is.

Pass user-agent into request headers:

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)

Use select_one() instead. CSS selectors are more readable and a bit faster. CSS selectors reference.

soup.select_one('#result-stats nobr').previous_sibling
# About 107,000 results
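If you want the count as an integer rather than the raw text, one option is to pull out the number and strip the commas. A small sketch (the parse_result_count helper is made up for illustration and assumes text like "About 107,000 results"):

```python
import re

def parse_result_count(stats_text):
    # pull the first comma-grouped number out of e.g. "About 107,000 results"
    match = re.search(r"[\d,]+", stats_text)
    return int(match.group().replace(",", "")) if match else None

print(parse_result_count("About 107,000 results"))  # prints 107000
```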

Code and example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah definition",  # query
  "gl": "us",                    # country 
  "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params)
soup = BeautifulSoup(response.text, 'lxml')

# .previous_sibling moves to the previous sibling node, dropping the unwanted "(0.38 seconds)" part
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 107,000 results

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference in your case is that you don't have to figure out which selectors to use or maintain the parser over time, since that's already done for the end user. The only thing that needs to be done is to get the data you want from the structured JSON.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah defenition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]["total_results"]
print(result)

# 107000

P.S. I wrote a blog post about how to scrape Google Organic Results.

Disclaimer: I work for SerpApi.

Dmitriy Zub