Is there any way I can scrape certain links from Google results that contain specific words in the link, using BeautifulSoup or Selenium?

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL) 

soup = BeautifulSoup(r.content, 'html5lib') 

I want to extract the links that point to Facebook groups.

Mayank
  • I am not exactly sure what you want to achieve: you can find all links by using soup.findAll('a') and get the link text by using .text. What is the 'group link' you mention? – Gregor Jan 22 '19 at 08:58
  • I am trying to extract Facebook group links from Google search results. – Mayank Jan 22 '19 at 09:00
  • 2
    As it is not allowed to scrape the results I won't help you any further. You can try using what I explained in theory in the last comment. Maybe this will help you: https://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results – Gregor Jan 22 '19 at 09:16

1 Answer

Not sure exactly what you want to do, but if you want to extract Facebook links from the returned content, you can just check whether facebook.com appears in each URL:

import requests 
from bs4 import BeautifulSoup 
import csv 
URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups" 
r = requests.get(URL) 
soup = BeautifulSoup(r.text, 'html5lib')
for link in soup.findAll('a', href=True): 
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))

Update: There is a workaround. What you need to do is set a legitimate user agent, i.e. add headers so the request emulates a browser:

# This is a standard user-agent of Chrome browser running on Windows 10
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

Example:

from bs4 import BeautifulSoup 
import requests 
URL = 'https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get(URL, headers=headers).text 
soup = BeautifulSoup(resp, 'html.parser')
for link in soup.findAll('a', href=True): 
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))

Additionally, you can send a fuller set of headers so the request looks even more like a real browser:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-US,en;q=0.5',
    'Accept-Encoding' : 'gzip',
    'DNT' : '1', # Do Not Track Request Header
    'Connection' : 'close'
}
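For the asker's specific goal (group links rather than all Facebook links), the filtering step can be sketched like this. The HTML snippet below is hard-coded stand-in data, not a real Google response, so the example runs offline and the URLs in it are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hard-coded HTML standing in for a Google results page,
# so this example runs without a network request.
html = """
<div>
  <a href="https://www.facebook.com/groups/12345/">Friends group</a>
  <a href="https://www.facebook.com/some.profile">A profile</a>
  <a href="https://example.com/page">Unrelated link</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Keep only hrefs that contain 'facebook.com/groups',
# which narrows the match from any Facebook link to group links only.
group_links = [a['href'] for a in soup.find_all('a', href=True)
               if 'facebook.com/groups' in a['href']]

print(group_links)
# → ['https://www.facebook.com/groups/12345/']
```

With a live request, the same filter would replace the plain `'facebook.com' in link.get('href')` check in the loops above.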
0xInfection
  • print link.get("href") ^ SyntaxError: invalid syntax – Mayank Jan 22 '19 at 13:25
  • @Mayank, I am not near to my laptop, I just wrote out the raw code without testing it. I'll check out the answer and error as soon as I get access to it. Meanwhile if anyone can point out where I am doing wrong, it would be good. – 0xInfection Jan 22 '19 at 13:28
  • It is not giving any error but returning an empty result. – Mayank Jan 22 '19 at 13:30
  • 1
    It is returning null because your IP is blocked (see the link in my comment on your initial post). You could try for yourself: get a new IP address, run your request and print the result or soup object. After a couple of requests Google will respond with a page that displays nothing but their terms. – Gregor Jan 22 '19 at 16:47
  • @Gregor I found a workaround for it. The solution is to pretend to be a legitimate browser. – 0xInfection Jan 23 '19 at 06:21
  • Is there any way we can access the Google result description? And yes, your answer worked. – Mayank Jan 23 '19 at 08:56