I want to write a web scraper that automatically fetches the translation of a word from Google Translate, run from the console with a Python script.

I noticed that the HTML I get from the Python Requests module is very different from what the website shows in the browser.

After some research on this subject, I learned from this answer that Google has some security features that prevent my scripts from accessing its HTML content.

But I have a Chrome extension, ImTranslator, which can give me the translation of whatever word I select on a web page, directly from Google Translate.

So how can this extension do that? Why can't I have a script that does the same for me?

I also tried making the request with urllib and sending headers along with it.

This is my code:

First, using the urllib module:

import urllib.request

# Browser-like headers sent along with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
# Everything after '#' is a URL fragment; it is not sent to the server
url = 'https://translate.google.com/#view=home&op=translate&sl=en&tl=fa&text=space'
req = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(req)

Second, using Requests:

import requests

url = 'https://translate.google.com/#view=home&op=translate&sl=en&tl=fa&text=space'
res = requests.get(url)

1 Answer

You can use Python's googletrans module to translate text through a free API.

But if you want to scrape Google Translate itself, you can do the following:

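import requests_html
from bs4 import BeautifulSoup as BS
from urllib.parse import quote

url = "https://translate.google.com/#view=home&op=translate&sl=en&tl=hy&text={}"

text = input("text: ")

with requests_html.HTMLSession() as session:
    # Percent-encode the input so spaces and symbols do not break the URL
    response = session.get(url.format(quote(text)))
    # Run the page's JavaScript in headless Chromium so the translation is filled in
    response.html.render()
    content = response.html.html
    soup = BS(content, "html.parser")
    # The second positional argument of find() filters by CSS class
    translation = soup.find("span", "translation").text
    print(translation)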
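
To see why the `render()` step matters, it may help to look at the URL itself. The sketch below uses only the standard library and the URL from the question. It assumes, as the `#` in the URL suggests, that the page's own JavaScript reads the translation settings from the fragment. HTTP clients such as Requests and urllib do not send the fragment to the server, so a plain request only ever asks for the bare home page.

```python
from urllib.parse import parse_qs, urlsplit

# URL from the question; everything after '#' is the fragment
url = "https://translate.google.com/#view=home&op=translate&sl=en&tl=fa&text=space"

parts = urlsplit(url)

# What an HTTP client actually asks the server for: scheme, host, path, query
sent_to_server = f"{parts.scheme}://{parts.netloc}{parts.path}"
if parts.query:
    sent_to_server += "?" + parts.query

# The fragment stays on the client; here it happens to look like a query string
client_side_settings = parse_qs(parts.fragment)

print("requested from server:", sent_to_server)
print("kept on the client:", client_side_settings)
```

Since the source and target languages and the text never reach the server in this URL form, the translation can only show up after a browser (or a headless one, as above) runs the page's scripts.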

The same approach using asynchronous programming with pyppeteer:

import asyncio
import pyppeteer
from bs4 import BeautifulSoup as BS
from urllib.parse import quote

URL = "https://translate.google.com/#view=home&op=translate&sl=en&tl=hy&text={}"

async def main() -> None:
    text = input("text to translate: ")
    browser = await pyppeteer.launch(headless=True)
    try:
        page = await browser.newPage()
        # Percent-encode the input so spaces and symbols do not break the URL
        await page.goto(URL.format(quote(text)))
        # page.content() returns the current DOM, including changes made by JavaScript
        html = await page.content()
        soup = BS(html, "html.parser")
        translation = soup.find("span", "translation").text
        print(translation)
    finally:
        # Close the browser even if parsing fails
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
Artyom Vancyan
  • Thanks for your answer, it works! But can you explain the first code? I have no idea how it works. I tried to find other elements on the web page, but I couldn't. –  Jun 20 '20 at 14:58
  • First, I get the HTML content, then I render the JavaScript, because Google Translate uses JS for its dynamic frontend. After that, I get the full content of the page and parse it. – Artyom Vancyan Jun 20 '20 at 17:00