
I am currently working through the web scraping chapter of Automate the Boring Stuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.

I inspected the HTML of a page where 'Explanation' is the translated word:

<span id="result_box" class="short_text" lang="en">  
    <span class>Explanation</span>
</span>

Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:

soup.select('span[id="result_box"] > span')  
soup.select('span span') 

I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.

Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4 but I think I am using BeautifulSoup more or less correctly because the selector

soup.select('span[id="result_box"]')

gets me the outer span element**

[<span class="short_text" id="result_box"></span>]

**Not sure why the 'lang="en"' part is missing, but I am fairly certain I have located the correct element regardless.

Here is the complete code:

import bs4, requests

url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()  # must be called; without the parentheses no check is performed
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)

EDIT: If I save the Google Translate page as an offline HTML file and build the soup object from that file, the element is found without any problem.

import bs4

# Use a context manager so the file is closed once it has been parsed
with open("Google Translate.html") as file:
    soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)
Ken Lin

3 Answers


The result_box span is the correct element, but your code only works on the saved page because the file your browser saves includes the dynamically generated content. With requests you get only the raw page source, without anything that is generated afterwards. The translation is produced by an AJAX call to the URL below:

"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"

For your request, it returns:

[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]

So you will either have to mimic that request, passing all the necessary parameters, or use something that supports dynamic content, such as Selenium.
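To illustrate the first option a little: the response shown above is not strict JSON, because empty array slots are simply left blank. Here is a minimal sketch that fills those slots with null so the standard json module can read it. It makes no network request and works only on the sample pasted above. The helper name fill_empty_slots and the index positions used to pick out the translation are assumptions inferred from this one sample, not a documented format.

```python
import json

# Sample response copied from the answer above (no network call is made).
SAMPLE = '[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]'


def fill_empty_slots(text):
    """Insert null into blank array slots, leaving string literals untouched.

    Hypothetical helper: not part of requests, bs4, or any Google API.
    """
    out = []
    in_string = False
    escaped = False
    prev = None  # last non-whitespace character seen outside a string
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
                prev = '"'
            continue
        if ch == '"':
            in_string = True
        elif ch == ',' and prev in ('[', ','):
            out.append('null')  # blank slot before this comma
        elif ch == ']' and prev == ',':
            out.append('null')  # blank slot at the end of an array
        out.append(ch)
        if not ch.isspace():
            prev = ch
    return ''.join(out)


data = json.loads(fill_empty_slots(SAMPLE))

# Assumed positions, based only on the sample above.
translation = data[0][0][0]
alternatives = [alt[0] for alt in data[5][0][2]]

print(translation)
print(alternatives)
```

The characters are walked one at a time rather than patched with a regular expression so that commas inside quoted strings are never touched. If the response layout differs from the sample, the index lookups will need adjusting.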

Padraic Cunningham
  • Thank you for the answer. I think I will rewrite the code using selenium instead. I have a question for you though: how did you know the content was dynamically generated? – Ken Lin Jul 19 '16 at 09:27
  • @ken, the simplest way is to right click and choose view source, what you see there is what requests is going to retrieve. – Padraic Cunningham Jul 19 '16 at 09:33
  • I see, that will be very useful when I use requests in the future. A small question: does the dynamically generated content still count as HTML content? – Ken Lin Jul 20 '16 at 01:54
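
The difference described in this answer and its comments can be reproduced offline. The sketch below runs the question's selector against two small HTML strings: one resembling what requests receives (the empty span the asker printed) and one resembling what the browser shows after the page's scripts have run. Both strings and the helper name first_translation are made up for illustration, based on the snippets in the question.

```python
import bs4

# Roughly what requests receives: the result span is present but empty.
STATIC_HTML = '<span id="result_box" class="short_text"></span>'

# Roughly what the browser shows after the page's JavaScript has run.
RENDERED_HTML = (
    '<span id="result_box" class="short_text" lang="en">'
    '<span class>Explanation</span>'
    '</span>'
)


def first_translation(html):
    """Return the text of the first inner span, or None if there is none."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    matches = soup.select('#result_box span')
    return matches[0].get_text() if matches else None


print(first_translation(STATIC_HTML))
print(first_translation(RENDERED_HTML))
```

Returning None for the empty case avoids the IndexError that indexing an empty result list with [0] would raise.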

Simply try this:

translation = soup.select('#result_box span')[0].text
print(translation)
akash karothiya
  • Hi, using that gives me the error "IndexError: list index out of range", which I believe is because there was no result in translation and thus translation[0] is not valid. – Ken Lin Jul 19 '16 at 07:32
  • remove "html.parser" and then try – akash karothiya Jul 19 '16 at 07:42
  • Hi, that was used to remove a UserWarning from bs4. Regardless, I removed it but the same error showed up. I also edited my sample code to a complete, shrunken version for other users to try debugging, so if you could fix that code and verify it on your own computer that would be great! – Ken Lin Jul 19 '16 at 07:50
  • Check the value of the soup object to see whether your HTML string is being converted into a tree structure – akash karothiya Jul 19 '16 at 07:55
  • Can you clarify, in simple terms for a beginner like me, what a tree structure looks like? Also, I realized that if I save the page as an HTML file, open that file within Python, and make a soup object out of it, your code works just fine. I have added the code I used for the offline HTML file, please take a look. – Ken Lin Jul 19 '16 at 08:23

You can try this different approach:

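import os
from bs4 import BeautifulSoup

# filename, extension_file, files_from_folder and recursively_translate are
# not defined in this fragment; presumably they come from the complete code.
if filename.endswith(extension_file):
    with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
        soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
        for title in soup.find_all('title'):
            recursively_translate(title)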

For the complete code, see here:

https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html

or here:

https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html

Just Me