Navigating the html tree with BeautifulSoup and/or Selenium

Question

I've just started using BeautifulSoup and came across an obstacle at the very beginning. I looked up similar posts but didn't find a solution to my specific problem, or there is something fundamental I’m not understanding. My goal is to extract Japanese words with their English translations and examples from this page.

https://iknow.jp/courses/566921

and save them in a DataFrame or a CSV file.

I am able to see the parsed output and the content of some tags, but whenever I try requesting something with a class I'm interested in, I get no results. First I’d like to get a list of the Japanese words, and I thought I should be able to do it with:

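import urllib.request  # "import urllib" alone does not load the request submodule
from bs4 import BeautifulSoup

url = ["https://iknow.jp/courses/566921"]
data = []
for pg in url:
    r = urllib.request.urlopen(pg)
    # Parse each downloaded page inside the loop
    soup = BeautifulSoup(r, "html.parser")
    soup.find_all("a", {"class": "cue"})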

But I get nothing. The same happens when I search for the response field:

responseList = soup.findAll('p', attrs={ "class" : "response"})
for word in responseList:
    print(word)

I tried moving down the tree by finding children but couldn’t get to the text I want. I will be grateful for your help. Here are the fields I'm trying to extract:

After great help from jxpython, I've now stumbled upon a new challenge (perhaps this should be a new thread, but it's closely related, so maybe it's OK here). My goal is to create a DataFrame or a CSV file in which each row contains a Japanese word, its translation, and examples with transliterations. With the lists created using:

driver.find_elements_by_class_name()
driver.find_elements_by_xpath()

I get lists with different numbers of elements, so it's not possible to create a DataFrame easily.

# len(cues)             100
# len(responses)        100
# len(transliterations) 279  (a strange number, because some words don't have transliterations)
# len(texts)            200
# len(translations)     200

The transliterations list contains a mix of transliterations for single words and for sentences. I think that, to get the content for the first row of my DataFrame, I would need to loop through the

<li class="item">

content (xpath? #/html/body/div2/div/div/section/div/section/div/div/ul/li1) and, for each one, extract the word with its translation, sentences and transliterations. I'm not sure whether this would be the best approach, though.

As an example, the information I would like to have in the first row of my DataFrame (from the box highlighted in the screenshot) is:

行く, いく, go, 日曜日は図書館に行きます。, にちようびはとしょかんにいきます。, I go to the library on Sundays.,私は夏休みにプールに行った。, わたしはなつやすみにプールにいった。, I went to the pool during summer vacation.
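A minimal sketch of how that row layout could be produced once the fields are grouped per word: each item is flattened into nine columns, and anything missing is padded with an empty string so every row has the same width. The column labels, the `item_to_row` helper and the dictionary keys are hypothetical names chosen for this illustration; the second item is made up purely to show the padding.

```python
import csv
import io

# Hypothetical column labels; the order follows the example row above.
COLUMNS = [
    "word", "word_reading", "meaning",
    "sentence_1", "sentence_1_reading", "sentence_1_translation",
    "sentence_2", "sentence_2_reading", "sentence_2_translation",
]


def item_to_row(item, max_sentences=2):
    """Flatten one item into a fixed-width row, using '' for missing fields."""
    row = [item["cue"], item.get("transliteration", ""), item["response"]]
    sentences = item.get("sentences", [])[:max_sentences]
    for sentence in sentences:
        row += [
            sentence["text"],
            sentence.get("transliteration", ""),
            sentence["translation"],
        ]
    # Pad short items so every row has the same number of columns.
    row += [""] * (3 * (max_sentences - len(sentences)))
    return row


items = [
    {   # Values copied from the example row in the question.
        "cue": "行く", "transliteration": "いく", "response": "go",
        "sentences": [
            {"text": "日曜日は図書館に行きます。",
             "transliteration": "にちようびはとしょかんにいきます。",
             "translation": "I go to the library on Sundays."},
            {"text": "私は夏休みにプールに行った。",
             "transliteration": "わたしはなつやすみにプールにいった。",
             "translation": "I went to the pool during summer vacation."},
        ],
    },
    {   # Made-up item with no transliteration and no sentences, to show padding.
        "cue": "プール", "response": "pool", "sentences": [],
    },
]

rows = [item_to_row(item) for item in items]

# Write the header and rows as CSV text (in memory here, instead of a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(rows)
print(buffer.getvalue())
```

If pandas is preferred over CSV, `pandas.DataFrame(rows, columns=COLUMNS)` should give the same table as a DataFrame.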

I can't test my code because I'm at work so I won't post an answer but from judging by what you've written you need to access the text attribute of the Beautiful Soup object. So for that last for loop instead of saying `print(word)` try `print(word.text)`. This link goes into more detail about it https://stackoverflow.com/questions/23380171/using-beautifulsoup-extract-text-without-tags — Matthew Barlowe, Sep 11 '18 at 21:21

teller.py3 · Accepted Answer · 2018-09-13T13:53:35.357

The tags you are trying to scrape are not in the page source, probably because the page is rendered with JavaScript. Open this URL to see for yourself:

view-source:https://iknow.jp/courses/566921
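The same check can be scripted. The sketch below is a dependency-free illustration that counts how many tags in a raw HTML string carry a given class. `ClassCounter`, `count_class` and both sample strings are hypothetical and are not taken from iknow.jp. A count of zero for `cue` in the downloaded source would suggest that those elements are added later by JavaScript.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many tags in the raw HTML string carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical samples: one static page, one shell that a script fills in later.
static_html = '<ul><li><a class="cue">行く</a></li></ul>'
js_shell_html = '<div id="app"></div><script>/* renders .cue later */</script>'

print(count_class(static_html, "cue"))    # 1
print(count_class(js_shell_html, "cue"))  # 0
```

To run it against the live page, the string could come from `urllib.request.urlopen(url).read().decode("utf-8")`, assuming the response is UTF-8 encoded.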

The Python module Selenium solves this problem. If you would like I could write some code for you to start on.

Here is some code to start on:

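from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://iknow.jp/courses/566921'
driver = webdriver.Chrome()
# The implicit wait makes element lookups retry for up to 2 seconds
driver.implicitly_wait(2)
driver.get(url)

# Recent Selenium 4 releases removed find_elements_by_class_name() and
# find_elements_by_xpath(); find_elements(By.<strategy>, value) replaces them.
cues = driver.find_elements(By.CLASS_NAME, 'cue')
cues = [cue.text for cue in cues]

responses = driver.find_elements(By.CLASS_NAME, 'response')
responses = [response.text for response in responses]

# First <p> inside each element whose class is exactly "sentence-text"
texts = driver.find_elements(By.XPATH, '//*[@class="sentence-text"]/p[1]')
texts = [text.text for text in texts]

transliterations = driver.find_elements(By.CLASS_NAME, 'transliteration')
transliterations = [transliteration.text for transliteration in transliterations]

translations = driver.find_elements(By.CLASS_NAME, 'translation')
translations = [translation.text for translation in translations]

# quit() closes the browser and also stops the driver process
driver.quit()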

Note: You first need to install a webdriver. I chose Chrome. Here is a link: https://chromedriver.storage.googleapis.com/index.html?path=2.41/. Also add the driver's location to your PATH! If you have any other questions, let me know!
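For the follow-up about lists of different lengths, one possible approach is to search inside each `<li class="item">` element rather than across the whole page, so that a missing transliteration leaves a gap in that item only and does not shift every later entry. The sketch below assumes each item holds one `cue`, one `response`, and any number of `transliteration` and `translation` elements. That layout is a guess based on the class names in the question and has not been checked against the page. `first_text`, `all_texts` and `collect_items` are hypothetical helpers. The locator strings `"css selector"` and `"class name"` are the values of Selenium's `By.CSS_SELECTOR` and `By.CLASS_NAME`, written out so the block needs no import.

```python
def first_text(element, class_name):
    """Text of the first descendant with class_name, or '' if there is none."""
    found = element.find_elements("class name", class_name)
    return found[0].text if found else ""


def all_texts(element, class_name):
    """Texts of all descendants with class_name (possibly an empty list)."""
    return [e.text for e in element.find_elements("class name", class_name)]


def collect_items(driver):
    """Return one dict per <li class="item">, keeping each word's fields together."""
    rows = []
    for item in driver.find_elements("css selector", "li.item"):
        # Lookups are scoped to this item, so a missing field stays local to it.
        rows.append({
            "cue": first_text(item, "cue"),
            "response": first_text(item, "response"),
            "transliterations": all_texts(item, "transliteration"),
            "translations": all_texts(item, "translation"),
        })
    return rows
```

With a live session this would be called as `items = collect_items(driver)` before the driver is closed. Telling word-level transliterations apart from sentence-level ones would still depend on how the page nests them, which the sketch does not try to guess.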

Thank you, I'm just starting and it didn't occur to me I would need to treat html and javascript differently and first check what I'm dealing with. You are right, I don't see the text I need in View Source, only when I inspect element. I'm going to read up on Selenium and try to use it, but if you have a spare moment to get me started that would be really helpful. — user3722736, Sep 12 '18 at 12:15
